Closed — samanthvishwas closed this issue 1 month ago
@samanthvishwas g4dn.12xlarge has 4 GPUs, and DJLServing will automatically expand across all of them. That is why you see 2 workers (each worker uses 2 GPUs).
If you only want to load one copy of the model, you can set the following in serving.properties:

```properties
load_on_devices=0
```
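A minimal serving.properties sketch combining the two settings discussed in this thread. The `engine` line is illustrative and follows the linked GPT-J notebook; the exact engine value for your deployment may differ:

```properties
# Illustrative engine choice (as in the linked GPT-J example); verify for your setup
engine=DeepSpeed
# Shard each model copy across 2 GPUs
option.tensor_parallel_degree=2
# Restrict loading to device 0 so only one copy of the model is created
load_on_devices=0
```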
@samanthvishwas Closing the issue. Please open a new one if you have any more questions.
I am following the code from the AWS documentation for hosting GPT-J-6B with DJL Serving:
https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb. Setting a tensor parallelism value of 2 in serving.properties creates 2 copies of the model rather than partitioning the model layers across two GPUs. This happens regardless of whether I use a smaller or larger model.
Instance used : ml.g4dn.12xlarge
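The behavior described in the maintainer's reply can be sketched as simple arithmetic. This is an assumption about the default sizing rule (worker count = visible GPUs divided by the tensor parallel degree), inferred from the answer above, not DJLServing's actual implementation:

```python
def default_worker_count(num_gpus: int, tensor_parallel_degree: int) -> int:
    """Assumed default: each worker holds one model copy sharded
    across tensor_parallel_degree GPUs, so the server starts
    num_gpus // tensor_parallel_degree workers."""
    return num_gpus // tensor_parallel_degree


# ml.g4dn.12xlarge exposes 4 GPUs; with tensor_parallel_degree=2 this
# yields 2 workers, i.e. 2 model copies, each sharded across 2 GPUs.
print(default_worker_count(4, 2))
```

So the 2 copies are not a partitioning failure: tensor parallelism is applied within each worker, and the remaining GPUs are filled with additional workers unless `load_on_devices` restricts them.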