Closed — samanthvishwas closed this issue 1 month ago
@samanthvishwas g4dn.12xlarge has 4 GPUs, and DJLServing will automatically expand across all of them. That is why you see 2 workers (each worker uses 2 GPUs).
If you only want to load one copy of the model, you can set the following in serving.properties:

```properties
load_on_devices=0
```
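A minimal serving.properties sketch combining the two settings discussed in this thread. The `engine` line is illustrative and follows the linked GPT-J notebook; the exact engine value for your deployment may differ:

```properties
# Illustrative engine choice (as in the linked GPT-J example); verify for your setup
engine=DeepSpeed
# Shard each model copy across 2 GPUs
option.tensor_parallel_degree=2
# Restrict loading to device 0 so only one copy of the model is created
load_on_devices=0
```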
@samanthvishwas Closing the issue. Please open a new one if you have any more questions.
I am following the code from the AWS documentation for hosting GPT-J-6B with DJL Serving:
https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb. Setting a tensor parallelism value of 2 in serving.properties creates 2 copies of the model rather than partitioning the model layers across two GPUs. This happens regardless of whether I use a smaller or larger model.
Instance used : ml.g4dn.12xlarge
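The behavior described in the maintainer's reply can be sketched as simple arithmetic. This is an assumption about the default sizing rule (worker count = visible GPUs divided by the tensor parallel degree), inferred from the answer above, not DJLServing's actual implementation:

```python
def default_worker_count(num_gpus: int, tensor_parallel_degree: int) -> int:
    """Assumed default: each worker holds one model copy sharded
    across tensor_parallel_degree GPUs, so the server starts
    num_gpus // tensor_parallel_degree workers."""
    return num_gpus // tensor_parallel_degree


# ml.g4dn.12xlarge exposes 4 GPUs; with tensor_parallel_degree=2 this
# yields 2 workers, i.e. 2 model copies, each sharded across 2 GPUs.
print(default_worker_count(4, 2))
```

So the 2 copies are not a partitioning failure: tensor parallelism is applied within each worker, and the remaining GPUs are filled with additional workers unless `load_on_devices` restricts them.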