sonic182 opened this issue 5 months ago
Some progress from my side: I discovered `torchrun` and `init_process_group("xla")`.
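A minimal sketch of what that combination looks like in practice. Everything here is illustrative, not taken from the thread: the helper name is made up, and the `backend` parameter is only there so the same code can be smoke-tested with `gloo` on a CPU-only machine (on inf2/trn1 the thread's `"xla"` backend, provided by torch_xla, is the one to pass).

```python
import torch.distributed as dist

def init_group(backend: str = "xla"):
    """Join the default process group (hypothetical helper).

    When a script is launched via `torchrun --nproc_per_node=N script.py`,
    torchrun exports RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, and
    init_process_group picks them up through the default env:// method,
    so no explicit rank/world_size arguments are needed here.
    """
    if not dist.is_initialized():
        dist.init_process_group(backend=backend)
    return dist.get_rank(), dist.get_world_size()
```

Each process launched by torchrun then sees its own rank from `init_group()`.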
Hi sonic182,
Please let us know if you still see issues.
For background,
If you don't plan to use tensor parallelism, you can follow this notebook: https://github.com/aws-neuron/aws-neuron-samples-staging/blob/master/torch-neuronx/inference/hf_pretrained_clip_base_inference_on_inf2.ipynb, where openai/clip-vit-base-patch32 is readily supported. You can extend your model from there.
Your current model uses ColumnParallelLinear and RowParallelLinear. These layers need a process group because they implement tensor parallelism: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tp_developer_guide.html?highlight=ColumnParallelLinear. An inf2 instance has multiple NeuronCores (https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/inf2-arch.html?highlight=inf2#inf2-architecture), which can be used to shard the model across devices and improve inference performance.
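To see why those layers need a process group, here is a CPU-only sketch of the math behind column-parallel sharding: each rank holds one column shard of the weight, computes a partial output, and the partial outputs are joined along the feature dimension (an all-gather collective in the real distributed run, which is what the process group provides). The function name is illustrative, not the neuronx-distributed API:

```python
import torch

def column_parallel_matmul(x, weight_shards):
    # Each "rank" holds a column shard of the weight and computes its
    # slice of the output; torch.cat stands in for the all-gather that
    # a real tensor-parallel runtime performs across devices.
    return torch.cat([x @ w for w in weight_shards], dim=-1)

# Sharding columns is mathematically equivalent to the full matmul:
w = torch.randn(8, 6)
x = torch.randn(2, 8)
shards = torch.chunk(w, 2, dim=1)  # two "ranks", 3 output features each
assert torch.allclose(x @ w, column_parallel_matmul(x, shards), atol=1e-5)
```

RowParallelLinear is the dual construction: the weight is split along rows, each rank computes a partial sum, and an all-reduce adds the partials, which is why it too needs the process group.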
Hi, I'm trying to make a CLIP model compatible with neuronx-distributed (because I'm going to continue with a multimodal model afterwards).
Currently, in my notebook on an inf2.xlarge (Ubuntu 22), I have:
Then, when I try to load the pretrained clip with:
I'm getting this error:
The thing is... do I need a distributed process group for inference?
And if so, how can I start it on an inf2 or trn* instance? I'm a bit of a newbie with torch.distributed.
Environment: