Lightning-AI / pytorch-lightning


Gathering a list of strings from multiple devices using Fabric #20016

Open Haran71 opened 1 week ago

Haran71 commented 1 week ago

Bug description

I have a list of strings on each device during multi-GPU evaluation. I want to collect them from all devices into a single list that is available on every device.

m_preds = fabric.all_gather(all_preds)
m_gt = fabric.all_gather(all_gt)

When I run the above code (all_preds and all_gt are lists of strings), m_preds and m_gt come back as the same lists as all_preds and all_gt on their respective devices. Am I doing something wrong?

What version are you seeing the problem on?

v2.2

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment

```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
```

More info

No response

cc @borda

awaelchli commented 1 week ago

Hey @Haran71

The documentation states:

    Gather tensors or collections of tensors from multiple processes.

    This method needs to be called on all processes and the tensors need to have the same shape across all
    processes, otherwise your program will stall forever.

    Args:
        data: int, float, tensor of shape (batch, ...), or a (possibly nested) collection thereof.
        group: the process group to gather results from. Defaults to all processes (world).
        sync_grads: flag that allows users to synchronize gradients for the ``all_gather`` operation

    Return:
        A tensor of shape (world_size, batch, ...), or if the input was a collection
        the output will also be a collection with tensors of this shape. For the special case where
        world_size is 1, no additional dimension is added to the tensor(s).

It does not mention anywhere that strings are supported; the documentation clearly states this is meant for tensors. The reason there is no error is that we want to support (possibly nested) collections such as dictionaries, so non-tensor entries like strings are simply returned unchanged. Perhaps the documentation could mention that explicitly.
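
For illustration, a minimal sketch of the documented behavior, assuming a 2-process run (the variable names and values below are hypothetical): tensors and collections of tensors are gathered along a new world_size dimension, while a list of strings is returned unchanged, which matches what you observed.

```python
import torch
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cuda", devices=2)
fabric.launch()

# A tensor of shape (batch,) on each rank ...
local = torch.tensor([1.0, 2.0], device=fabric.device)

# ... is gathered into shape (world_size, batch) == (2, 2).
gathered = fabric.all_gather(local)

# Collections of tensors work too: every tensor leaf is gathered.
gathered_dict = fabric.all_gather({"scores": local})

# A list of strings is not a tensor collection, so it comes back as-is
# (this is the behavior reported above).
unchanged = fabric.all_gather(["cat", "dog"])
```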

If you have predictions you'd like to all-gather, I suggest keeping them as numbers/tensors, gathering them, and then converting them back to strings at the end.
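
A minimal sketch of that workaround, assuming every rank agrees on the same vocabulary of possible strings and produces the same number of predictions (the `vocab`, `all_preds`, and other names below are hypothetical placeholders):

```python
# Encode strings as integer indices, all-gather the indices as a tensor,
# then decode back to strings on every rank.
import torch
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cuda", devices=2)
fabric.launch()

vocab = ["cat", "dog", "bird"]  # shared, identical on every rank
str_to_idx = {s: i for i, s in enumerate(vocab)}

all_preds = ["cat", "dog"]  # local string predictions on this rank

# Encode to a tensor on the current device; shapes must match across ranks.
pred_idx = torch.tensor([str_to_idx[s] for s in all_preds], device=fabric.device)

# Gather: result has shape (world_size, batch).
gathered = fabric.all_gather(pred_idx)

# Decode back to a single flat list of strings, identical on every rank.
m_preds = [vocab[i] for i in gathered.flatten().tolist()]
```

If the number of predictions differs across ranks, pad to a common length first, since `all_gather` requires the same shape on every process.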