UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Is Multi-GPU loading the whole data and model in parallel in V3? #2692

Open karan842 opened 1 month ago

karan842 commented 1 month ago

I am working with sentence-transformers V3 and multi-GPU, launching training with this command:

```bash
accelerate launch --multi-gpu --num_processes=4 main.py
```

Here is the code:

```python
from torch.utils.data import DataLoader

from accelerate import Accelerator
from sentence_transformers import SentenceTransformer, models, losses


def main():
    # Define the Accelerator
    accelerator = Accelerator(mixed_precision='fp16')
    print(f"Using GPU: {accelerator.num_processes}")

    # Existing language model
    word_embedding_model = models.Transformer('mixedbread-ai/mxbai-embed-large-v1')

    # Pooling function over the token embeddings
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

    # Join step 1 and step 2 using the modules argument
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

    # Define the training loss
    train_loss = losses.AnglELoss(model=model)

    # load_json_data, raw_data and collate_fn are helpers defined elsewhere in my project
    dataset = load_json_data(raw_data)
    train_dataloader = DataLoader(dataset, batch_size=1, num_workers=8, collate_fn=collate_fn)

    num_epochs = 30

    model.fit(train_objectives=[(train_dataloader, train_loss)],
              epochs=num_epochs,
              show_progress_bar=True)


if __name__ == '__main__':
    main()
```

The code runs fine and multi-GPU training is enabled.

My doubt is that when I ran the code, it printed "Using GPU: 4" four times, and the data was also loaded four times.

Does that mean the same model and the whole dataset have been loaded onto all 4 GPUs simultaneously? What is the point of loading everything onto all 4 GPUs if I just want the training itself to run in parallel across them?

Please explain.

tomaarsen commented 1 month ago

Hello!

I have 2 comments:

  1. "Using GPU: 4" is to be expected, as there are still 4 GPUs in total. accelerator.num_processes is equivalent to os.environ["WORLD_SIZE"], i.e. how many GPUs there are in total. You're likely interested in LOCAL_RANK instead, which will be 0, 1, 2, and 3 for each of the 4 processes respectively (https://pytorch.org/docs/stable/elastic/run.html#definitions); see the sketch after this list.
  2. You don't have to instantiate the Accelerator yourself; that will be done for you automatically when you train.
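
For reference, here is a rough sketch of the difference between the two values. This is just standard Accelerate/torchrun behaviour, nothing SentenceTransformers-specific, and the environment variables are the ones set by the launcher:

```python
import os

from accelerate import Accelerator

accelerator = Accelerator()

# Same value on every process: the total number of processes/GPUs (WORLD_SIZE).
print(f"num_processes: {accelerator.num_processes}")

# Different on every process: this process' index, 0 through num_processes - 1.
print(f"process_index: {accelerator.process_index}")              # global rank
print(f"local_process_index: {accelerator.local_process_index}")  # rank on this node

# The launcher also exposes the same information as environment variables.
print(os.environ.get("WORLD_SIZE"), os.environ.get("RANK"), os.environ.get("LOCAL_RANK"))

# Handy if you only want to print or log once, rather than once per process:
if accelerator.is_main_process:
    print("Only the main process prints this.")
```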

And to my knowledge it's normal to load all of the data on all processes; the Accelerator inside the SentenceTransformerTrainer that gets created internally then tells each process which batches from the full dataset to actually use.
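
To make that concrete, here is a toy sketch of the sharding behaviour using plain Accelerate (not the actual trainer internals, and the dataset/batch sizes are just illustrative): every process builds a dataloader over the full dataset, and accelerator.prepare() makes each process iterate over only its own slice of the batches.

```python
from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()

# Toy dataset: every process "loads" all 16 samples.
dataset = list(range(16))
dataloader = DataLoader(dataset, batch_size=2)

# After prepare(), each process only receives its own shard of the batches,
# so with 4 processes each one sees 2 of the 8 batches.
dataloader = accelerator.prepare(dataloader)

for batch in dataloader:
    print(f"process {accelerator.process_index} -> {batch.tolist()}")
```

So the memory cost of holding the raw dataset is paid on every process, but each GPU only runs the forward/backward passes for its own batches.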