UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Can't train a 7B-sized model on 80GB A100 #2802

Open orionw opened 3 months ago

orionw commented 3 months ago

Out of curiosity (and for a comparison to Tevatron) I tried running the MS MARCO MNRL (MultipleNegativesRankingLoss) example with the model swapped out for intfloat/e5-mistral-7b-instruct or meta-llama/Llama-2-7b-hf. Neither model could train due to OOM, even with a batch size of 1. This is with the default use_amp=True setting as well.
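For reference, a minimal sketch of the kind of setup involved (the tiny in-memory dataset below is a stand-in for the actual MS MARCO triplets, and the training arguments are illustrative rather than the exact example script):

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# 7B decoder used as an embedding model; the weights alone are ~14GB in FP16.
model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

# Stand-in for the MS MARCO (query, positive, negative) triplets.
train_dataset = Dataset.from_dict({
    "anchor": ["what is a corgi?"],
    "positive": ["The corgi is a small herding dog breed from Wales."],
    "negative": ["Paris is the capital of France."],
})

loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="out-e5-mistral-msmarco",
    per_device_train_batch_size=1,   # still OOMs on a single 80GB A100
    fp16=True,                       # mixed precision, analogous to use_amp=True
    num_train_epochs=1,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```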

I saw the same behavior with a custom v3 SentenceTransformer training script. However, it seems like it should be possible to train them on an 80GB machine, since GritLM does. Perhaps I'm wrong though (cc @Muennighoff).

The HF memory estimator also says it should fit in about 50GB in FP16, but perhaps that estimate doesn't hold for triplet-style losses?
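For what it's worth, my own rough back-of-envelope (assuming full fine-tuning with AdamW under standard mixed precision, no gradient checkpointing or parameter-efficient methods; the exact bookkeeping varies by implementation) suggests gradient and optimizer state alone blow past 80GB, which would explain why an FP16 estimate of ~50GB doesn't transfer:

```python
# Rough per-parameter costs for full fine-tuning a 7B model with AdamW under
# mixed precision. Activations, CUDA context, and fragmentation come on top.
params = 7e9
bytes_per_gib = 1024**3

fp32_weights = params * 4   # master copy kept in FP32 by the AMP optimizer
fp16_weights = params * 2   # half-precision copy used for forward/backward
grads        = params * 4   # gradients (FP32 after unscaling)
adam_moments = params * 8   # AdamW exp_avg + exp_avg_sq, both FP32

total_gib = (fp32_weights + fp16_weights + grads + adam_moments) / bytes_per_gib
print(f"~{total_gib:.0f} GiB before activations")  # prints "~117 GiB"
```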

shuttie commented 3 months ago

My suggestion is to play with:

orionw commented 3 months ago

Thanks @shuttie! Unfortunately the example script already uses a fairly short seq_len of 300, and as far as I know gradient checkpointing can't reduce VRAM below what a batch size of 1 requires -- the model still has to fit a backward pass for at least one instance. A sketch of those two knobs is below for reference.
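For completeness, this is how I understand those knobs map onto the v3 API (whether the gradient checkpointing flag actually propagates through the SentenceTransformer wrapper is exactly the kind of thing I'd want to verify, so treat this as a sketch):

```python
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainingArguments

model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

# Cap the tokenized input length; the example script already uses 300.
model.max_seq_length = 300

args = SentenceTransformerTrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    fp16=True,
    # Inherited from transformers.TrainingArguments; trades compute for
    # activation memory by recomputing activations during the backward pass.
    gradient_checkpointing=True,
)

# If the flag doesn't reach the wrapped HF model, it can also be enabled
# directly on the underlying transformers model inside the first module:
model[0].auto_model.gradient_checkpointing_enable()
```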

Curious what @tomaarsen thinks and if he's looked at 7B models in v3.

tomaarsen commented 3 months ago

I haven't. Scheduling a job now to experiment & try to reproduce on another 80GB VRAM device.