orionw opened this issue 5 months ago
> My suggestion is to play with `seq_len` - it affects VRAM usage, but on bs=1 I guess not that much.

Thanks @shuttie! Unfortunately the example script already has a pretty short seq_len of 300, and as far as I know gradient checkpointing can't reduce VRAM usage below what a batch size of 1 requires -- the model still has to be able to do a backward pass on at least one instance.
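For reference, here's roughly what those knobs look like in a v3-style setup -- a minimal sketch assuming the `SentenceTransformerTrainingArguments` API, with placeholder values rather than the example script's exact config:

```python
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainingArguments

# Placeholder model; the runs discussed here used 7B checkpoints.
model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

# Cap the sequence length -- shorter sequences shrink activation memory,
# though at batch size 1 the savings are limited.
model.max_seq_length = 300

args = SentenceTransformerTrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,
    fp16=True,                    # mixed precision, roughly the v3 counterpart of use_amp=True
    gradient_checkpointing=True,  # trades recompute for activation memory
)
```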
Curious what @tomaarsen thinks and if he's looked at 7B models in v3.
I haven't. Scheduling a job now to experiment & try to reproduce on another 80GB VRAM device.
> Out of curiosity (and for a comparison to Tevatron) I tried running the MSMarco MNRL example and replaced the model with `intfloat/e5-mistral-7b-instruct` or `meta-llama/Llama-2-7b-hf`. Neither model could train due to OOM, even with a batch size of 1. This is with the default `use_amp=True` setting as well.
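For concreteness, that setup boils down to roughly the following -- a sketch assuming the v2-style `model.fit()` API, with a toy triplet standing in for the real MS MARCO data and the dataset/evaluator code omitted:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Swap the example's base model for one of the 7B checkpoints tried above.
model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

# Toy data standing in for the MS MARCO (query, positive, negative) triplets.
train_examples = [InputExample(texts=["a query", "a relevant passage", "an irrelevant passage"])]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1)

train_loss = losses.MultipleNegativesRankingLoss(model)

# use_amp=True is the example's default; even so, the 7B models OOM at batch size 1.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    use_amp=True,
)
```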
This was the same for me with a custom v3 SentenceTransformer training script. However, it seems like it should be possible to train them on an 80GB machine, since GritLM does. Perhaps I'm wrong, though (cc @Muennighoff).
The HF memory estimator also says it should fit in about 50GB in FP16, but perhaps that estimate doesn't hold for triplet-style losses?
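A rough back-of-envelope (my own numbers, so treat them as assumptions rather than measurements) suggests why the 50GB figure may be optimistic: with `use_amp` / fp16 autocast, the master weights, gradients, and AdamW states still live in fp32, so the static training state of a 7B model is already near or above 80GB before counting any activations -- and triplet-style losses forward several sequences per example on top of that.

```python
# Back-of-envelope VRAM estimate for full fine-tuning of a 7B model with AdamW
# under torch AMP (use_amp=True). Rough assumptions, not measurements: AMP keeps
# master weights, gradients, and optimizer states in fp32.
params = 7e9

weights_fp32 = params * 4          # ~28 GB master weights
grads_fp32 = params * 4            # ~28 GB gradients
adam_states_fp32 = params * 4 * 2  # ~56 GB (exp_avg + exp_avg_sq)

static_gb = (weights_fp32 + grads_fp32 + adam_states_fp32) / 1e9
print(f"~{static_gb:.0f} GB of static training state before any activations")  # ~112 GB
```

If that's roughly right, the ~50GB FP16 estimate would only hold if weights and optimizer state were actually stored in fp16, which isn't what AMP does.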