orionw opened this issue 5 months ago
> My suggestion is to play with `seq_len` - it affects VRAM usage, but on bs=1 I guess not that much.

Thanks @shuttie! Unfortunately the example script already has a pretty short seq_len of 300, and as far as I know gradient checkpointing can't reduce VRAM usage below what a batch size of 1 requires -- the model still has to be able to do a backward pass on at least one instance.
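For reference, here's roughly what those knobs look like in a v3-style setup -- a minimal sketch assuming the `SentenceTransformerTrainingArguments` API, with placeholder values rather than the example script's exact config:

```python
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainingArguments

# Placeholder model; the runs discussed here used 7B checkpoints.
model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

# Cap the sequence length -- shorter sequences shrink activation memory,
# though at batch size 1 the savings are limited.
model.max_seq_length = 300

args = SentenceTransformerTrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,
    fp16=True,                    # mixed precision, roughly the v3 counterpart of use_amp=True
    gradient_checkpointing=True,  # trades recompute for activation memory
)
```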
Curious what @tomaarsen thinks and if he's looked at 7B models in v3.
I haven't. Scheduling a job now to experiment & try to reproduce on another 80GB VRAM device.
> Out of curiosity (and for a comparison to Tevatron) I tried running the MSMarco MNRL example and replaced the model with `intfloat/e5-mistral-7b-instruct` or `meta-llama/Llama-2-7b-hf`. Neither model could train due to OOM, even with a batch size of 1. This is with the default `use_amp=True` setting as well.
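For concreteness, that setup boils down to roughly the following -- a sketch assuming the v2-style `model.fit()` API, with a toy triplet standing in for the real MS MARCO data and the dataset/evaluator code omitted:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Swap the example's base model for one of the 7B checkpoints tried above.
model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

# Toy data standing in for the MS MARCO (query, positive, negative) triplets.
train_examples = [InputExample(texts=["a query", "a relevant passage", "an irrelevant passage"])]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1)

train_loss = losses.MultipleNegativesRankingLoss(model)

# use_amp=True is the example's default; even so, the 7B models OOM at batch size 1.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    use_amp=True,
)
```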
This was the same for me with a custom v3 SentenceTransformer training script. However, it seems like it should be possible to train them on an 80GB machine, since GritLM does. Perhaps I'm wrong, though (cc @Muennighoff).
The HF memory estimator also says it should fit in about 50GB in FP16, but perhaps that estimate doesn't hold for triplet-style losses?
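A rough back-of-envelope (my own numbers, so treat them as assumptions rather than measurements) suggests why the 50GB figure may be optimistic: with `use_amp` / fp16 autocast, the master weights, gradients, and AdamW states still live in fp32, so the static training state of a 7B model is already near or above 80GB before counting any activations -- and triplet-style losses forward several sequences per example on top of that.

```python
# Back-of-envelope VRAM estimate for full fine-tuning of a 7B model with AdamW
# under torch AMP (use_amp=True). Rough assumptions, not measurements: AMP keeps
# master weights, gradients, and optimizer states in fp32.
params = 7e9

weights_fp32 = params * 4          # ~28 GB master weights
grads_fp32 = params * 4            # ~28 GB gradients
adam_states_fp32 = params * 4 * 2  # ~56 GB (exp_avg + exp_avg_sq)

static_gb = (weights_fp32 + grads_fp32 + adam_states_fp32) / 1e9
print(f"~{static_gb:.0f} GB of static training state before any activations")  # ~112 GB
```

If that's roughly right, the ~50GB FP16 estimate would only hold if weights and optimizer state were actually stored in fp16, which isn't what AMP does.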