Open aksj98 opened 1 year ago

Hi!

I saw that you used quite low chunk sizes (2-4) when training the models; may I ask why? I'm sure a GPU with 40 GB of memory can handle more. Does it give better empirical results?

Thanks!
The chunk size does not affect the empirical results, so use the highest one that works for you: the higher it is, the faster the training.

A few other factors also affect GPU memory, such as model size and sequence length. I think I was bottlenecked by one of them and hence had to go very low on the chunk size.
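For context, if the chunk size here refers to GradCache-style chunking (the trick the bi-encoder training uses to fit large contrastive batches into memory), the full batch still defines the loss while the encoder only ever processes one chunk at a time. Below is a minimal sketch of that idea, assuming a hypothetical `encoder` that returns pooled embeddings and a `loss_fn` over the full batch of embeddings; it is not the repo's actual implementation:

```python
import torch

def grad_cache_step(encoder, input_ids, attention_mask, loss_fn, chunk_size):
    """One GradCache-style training step (a sketch, not the repo's code).

    The full batch still defines the contrastive loss (important for
    in-batch negatives), but the encoder only ever holds one chunk's
    activations, so peak memory scales with chunk_size, not batch size.
    """
    n = input_ids.size(0)
    chunks = [(s, min(s + chunk_size, n)) for s in range(0, n, chunk_size)]

    # Pass 1: embed every chunk without building the autograd graph.
    with torch.no_grad():
        embs = torch.cat(
            [encoder(input_ids[s:e], attention_mask[s:e]) for s, e in chunks]
        )

    # Full-batch loss on detached embeddings; backprop stops at the
    # embeddings (cheap) and caches their gradients.
    embs = embs.detach().requires_grad_(True)
    loss = loss_fn(embs)
    loss.backward()
    cached_grads = embs.grad

    # Pass 2: re-encode each chunk with autograd and push the cached
    # gradients through the encoder one chunk at a time.
    for s, e in chunks:
        chunk_embs = encoder(input_ids[s:e], attention_mask[s:e])
        chunk_embs.backward(gradient=cached_grads[s:e])
    return loss.item()
```

This is why the chunk size only trades memory for speed: larger chunks mean fewer encoder passes per step, but the loss, and hence the result, is identical.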
Thanks Niklas! I had a quick question as well: I see you used a bunch of different LRs; which LR did you find to be the best? Did you also schedule the LRs in any way?
I didn't experiment extensively with the LRs; I think the defaults are based on SentenceTransformers. I found that adjusting the LR alongside the batch size works best. E.g. for bs=1024 I used 32e-5, so for bs=512 I'd try 16e-5, for bs=2048 I'd try 64e-5, etc. But searching over 2-3 values may be best.
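That is just linear LR scaling with batch size. As a quick worked example (the helper name is mine; the anchor values are the ones quoted above):

```python
def scaled_lr(batch_size, base_lr=32e-5, base_batch_size=1024):
    """Scale the LR linearly with batch size, anchored at
    lr=32e-5 for a batch size of 1024 (values from the comment above)."""
    return base_lr * batch_size / base_batch_size

print(scaled_lr(512))   # 0.00016 (= 16e-5)
print(scaled_lr(1024))  # 0.00032 (= 32e-5)
print(scaled_lr(2048))  # 0.00064 (= 64e-5)
```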
It automatically uses a WarmupLinear schedule; see https://github.com/Muennighoff/sgpt/blob/9728de441b1dd2e638a8a64e1c83f77716f47d9a/biencoder/nli_msmarco/sentence-transformers/sentence_transformers/SentenceTransformer.py#L616
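For reference, this is the standard `SentenceTransformer.fit` behavior: the LR ramps up linearly over `warmup_steps`, then decays linearly toward zero. A minimal usage sketch (the model name, data, and warmup value are placeholders, not the repo's actual training config):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("bert-base-uncased")  # placeholder model
train_examples = [
    InputExample(texts=["anchor sentence", "positive sentence"]),
    InputExample(texts=["another anchor", "another positive"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler="WarmupLinear",        # the default schedule
    warmup_steps=100,                # placeholder value
    optimizer_params={"lr": 32e-5},  # the bs=1024 value from above
)
```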