Muennighoff / sgpt

SGPT: GPT Sentence Embeddings for Semantic Search
https://arxiv.org/abs/2202.08904
MIT License

Why use low chunksizes? #39

Open aksj98 opened 1 year ago

aksj98 commented 1 year ago

Hi!

I saw that you used low chunk sizes (2-4) when training the models; may I know why? Surely a GPU with 40GB of memory can handle more? Does it give better empirical results?

Thanks!

Muennighoff commented 1 year ago

The chunk size does not affect empirical results. Use the highest one that works for you! The higher it is, the faster the training.

A few other factors also affect RAM usage, like model size and sequence length; I think I was bottlenecked by one of them and hence had to go very low on the chunk size.
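
To make the tradeoff concrete: chunking runs a large batch through the model a few examples at a time, so the output (and hence the loss and the in-batch negatives) is unchanged, while peak GPU memory scales with the chunk size. A minimal sketch of the forward-pass side of this idea, with illustrative names rather than the repo's actual training code (saving activation memory during backprop additionally needs a gradient-caching trick on top of this):

```python
import torch

def encode_in_chunks(model, input_ids, chunk_size):
    """Encode a large batch chunk_size examples at a time.

    The concatenated output is identical for any chunk_size; a smaller
    chunk_size only lowers peak GPU memory at the cost of speed.
    """
    embeddings = []
    for start in range(0, input_ids.size(0), chunk_size):
        embeddings.append(model(input_ids[start:start + chunk_size]))
    return torch.cat(embeddings, dim=0)
```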

aksj98 commented 1 year ago

Thanks Niklas! I had a quick question as well: I see you used a bunch of different LRs. Which LR did you find to be the best? Did you also schedule the LRs in any way?

Muennighoff commented 1 year ago

I didn't experiment extensively with the LRs - I think it's based on the SentenceTransformers defaults. I found that scaling the LR linearly with the batch size works best. E.g. for bs=1024 I used 32e-5, so if your bs=512, I'd try 16e-5; if you go for 2048, I'd try 64e-5, etc. But searching over 2-3 values may be best.
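
In code, that heuristic is just linear scaling of the LR with the batch size; a tiny helper (the name and anchor values are illustrative, taken from the numbers above):

```python
def scaled_lr(batch_size, base_bs=1024, base_lr=32e-5):
    # Linear scaling rule: keep lr / batch_size constant,
    # anchored at the bs=1024 -> lr=32e-5 setting mentioned above.
    return base_lr * batch_size / base_bs

print(scaled_lr(512))   # 0.00016 (= 16e-5)
print(scaled_lr(2048))  # 0.00064 (= 64e-5)
```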

It automatically uses a WarmupLinear schedule; see https://github.com/Muennighoff/sgpt/blob/9728de441b1dd2e638a8a64e1c83f77716f47d9a/biencoder/nli_msmarco/sentence-transformers/sentence_transformers/SentenceTransformer.py#L616
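
For reference, in the SentenceTransformers API the schedule and the warmup are plain arguments to `fit`. A minimal sketch of where they go (the model name, toy data, and `warmup_steps` here are placeholders, not SGPT's actual config):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder model and toy data, just to show where the scheduler args go.
model = SentenceTransformer('distilbert-base-uncased')
train_examples = [
    InputExample(texts=['how to bake bread', 'a simple bread recipe']),
    InputExample(texts=['capital of France', 'Paris is the capital of France']),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler='WarmupLinear',        # linear warmup, then linear decay (the default)
    warmup_steps=100,                # placeholder; tune to your dataset size
    optimizer_params={'lr': 32e-5},  # e.g. the bs=1024 value from above
)
```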