UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Cached Losses way of working #2750

Open bely66 opened 2 months ago

bely66 commented 2 months ago

Hi Everyone,

Thank you for your continuous efforts, I'm really enjoying the new release.

One concept that confuses me is CachedGISTEmbedLoss and CachedMultipleNegativesRankingLoss.

From what I understand, these let me increase the batch size beyond what my GPU memory can handle on its own.

What happens is:

  1. When I increase the batch size in the trainer, I still get an out-of-memory error.
  2. When I increase the mini-batch size in the loss, it works well.

Which is a bit confusing. Maybe there's a reference I missed, but from the name, the mini-batch should already be a split of the original batch. So if I set the batch size to 16 in the trainer, how can the mini-batch size be 1024 and everything still work?

I thought it would override the original batch size in the trainer, but honestly I haven't gotten that far into the code.

Regards

tomaarsen commented 2 months ago

Hello!

Makes sense, the Cached losses can be a bit confusing. There are indeed two batch sizes now:

  1. the Training Arguments batch size
  2. the mini-batch size in the loss

Normally, the Training Arguments batch size will be larger than the mini-batch size. The mini-batch size also acts as an upper limit: if you set the Training Arguments batch size to 16 and the mini-batch size to 1024, then the real mini-batch size fed to your GPU (or CPU) is just 16.

  1. When I increase the batch size in the trainer, I still get an out-of-memory error.

Increasing the Training Arguments batch size shouldn't increase the memory usage if you also have a smaller mini-batch size set. Could it be that the mini-batch size was rather large here?

  2. When I increase the mini-batch size in the loss, it works well.

This is a bit odd to me. Is this only when the Training Arguments batch size is small?

My recommendation would be to use a large Training Arguments batch size (e.g. 256, 512, 1024, 2048) and a small mini-batch size (e.g. 16). If that works, you can increase the mini-batch size to make training slightly faster. Do let me know if you still have issues with that setup.
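
Roughly, that setup could look like the sketch below. The model and dataset names are just placeholders, swap in your own:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("microsoft/mpnet-base")

# Any (anchor, positive) pair dataset works here; this one is just an example
train_dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train")

# Large "outer" batch size: the pool that the in-batch negatives are drawn from
args = SentenceTransformerTrainingArguments(
    output_dir="models/mpnet-base-all-nli",
    per_device_train_batch_size=1024,
)

# Small "inner" mini-batch size: what actually gets forwarded through the model
# at once, so this is what determines peak GPU memory usage
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```

If that runs, you can bump `mini_batch_size` up (e.g. 32, 64) until you approach your memory limit; with the Cached losses this only affects speed and memory, not the loss itself.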

bely66 commented 2 months ago

Hi @tomaarsen

That makes sense, I should try this.

When I mentioned that making the mini-batch bigger doesn't trigger the error, the per_gpu_batch size was already small.

bely66 commented 2 months ago

@tomaarsen

What happens when I use the Cached loss with a mixture of triplet and pair datasets? How are the in-batch negatives chosen:

  1. Are they chosen from that dataset sampler's proportion of the batch (if the batch size is 1024 and this sampler gets 20%, are the in-batch negatives drawn from that 20% of the 1024)?

  2. Or are they chosen from the full 1024 for each of the pair samplers?

And for triplets, I think it should ignore them, right?

tomaarsen commented 2 months ago

Good question. Every single batch will only ever contain samples from one dataset. This is to avoid issues with multiple dataset formats being used with the same loss.

For pairs, each "anchor" gets 1023 in-batch negatives, i.e. the "positive" from all other 1023 pairs in the batch. For triplets, in-batch negatives are still used in addition to the "true" negative: each "anchor" has 1 hard negative, 1023 in-batch negatives from the "positive" of all other triplets, AND 1023 in-batch negatives from the "negative" of all other triplets, so you end up with 2047 negatives per anchor. If you don't want in-batch negatives and only want your 1 true negative (e.g. if the in-batch negatives are likely to be relevant for your anchors), then you can use TripletLoss instead, which doesn't take any in-batch negatives.
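
To make that concrete, here's a rough sketch (model and dataset names are just placeholders) of a pair + triplet mixture passed as a dict of datasets; every batch is then drawn from exactly one of the two:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("microsoft/mpnet-base")

# One pair dataset and one triplet dataset (placeholders)
pair_dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train")
triplet_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="train")

# The same Cached loss handles both (anchor, positive) and (anchor, positive, negative)
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)

# With a dict of datasets, every batch contains samples from only one of them,
# so in-batch negatives never cross dataset boundaries
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset={"pairs": pair_dataset, "triplets": triplet_dataset},
    loss=loss,
)
```

So the 20% you mention would only affect how often a batch comes from that dataset, not what a single batch contains.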

If you want to do cross-dataset in-batch negative sampling, then you have to combine the datasets yourself before training (e.g. concatenate all pair datasets together).
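
As a rough sketch of that (the dataset names and column names here are just assumptions), you can rename columns so the pair datasets line up and then concatenate them:

```python
from datasets import concatenate_datasets, load_dataset

# Two (anchor, positive) pair datasets; names and columns are placeholders
pairs_a = load_dataset("sentence-transformers/all-nli", "pair", split="train")
pairs_b = load_dataset("sentence-transformers/gooaq", split="train")

# concatenate_datasets requires identical column names, so rename first
pairs_b = pairs_b.rename_columns({"question": "anchor", "answer": "positive"})

# Batches drawn from this combined dataset can mix samples from both sources,
# so in-batch negatives are sampled across the original datasets as well
combined_pairs = concatenate_datasets([pairs_a, pairs_b])
```

From there you can pass `combined_pairs` to the trainer like any other pair dataset.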

I hope that makes it a bit clearer.

bely66 commented 2 months ago

That's amazing, thank you!