UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Getting MultipleNegativesRankingLoss right #745

Open datistiquo opened 3 years ago

datistiquo commented 3 years ago

I desperately need to try out this loss, but in its current form that is impossible because it does not take care of batch structuring. I want to write a script to train with this loss while excluding multiple positives of the same class within a batch, because the loss expects at most one example per class: all other examples in the batch serve as negatives.
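To make the requirement concrete, here is a minimal illustration (assuming the usual (anchor, positive) `InputExample` format used by MultipleNegativesRankingLoss; the class labels in the comments are hypothetical):

```python
from sentence_transformers import InputExample

# With MultipleNegativesRankingLoss, every InputExample is an (anchor, positive) pair.
# For the anchor of pair i, the positives of all *other* pairs in the batch are
# treated as negatives.
batch = [
    InputExample(texts=["how do I reset my password?", "password reset instructions"]),  # class A
    InputExample(texts=["I forgot my password",        "password reset instructions"]),  # class A again:
    # its positive is now scored as a negative for the first anchor, even though it is
    # actually a correct match -- this is what the batch structuring has to prevent.
    InputExample(texts=["where is my order?",          "order tracking page"]),          # class B: fine
]
```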

The problem is that I cannot find any examples of PyTorch datasets/dataloaders where I can structure my batch the way I want with specific examples. Also, I still have trouble understanding how the dataloader actually interacts with the dataset to structure the batch, so I do not even know where to start...

@nreimers Do you perhaps have a minimal code snippet, or at least a pointer to where I should start? In another issue you said I should write a custom dataset or a custom dataloader, but I think the latter is usually just a synonym for the former, and I cannot find anything on how to do it. The PyTorch data terminology is confusing, because by "custom dataloader" most people mean the standard DataLoader with a custom dataset. So this is all fairly new territory for me.

nreimers commented 3 years ago

Have a look at the various options PyTorch provides: https://pytorch.org/docs/stable/data.html

datistiquo commented 3 years ago

@nreimers Could you point out which part of that page is the key point you mean? I do not want to pick an arbitrary option if it does not solve my issue. Sadly, the page only covers standard usage and does not show how I can customize batches, e.g. by allowing only a single example per class label inside a batch.

I think just posting the link does not help much, as I assume my use case is more complicated to solve... Or maybe I am totally on the wrong track? That is why I ask.

EDIT: After several hours I think I found a possible starting point. There are interesting things like Samplers, and those (rather than datasets or dataloaders) seem to be the intended way to customize batches. Maybe it was too obvious... Now I hope I can make it work with this information. :)

datistiquo commented 3 years ago

Oh very sad: https://github.com/pytorch/pytorch/issues/28743

nreimers commented 3 years ago

You need a map-style dataset (you can just use a list of InputExamples) and provide a batch_sampler to the DataLoader. The batch_sampler yields the indices of one specific batch that is used for training.
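A rough, untested sketch of what that could look like (the sampler class is just an illustration, not part of sentence-transformers; `examples` is assumed to be a plain list of InputExamples and `labels[i]` the class of `examples[i]`):

```python
import random
from collections import defaultdict

from torch.utils.data import DataLoader


class UniqueLabelBatchSampler:
    """Yields lists of indices such that no two examples in a batch share a label."""

    def __init__(self, labels, batch_size):
        assert batch_size <= len(set(labels)), "batch_size must not exceed the number of classes"
        self.batch_size = batch_size
        self.label_to_indices = defaultdict(list)
        for idx, label in enumerate(labels):
            self.label_to_indices[label].append(idx)

    def __iter__(self):
        all_labels = list(self.label_to_indices)
        for _ in range(len(self)):
            # pick batch_size distinct labels, then one random example of each
            chosen = random.sample(all_labels, self.batch_size)
            yield [random.choice(self.label_to_indices[label]) for label in chosen]

    def __len__(self):
        n_examples = sum(len(v) for v in self.label_to_indices.values())
        return n_examples // self.batch_size


# examples: list of InputExamples (the map-style dataset), labels[i]: class of examples[i]
# train_dataloader = DataLoader(examples, batch_sampler=UniqueLabelBatchSampler(labels, 32))
# and then pass the DataLoader to model.fit() as usual.
```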

datistiquo commented 3 years ago

Yes, I already figured this out and am currently trying something. Maybe I will post it here.

But constructing this is more involved than I thought... Just picking classes and a single sample per class randomly for each batch is straightforward. But now I am trying to ensure that every example ends up in some batch. So I would like to fill the batches sequentially with samples from each class, and then for the next batch pick the next element of each class, and so on in a cyclic manner.

The batch size needs to be smaller than or equal to the number of available classes. Even counting the total number of batches needed is more involved; just using total_samples / batch_size is not suitable. For example, if you want every sample of every class to appear in a batch during the epoch, then a class with 100 examples of course requires at least 100 batches... So there is a lot to think about.
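For reference, here is roughly what I have in mind (untested sketch, class and variable names are mine): classes are filled round-robin and each class cycles through its own shuffled examples, so the epoch length is driven by the largest class rather than by total_samples / batch_size.

```python
import itertools
import math
import random
from collections import defaultdict


class CyclicClassBatchSampler:
    """Round-robin over classes; each class cycles through its own examples."""

    def __init__(self, labels, batch_size):
        self.classes = sorted(set(labels))
        assert batch_size <= len(self.classes), "batch_size must not exceed the number of classes"
        self.batch_size = batch_size
        self.per_class = defaultdict(list)
        for idx, label in enumerate(labels):
            self.per_class[label].append(idx)
        # enough batches so that even the largest class sees all of its examples;
        # each batch only covers batch_size of the classes, hence the scaling factor
        largest = max(len(v) for v in self.per_class.values())
        self.num_batches = math.ceil(largest * len(self.classes) / batch_size)

    def __iter__(self):
        random.shuffle(self.classes)
        class_cycle = itertools.cycle(self.classes)
        sample_cycles = {}
        for label, idxs in self.per_class.items():
            random.shuffle(idxs)
            sample_cycles[label] = itertools.cycle(idxs)
        for _ in range(self.num_batches):
            # consecutive labels from the cycle are distinct because batch_size <= #classes
            batch_labels = [next(class_cycle) for _ in range(self.batch_size)]
            yield [next(sample_cycles[label]) for label in batch_labels]

    def __len__(self):
        return self.num_batches
```

With 100 examples in the largest class and a batch size equal to the number of classes, this gives exactly 100 batches, which matches the counting above.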

Is it possible out of the box with your framework to train with batches of unequal size? That way one could also allow batches of different sizes where needed...

What is your experience with that, @nreimers? Or did you just use this loss without any major customization in your use cases?