@sorenmulli

> I could maybe use a slight explanation/motivation for interleaving until exhaustion in test/train - I am probably misunderstanding something.
> In test: why do we not continually sample from each dataset (with the given weight) - why any need for handling exhaustion? Are we using epoch learning? Then it makes sense and I missed it.
> For evaluation, why not exhaust both datasets? Why stop when the smallest is done?
My reasoning was that we want to evaluate the datasets "as equally as possible". So if we value the datasets equally, then they should contribute equally to the WER measurement. Otherwise we'd potentially end up in a situation where a model performs really well/badly just because one of the datasets' evaluation splits is way larger than the others.
If you have any counterpoints to that then I'm open to discuss 🙂
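As a toy illustration of that concern (the numbers below are made up and not from any of the actual datasets), a pooled WER over two evaluation splits of very different sizes ends up almost entirely determined by the larger split:

```python
# Toy example: a pooled WER over two evaluation splits of very different
# sizes (all numbers here are made up).
wer_large, words_large = 0.05, 100_000  # large split, low WER
wer_small, words_small = 0.30, 5_000    # small split, high WER

pooled_wer = (wer_large * words_large + wer_small * words_small) / (
    words_large + words_small
)
print(f"Pooled WER: {pooled_wer:.3f}")  # ~0.062, dominated by the large split
```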
> My reasoning was that we want to evaluate the datasets "as equally as possible". [...]
No, I think this is a reasonable test approach. I guess if you want the full test results for each dataset in the mix, you can then run separate testing for each dataset (and average them however you want).
Also, I mistyped in my previous comment - my first batch of questions was about training; could you also elaborate a little on why we need to handle exhaustion - do we keep track of epochs?
I think the strategy of continuing until the largest or smallest dataset is exhausted in train and test respectively is reasonable. Especially for test it feels canonical. For training one has to handle size differences somehow, and I think many choices would be reasonable as long as the user understands what is going on; I do not think it is obvious what is going on when only looking at the config. Maybe log a warning or info stating what is going on in case `dataset_probabilities` are null? I am thinking of the case where one dataset is WAY larger than the others, so we might see samples from the smaller one many, many times, and an inexperienced user might not notice.
Regarding @sorenmulli's comment, it could possibly be a nice-to-have convenience feature to be able to run each test in one go. Not sure it fits into this PR, though.
@sorenmulli

> Also, I mistyped in my previous comment - my first batch of questions was about training; could you also elaborate a little on why we need to handle exhaustion - do we keep track of epochs?
Ah, the idea there was basically that we want to use all the training samples that we have available. If we stop when the first dataset is exhausted then we lose out on a lot of samples in the other datasets. This makes it possible to, e.g., train on 1 epoch of NST-da and 2 epochs of common-voice-da. It is relatively common when training language models to train on more epochs of the smaller datasets than of the big ones.
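As a rough back-of-the-envelope illustration (the dataset sizes below are invented), interleaving with equal probabilities until the largest dataset is exhausted amounts to one epoch of the largest dataset and proportionally more epochs of the smaller ones:

```python
# Hypothetical dataset sizes, purely to illustrate the effect of interleaving
# with equal probabilities until the largest dataset is exhausted.
sizes = {"nst-da": 200_000, "common-voice-da": 100_000}

# With equal probabilities each dataset contributes the same number of draws,
# and we keep drawing until the largest dataset has been seen once in full.
draws_per_dataset = max(sizes.values())
for name, size in sizes.items():
    print(f"{name}: ~{draws_per_dataset / size:.1f} epochs")
# nst-da: ~1.0 epochs
# common-voice-da: ~2.0 epochs
```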
> Maybe log a warning or info stating what is going on in case `dataset_probabilities` are null? I am thinking of the case where one dataset is WAY larger than the others, so we might see samples from the smaller one many, many times, and an inexperienced user might not notice.
Added a logging message now.
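For context, a minimal sketch of what such a logging message could look like; the function and parameter names here are hypothetical and not taken from the actual code:

```python
# Hypothetical sketch of the logging added here; the function and variable
# names are illustrative, not the actual ones in the PR.
import logging

logger = logging.getLogger(__name__)


def resolve_dataset_probabilities(
    dataset_probabilities: list[float] | None, dataset_names: list[str]
) -> list[float]:
    """Fall back to equal probabilities and tell the user what that implies."""
    if dataset_probabilities is None:
        dataset_probabilities = [1 / len(dataset_names)] * len(dataset_names)
        logger.info(
            "`dataset_probabilities` was not set, so the datasets %s will be "
            "sampled with equal probability. If the datasets differ a lot in "
            "size, samples from the smaller ones will be seen many times "
            "during training.",
            dataset_names,
        )
    return dataset_probabilities
```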
This PR enables training ASR models on multiple datasets.
It does this by interleaving the streamed datasets. Say we have three datasets X, Y and Z; then we'll sample x_0, y_0, z_0, x_1, y_1, z_1, and so on. This assumes equal sampling probabilities for the datasets, which can also be set manually.
Note that, during training, we interleave the datasets until the largest dataset has been exhausted. This means that if we sample from a large dataset X and a small dataset Y with equal probability, then the samples in Y will be seen many more times than those in X. In that case it is advisable to modify the sampling probabilities so that X is sampled much more often than Y. During evaluation (i.e., the validation and test splits), we interleave until the smallest dataset has been exhausted, as we do not want to evaluate on the same sample more than once.
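As a minimal sketch of this behaviour, assuming an implementation along the lines of Hugging Face's `datasets.interleave_datasets` (the actual code in this PR may differ):

```python
from datasets import Dataset, interleave_datasets

# Two toy datasets of different sizes.
x = Dataset.from_dict({"text": [f"x_{i}" for i in range(5)]})
y = Dataset.from_dict({"text": [f"y_{i}" for i in range(2)]})

# Training: interleave until the largest dataset is exhausted, so the smaller
# dataset is cycled through several times.
train_mix = interleave_datasets([x, y], stopping_strategy="all_exhausted")

# Evaluation: stop as soon as the smallest dataset is exhausted, so no sample
# is seen more than once. Passing `probabilities=[0.8, 0.2]` (and a `seed`)
# would instead sample the datasets with those weights.
eval_mix = interleave_datasets([x, y], stopping_strategy="first_exhausted")

print([row["text"] for row in train_mix])  # cycles y until all of x has been seen
print([row["text"] for row in eval_mix])   # stops once y runs out of samples
```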
Example usage:
If `dataset_probabilities` is not set then it will default to equal probabilities for all datasets.

This closes #28.