DigitalPhonetics / IMS-Toucan

Multilingual and Controllable Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart.
Apache License 2.0

Questions about LAML procedure implementation #158

Closed AlexSteveChungAlvarez closed 10 months ago

AlexSteveChungAlvarez commented 10 months ago

Hello @Flux9665! I hope you are doing well. I have been reading your code in order to understand your implementation of the LAML procedure for training. I have a question about these lines of code:

    while len(batches) < batch_size:
        for index in random.sample(list(range(len(datasets))), len(datasets)):
            if len(batches) < batch_size:
                # we get one batch for each task (i.e. language in this case) in a randomized order
                try:
                    batch = next(train_iters[index])
                    batches.append(batch)
                except StopIteration:
                    train_iters[index] = iter(train_loaders[index])
                    batch = next(train_iters[index])
                    batches.append(batch)

Here you don't ensure that all languages have the same number of samples in a batch. For example, if the batch size were 32 and you had 3 languages, the "batches" array would have 11 samples from each of 2 randomly selected languages and 10 from the third. Is this OK according to this procedure? I thought the goal was to have the same number of samples from each language per batch.

My other question is: what is the goal of putting the languages in a random order within the batch? I think keeping the order of the "datasets" array would achieve the same result, since the languages would be distributed equally either way; as it is now, that only happens if you choose a batch_size that is a multiple of the number of languages. I will be waiting for your answers to these theoretical questions!
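For illustration, here is a quick count of a single batch using the same loop structure (just a rough sketch with 3 placeholder language names, not the toolkit's actual datasets):

    import random
    from collections import Counter

    # Rough sketch: count how many samples each of 3 placeholder languages
    # contributes to one batch of size 32, using the same sampling loop.
    batch_size = 32
    languages = ["lang_a", "lang_b", "lang_c"]

    picks = []
    while len(picks) < batch_size:
        for index in random.sample(range(len(languages)), len(languages)):
            if len(picks) < batch_size:
                picks.append(languages[index])

    print(Counter(picks))
    # e.g. Counter({'lang_a': 11, 'lang_c': 11, 'lang_b': 10})

Running it a few times always gives an 11/11/10 split, with the language that only gets 10 changing from batch to batch.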

lbehringer commented 10 months ago

Hi @AlexSteveChungAlvarez.

As far as I understand, the random sampling from datasets is used exactly for the case where batch_size % number_of_languages != 0 to minimize the imbalance of samples from each dataset across iterations.

Given your example with batch size 32 and 3 languages, for each batch you will get 10 samples from each of the 3 languages, then 2 of the 3 languages will randomly be selected to complete the batch.

While this still means that any single batch doesn't contain an equal number of samples from all languages, it should even out across all batches.

Without using random, you would indeed always get 11 samples from each of the first 2 datasets in your training pipeline, and only 10 samples from the third dataset.
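To make that concrete, here is a rough simulation (placeholder numbers, not the toolkit's code) that draws many batches of size 32 from 3 languages and counts the totals per language:

    import random
    from collections import Counter

    # Rough simulation: over many batches of size 32 with 3 languages,
    # count the total number of samples drawn per language.
    batch_size, num_languages, num_batches = 32, 3, 3000
    totals = Counter()

    for _ in range(num_batches):
        picks = 0
        while picks < batch_size:
            for index in random.sample(range(num_languages), num_languages):
                if picks < batch_size:
                    totals[index] += 1
                    picks += 1

    print(totals)
    # e.g. Counter({1: 32031, 0: 31990, 2: 31979}) -- close to 32000 each

The per-batch split is always 11/11/10, but the totals stay very close to 32000 per language, i.e. the imbalance averages out over training.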

AlexSteveChungAlvarez commented 10 months ago

> Given your example with batch size 32 and 3 languages, for each batch you will get 10 samples from each of the 3 languages, then 2 of the 3 languages will randomly be selected to complete the batch.

Then should the number of steps be a multiple of the number of languages, so that the number of samples can even out across all batches?

> While this still means that any single batch doesn't contain an equal number of samples from all languages, it should even out across all batches.

Following the same example, for this to happen there would have to be 3 rounds/steps, so that it's possible to get 32 samples from each of the 3 languages. So the total number of steps should be a multiple of 3 for it to be possible to reach the same number of samples per language, shouldn't it?
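(For concreteness: 3 steps × 32 samples = 96 samples in total, which works out to exactly 96 / 3 = 32 samples per language if the extra slots are spread evenly over those 3 steps.)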

lbehringer commented 10 months ago

In theory, yes to both points; however, you can't guarantee that random.sample() will give you a perfectly equal distribution after a specified number of steps. You would need a different algorithm for that.

In practice, over thousands or even millions of steps, I doubt it makes a difference.
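For reference, one example of such a different algorithm (just a sketch, not something the toolkit implements) would be to rotate the language order deterministically instead of sampling it, so the extra slots cycle through all languages and balance out exactly every num_languages steps:

    from collections import Counter

    # Sketch of a deterministic alternative (not implemented in the toolkit):
    # rotate the language order each step so the "extra" slots cycle through
    # all languages, giving exact balance every num_languages batches.
    batch_size, num_languages = 32, 3
    totals = Counter()

    for step in range(3):  # 3 batches = one full rotation
        order = [(step + i) % num_languages for i in range(num_languages)]
        picks = 0
        while picks < batch_size:
            for index in order:
                if picks < batch_size:
                    totals[index] += 1
                    picks += 1

    print(totals)  # each language ends up with exactly 32 samples after 3 steps

Whether that guarantee is worth giving up the simplicity of random.sample() is another question; as said, over many steps the random version should be just as good in practice.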

Flux9665 commented 9 months ago

Great explanation by Lyonel, just one thing I want to add for completeness: originally, the LAML procedure was doing actual model-agnostic meta-learning. But over time we found that we could simplify the procedure, and at this point it is just multi-task learning and no longer really close to the original MAML procedure. So there are some discrepancies between what the paper describes and what is in the code now. The version in the code is simpler, faster, and works just as well.