allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

When 'instances_per_epoch' is set in the MultiTaskDataLoader class, its __len__ function returns a wrong value. #5732

Closed wsmgh closed 1 year ago

wsmgh commented 1 year ago

Checklist

Description

When I perform multi-task learning with AllenNLP, I configure the MultiTaskDataLoader as follows: [image] I set 'instances_per_epoch' to 8000 and 'batch_size' to 16, so I expect about 500 steps per epoch (8000 / 16). However, when I run my code, the progress bar shows 3000 steps. In reality there are not that many: the epoch completes after 502 steps. [image]
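A quick way to confirm this kind of mismatch is to compare what `len(loader)` reports with the number of batches the loader actually yields. The sketch below is not AllenNLP code: `ToyLoader` and `compare_len_to_batches` are hypothetical names, and the toy loader simply hard-codes a wrong length to mimic the symptom (it yields exactly 500 batches rather than the 502 observed in the real run).

```python
from typing import Iterator, List, Tuple


def compare_len_to_batches(loader) -> Tuple[int, int]:
    """Return (reported, actual): len(loader) and the number of batches it yields."""
    reported = len(loader)
    actual = sum(1 for _ in loader)  # consumes one full pass over the loader
    return reported, actual


class ToyLoader:
    """Hypothetical stand-in for a data loader whose __len__ over-counts."""

    def __init__(self, num_instances: int, batch_size: int, reported_len: int) -> None:
        self._num_instances = num_instances
        self._batch_size = batch_size
        self._reported_len = reported_len

    def __len__(self) -> int:
        # Deliberately wrong: returns whatever length it was told to report.
        return self._reported_len

    def __iter__(self) -> Iterator[List[int]]:
        # Yield consecutive index ranges as batches.
        for start in range(0, self._num_instances, self._batch_size):
            end = min(start + self._batch_size, self._num_instances)
            yield list(range(start, end))


toy = ToyLoader(num_instances=8000, batch_size=16, reported_len=3000)
reported, actual = compare_len_to_batches(toy)
print(f"len() reports {reported} batches, iteration yields {actual}")
```

The same helper could in principle be pointed at a real data loader, as long as iterating it for one full epoch is acceptable.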

After checking, I found that the following code in MultiTaskDataLoader is wrong: [image]

From the __init__ function of MultiTaskDataLoader, we can see that when 'instances_per_epoch' is set, a sampler is also provided. [image]

So, when counting the instances for each dataset, we should take into account the proportion that the sampler assigns to each dataset. The incorrect code above should therefore be replaced with the following: [image]

Here is the code:

```python
# Relative weight of each dataset, as reported by the sampler.
dataset_proportions = self.sampler.get_task_proportions(self._loaders)
proportion_sum = sum(dataset_proportions.values())
# Split the epoch's instance budget across datasets by normalized weight.
num_instances_per_dataset = {
    key: math.floor(proportion * self._instances_per_epoch / proportion_sum)
    for key, proportion in dataset_proportions.items()
}
```
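To illustrate the arithmetic behind the proposed fix, here is a minimal standalone sketch. It does not use AllenNLP: the function names, the three task names, and the 3:2:1 proportions are made up, and `count_batches` assumes a scheduler that simply mixes all instances into shared batches, which may not match the scheduler actually in use. The `overcounted` value is only an assumption about how an inflated length could arise, not a quote from the library.

```python
import math
from typing import Dict


def instances_per_dataset(
    proportions: Dict[str, float], instances_per_epoch: int
) -> Dict[str, int]:
    """Split an epoch's instance budget across datasets by normalized proportion."""
    total = sum(proportions.values())
    return {
        name: math.floor(p * instances_per_epoch / total)
        for name, p in proportions.items()
    }


def count_batches(instance_counts: Dict[str, int], batch_size: int) -> int:
    """Hypothetical scheduler stand-in: all instances share one stream of batches."""
    return math.ceil(sum(instance_counts.values()) / batch_size)


# Made-up sampler weights for three hypothetical tasks.
proportions = {"task_a": 3.0, "task_b": 2.0, "task_c": 1.0}

# Proportion-aware count: the per-dataset budgets sum to roughly 8000.
per_dataset = instances_per_dataset(proportions, instances_per_epoch=8000)
steps = count_batches(per_dataset, batch_size=16)

# Assumed failure mode: every dataset is credited with the full budget.
overcounted = count_batches({name: 8000 for name in proportions}, batch_size=16)

print(per_dataset)         # {'task_a': 4000, 'task_b': 2666, 'task_c': 1333}
print(steps, overcounted)  # 500 1500
```

Because of `math.floor`, the per-dataset budgets can sum to slightly less than 'instances_per_epoch' (7999 here), so the step count is approximate rather than exact.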

Python traceback:

``` ```

Related issues or possible duplicates

Environment

OS: Linux

Python version: 3.7.13

AllenNLP version: 2.10.1

Output of pip freeze:

``` ```

Steps to reproduce

Example source:

``` ```

github-actions[bot] commented 1 year ago

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇