OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

Multiple calls to transforms warm_up() #2317

Open · vince62s opened this issue 1 year ago

vince62s commented 1 year ago

@Zenglinxiao When you implemented #1912 you added __setstate__ / __getstate__ logic for multiprocessing.

If I am not wrong, and @anderleich / @panosk faced the same issue, here is what is happening:

When using build_vocab.py there is a call to make_transforms() in the main process, and then we spawn n_threads processes. Because we pass the transforms created in the main process, the pickling/unpickling mechanism triggers another call to warm_up() in __setstate__, hence we could avoid the first call to warm_up() in make_transforms(). Even when we use n_threads=1 we still spawn another process, so the behavior is the same.
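A minimal sketch of that mechanism (the class and attribute names are hypothetical, not OpenNMT-py's actual ones): the transform warms up once in its constructor, and again when __setstate__ runs in the spawned process.

import multiprocessing as mp

class BigTransform:
    def __init__(self, opts):
        self.opts = opts
        self.warm_up()  # first call, in the main process

    def warm_up(self):
        # stands in for loading big resources (models, dictionaries, ...)
        print(f"warm_up() in pid {mp.current_process().pid}")

    def __getstate__(self):
        return {"opts": self.opts}  # heavy state is dropped for pickling

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.warm_up()  # second call, on unpickling in the child process

def worker(transform):
    pass  # by the time we get here, warm_up() has already run again

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)   # spawned, as in build_vocab
    t = BigTransform(opts={})                  # warm_up #1
    p = mp.Process(target=worker, args=(t,))   # t is pickled to the child
    p.start()                                  # warm_up #2, in the child
    p.join()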

When we train, the story is a little different. If we use num_workers=0 the DataLoader is not used, everything happens in the main process, hence calling warm_up() is required somewhere (currently in the make_transforms() of build_dynamic_dataset_iter). If num_workers>0 then we fall back into the same situation as in build_vocab.
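Reusing the BigTransform from the sketch above, the two training paths look roughly like this (again with hypothetical names; the real dataset class is different):

from torch.utils.data import DataLoader, IterableDataset

class SketchDataset(IterableDataset):
    def __init__(self, transforms):
        self.transforms = transforms  # pickled along with the dataset
    def __iter__(self):
        yield from ()  # the actual data pipeline is elided

transforms = [BigTransform(opts={})]  # warm_up #1, in the main process

# num_workers=0: no worker process, no pickling; the warm_up() above
# is the only one, so it has to happen somewhere in the main process.
iter0 = DataLoader(SketchDataset(transforms), batch_size=None, num_workers=0)

# num_workers>0: the dataset (and its transforms) is pickled to each
# worker, where __setstate__ triggers warm_up() again -> double work.
iterN = DataLoader(SketchDataset(transforms), batch_size=None, num_workers=2)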

What do you think would be the best approach to avoid the double warm_up()? It is quite annoying for the transforms that load big resources.

cc @francoishernandez

Zenglinxiao commented 1 year ago

Hi @vince62s, sorry for the late reply due to limited bandwidth.

The transforms would warm up twice in the build_vocab step, but not in the training step; that was the case before version 3.0. We did not bother to optimize build_vocab as it was not an urgent issue at the time.

The changes introduced in release 3.0, switching to the DataLoader, make things different: https://github.com/OpenNMT/OpenNMT-py/blob/fcc205ba43489315df265fedb39e91f08ed1f414/onmt/inputters/dynamic_iterator.py#L346-L351

The data_iter is already initialized with transform objects. Since num_workers is passed to the DataLoader, it will spawn other processes for multiprocessing, and the transform objects will then be pickled and unpickled through the __getstate__ and __setstate__ methods.

To mitigate this issue, we should only initialize/warm up transform objects in the processes where they will be used, avoiding the pickling and unpickling.

One way to solve this is to delay make_transforms to DynamicDatasetIter._init_datasets: https://github.com/OpenNMT/OpenNMT-py/blob/fcc205ba43489315df265fedb39e91f08ed1f414/onmt/inputters/dynamic_iterator.py#L185

def build_dynamic_dataset_iter(...):
    ...
    # transforms = make_transforms(opt, transforms_cls, vocabs)
    # delay this to DynamicDatasetIter._init_datasets
    corpora = get_corpora(opt, task)
    if corpora is None:
        assert task != CorpusTask.TRAIN, "only valid corpus is ignorable."
        return None
    data_iter = DynamicDatasetIter.from_opt(
        corpora, transforms_cls,  # pass the classes rather than the objects
        vocabs, opt, task, copy=copy,
        stride=stride, offset=offset)
    ...
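The iterator side could then look roughly like this (a sketch only, with a hypothetical signature; it assumes the iterator keeps opt, transforms_cls and vocabs as attributes, which, as the next comment points out, is not currently the case for opt):

class DynamicDatasetIter:
    ...
    def _init_datasets(self, worker_id):
        # runs in the process that actually consumes the data (a
        # DataLoader worker, or the main process when num_workers == 0),
        # so warm_up() only ever fires where the transforms are used
        if self.transforms is None:
            self.transforms = make_transforms(
                self.opt, self.transforms_cls, self.vocabs)
        ...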
vince62s commented 1 year ago

I really thought your bandwidth was unlimited :) Thanks for your feedback, appreciated.

There is an issue with your solution: we need access to "opt" to build the transforms from the classes. The way we build data_iter with its from_opt method makes it difficult (or not very natural) to carry the full opt over as an attribute of the DynamicDatasetIter.

Instead, what if we remove the warm_up() from make_transforms() and call it manually only in the case num_workers=0? In all other cases, __setstate__ will do the job. Would that work?
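A sketch of that proposal (hedged; the real make_transforms signature and the transform constructor may differ):

def make_transforms(opts, transforms_cls, vocabs):
    # build the transform objects, but with no warm_up() here anymore
    return {name: cls(opts) for name, cls in transforms_cls.items()}

transforms = make_transforms(opt, transforms_cls, vocabs)
if opt.num_workers == 0:
    # the main process consumes the data itself: warm up explicitly
    for transform in transforms.values():
        transform.warm_up(vocabs)
# num_workers > 0: each DataLoader worker unpickles the transforms,
# and __setstate__ calls warm_up() there, exactly once per worker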