vince62s opened this issue 1 year ago
Hi @vince62s, Sorry for the late reply due to limited bandwidth.
Before version 3.0, the transforms would warm_up twice in the build_vocab step, but not in the training step. We did not bother to optimize build_vocab as it was not an urgent issue at the time.
The changes introduced in release 3.0, switching to the DataLoader, make things different: https://github.com/OpenNMT/OpenNMT-py/blob/fcc205ba43489315df265fedb39e91f08ed1f414/onmt/inputters/dynamic_iterator.py#L346-L351
The `data_iter` is already initialized with transform objects. Since `num_workers` is passed to the DataLoader, it will spawn other processes for multiprocessing, and the transform objects will then be pickled and unpickled through the `__getstate__` and `__setstate__` methods.
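For illustration, a minimal sketch of that pattern (hypothetical code, not the actual OpenNMT-py Transform class): the heavy state is dropped in `__getstate__` and rebuilt through `warm_up` in `__setstate__`, so every process that unpickles the object pays the warm-up cost again.

```python
def load_big_model(opts):
    # placeholder for an expensive load (SentencePiece model, BPE codes, ...)
    return object()


class SketchTransform:
    """Illustrative only: heavy resources are rebuilt on unpickling."""

    def __init__(self, opts):
        self.opts = opts
        self.model = None

    def warm_up(self, vocabs=None):
        # the expensive part that should run only once per process
        self.model = load_big_model(self.opts)

    def __getstate__(self):
        # drop the heavy, possibly non-picklable state before pickling
        return {"opts": self.opts}

    def __setstate__(self, state):
        self.opts = state["opts"]
        # every unpickle (one per DataLoader worker) triggers a fresh warm_up
        self.warm_up()
```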
To mitigate this issue, we should only initialize/warm_up transform objects in the processes where they will be used, to avoid pickling and unpickling them.
One way we could try is to delay `make_transforms` to `DynamicDatasetIter._init_datasets` to solve this: https://github.com/OpenNMT/OpenNMT-py/blob/fcc205ba43489315df265fedb39e91f08ed1f414/onmt/inputters/dynamic_iterator.py#L185
```python
def build_dynamic_dataset_iter(opt, transforms_cls, vocabs, copy, task,
                               stride, offset):
    ...
    # transforms = make_transforms(opt, transforms_cls, vocabs)
    # delay this to DynamicDatasetIter._init_datasets
    corpora = get_corpora(opt, task)
    if corpora is None:
        assert task != CorpusTask.TRAIN, "only valid corpus is ignorable."
        return None
    data_iter = DynamicDatasetIter.from_opt(
        corpora, transforms_cls,  # pass class rather than object
        vocabs, opt, task, copy=copy,
        stride=stride, offset=offset)
    ...
```
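On the iterator side, the delayed construction could look roughly like this (a hypothetical sketch with a simplified argument list, not a patch against the actual `DynamicDatasetIter`): keep only `transforms_cls` at construction time and build the objects lazily in `_init_datasets`.

```python
from onmt.transforms import make_transforms


class DynamicDatasetIter:
    """Sketch: store the transform classes only, build the objects lazily."""

    def __init__(self, corpora, transforms_cls, vocabs, opt, task):
        self.corpora = corpora
        self.transforms_cls = transforms_cls  # classes are cheap to pickle
        self.vocabs = vocabs
        self.opt = opt  # needed later to instantiate the transforms
        self.task = task
        self.transforms = None

    def _init_datasets(self, worker_id):
        if self.transforms is None:
            # instantiate + warm_up once, inside the process that will
            # actually apply the transforms (main process or DataLoader worker)
            self.transforms = make_transforms(
                self.opt, self.transforms_cls, self.vocabs)
        # ... build the datasets using self.transforms ...
```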
I really thought your bandwidth was unlimited :) thanks for your feedback, appreciated.
There is an issue with your solution: we need access to `opt` to build the transforms from the classes.
The way we build `data_iter` with its `from_opt` method makes it difficult (or not very natural) to attach the full `opt` attribute to the `DynamicDatasetIter`.
Instead, what if we remove the `warm_up` call from `make_transforms()` and call it manually only in the `num_workers=0` case? In all other cases, `__setstate__` will do the job.
Would that work?
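Something along these lines, as a rough sketch with simplified signatures (the real `make_transforms` and `build_dynamic_dataset_iter` take more arguments):

```python
def make_transforms(opt, transforms_cls, vocabs):
    """Build the transform objects but do NOT warm them up anymore."""
    return {name: cls(opt) for name, cls in transforms_cls.items()}


def build_dynamic_dataset_iter(opt, transforms_cls, vocabs, task):
    transforms = make_transforms(opt, transforms_cls, vocabs)
    if opt.num_workers == 0:
        # single-process path: nothing gets pickled, so __setstate__ never
        # runs and warm_up has to be called explicitly, exactly once
        for transform in transforms.values():
            transform.warm_up(vocabs)
    # with num_workers > 0, each DataLoader worker unpickles the transforms
    # and __setstate__ takes care of the warm_up
    ...
```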
@Zenglinxiao When you implemented #1912 you added the `__setstate__` / `__getstate__` logic for multiprocessing.
If I am not wrong, and if @anderleich / @panosk faced the same issue, here is what is happening:
When using `build_vocab.py` there is a call to `make_transforms()` in the main process, and then we spawn `n_threads` processes. Because we pass the transforms created in the main process, the pickling/unpickling mechanism triggers another call to `warm_up()` in `__setstate__`, hence we could avoid the first call to `warm_up` in `make_transforms`. Even when we use `n_threads=1` we spawn another process, so the behavior is the same.
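To make the double warm_up concrete, here is a tiny self-contained demo of the mechanism (a hypothetical transform, not OpenNMT-py code), simulating the per-worker pickling round-trip in a single process:

```python
import pickle


class CountingTransform:
    """Demo only: count how many times warm_up runs across a pickle round-trip."""
    warm_up_calls = 0

    def warm_up(self, vocabs=None):
        CountingTransform.warm_up_calls += 1  # stands in for loading big stuff

    def __getstate__(self):
        return {"warmed": True}

    def __setstate__(self, state):
        self.warm_up()


t = CountingTransform()
t.warm_up()                             # call #1, e.g. in make_transforms()
clone = pickle.loads(pickle.dumps(t))   # what each spawned worker effectively does
print(CountingTransform.warm_up_calls)  # -> 2: warmed up again in __setstate__
```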
When we train, the story is a little different. If we use `num_workers=0` the DataLoader is not used, everything happens in the main process, hence calling `warm_up` is required somewhere (currently in the `make_transforms` of `build_dynamic_dataset_iter`). If `num_workers>0` then we fall back into the same situation as in build_vocab.

What do you think would be the best approach to avoid the double `warm_up` (which is quite annoying for transforms that load big stuff)?

cc @francoishernandez