Closed woojeongjin closed 4 years ago
I observe the same error. As a step to reproduce, run the following to train M4C (I tried on a machine w/ 2 GPUs):
mmf_run config=projects/m4c/configs/textvqa/defaults.yaml \
datasets=textvqa \
model=m4c \
run_type=train_val \
training.num_workers=4
and while it is running, change some code that is used by a worker process:
cd your_mmf_package_root
# add some invalid syntax in the dataset file to simulate editing the code while running
echo "blablabla*&^%$@$" >> mmf/datasets/builders/textvqa/dataset.py
and wait a while.
The crash does not happen immediately. It happens roughly at the end of an epoch, where the DataLoader workers are joined and re-spawned. For M4C with the config above, it happens at iteration 271.
Error trace (the error trace seems to be printed from multiple processes, hence the duplicated lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/private/home/ronghanghu/.conda/envs/dev2/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
File "/private/home/ronghanghu/.conda/envs/dev2/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/private/home/ronghanghu/.conda/envs/dev2/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
exitcode = _main(fd)
File "/private/home/ronghanghu/.conda/envs/dev2/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
self = reduction.pickle.load(from_parent) File "/private/home/ronghanghu/workspace/mmf/mmf/__init__.py", line 5, in <module>
File "/private/home/ronghanghu/workspace/mmf/mmf/__init__.py", line 5, in <module>
from mmf import utils, common, modules, datasets, models
File "/private/home/ronghanghu/workspace/mmf/mmf/common/__init__.py", line 3, in <module>
from mmf import utils, common, modules, datasets, models
File "/private/home/ronghanghu/workspace/mmf/mmf/common/__init__.py", line 3, in <module>
from .registry import registry
File "/private/home/ronghanghu/workspace/mmf/mmf/common/registry.py", line 409, in <module>
from .registry import registry
File "/private/home/ronghanghu/workspace/mmf/mmf/common/registry.py", line 409, in <module>
setup_imports()
File "/private/home/ronghanghu/workspace/mmf/mmf/utils/env.py", line 146, in setup_imports
setup_imports()
File "/private/home/ronghanghu/workspace/mmf/mmf/utils/env.py", line 146, in setup_imports
"mmf.datasets." + folder_name + "." + dataset_name + "." + module_name
File "/private/home/ronghanghu/.conda/envs/dev2/lib/python3.7/importlib/__init__.py", line 127, in import_module
"mmf.datasets." + folder_name + "." + dataset_name + "." + module_name
File "/private/home/ronghanghu/.conda/envs/dev2/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/conceptual_captions/__init__.py", line 9, in <module>
return _bootstrap._gcd_import(name[level:], package, level)
File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/conceptual_captions/__init__.py", line 9, in <module>
from .builder import ConceptualCaptionsBuilder
File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/conceptual_captions/builder.py", line 4, in <module>
from .builder import ConceptualCaptionsBuilder
File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/conceptual_captions/builder.py", line 4, in <module>
from mmf.datasets.builders.coco import COCOBuilder
File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/coco/__init__.py", line 4, in <module>
from mmf.datasets.builders.coco import COCOBuilder
File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/coco/__init__.py", line 4, in <module>
from .builder import COCOBuilder
File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/coco/builder.py", line 10, in <module>
from .builder import COCOBuilder
File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/coco/builder.py", line 10, in <module>
from mmf.datasets.builders.textcaps.dataset import TextCapsDataset
File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/textcaps/dataset.py", line 2, in <module>
from mmf.datasets.builders.textcaps.dataset import TextCapsDataset
File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/textcaps/dataset.py", line 2, in <module>
from mmf.datasets.builders.textvqa.dataset import TextVQADataset
File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/textvqa/dataset.py", line 230
blablabla*&^%$
^
SyntaxError: invalid syntax
from mmf.datasets.builders.textvqa.dataset import TextVQADataset
File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/textvqa/dataset.py", line 230
blablabla*&^%$
^
SyntaxError: invalid syntax
I have looked into this, and it seems there is a limit to how much we can fix. With DataLoader workers, spawned processes always re-import the code. So if something is broken at the top level (import level, i.e. anything that throws as soon as the module is imported), you will always see an error; this is more of a multiprocessing issue than an MMF one. #374 aims to fix the part on the MMF side, where we import all of the files.
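To make the re-import behavior concrete, here is a minimal sketch that does not use MMF or PyTorch at all. It stands in for a spawned worker with a fresh interpreter started via `subprocess`, on the assumption that a spawned worker likewise imports modules from disk in their current state. The names `demo_module`, `value_seen_by_fresh_interpreter`, and `run_demo` are made up for this illustration.

```python
import importlib
import subprocess
import sys
import tempfile
from pathlib import Path

# Code run by the fresh interpreter: import the module from the directory
# passed as argv[1] and print the value it sees.
CHILD_CODE = (
    "import sys; sys.path.insert(0, sys.argv[1]); "
    "import demo_module; print(demo_module.VALUE)"
)


def value_seen_by_fresh_interpreter(module_dir: Path) -> str:
    """Import demo_module in a brand-new interpreter and return its VALUE."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(module_dir)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def run_demo():
    """Return (parent value before edit, parent value after edit, child value)."""
    with tempfile.TemporaryDirectory() as tmp:
        module_dir = Path(tmp)
        module_file = module_dir / "demo_module.py"
        module_file.write_text("VALUE = 'original'\n")

        sys.path.insert(0, str(module_dir))
        importlib.invalidate_caches()
        try:
            import demo_module  # the already-running "parent" imports once

            before = demo_module.VALUE
            # Simulate editing the code while the job is running.
            module_file.write_text("VALUE = 'edited'\n")
            after = demo_module.VALUE  # parent keeps its in-memory copy
            child = value_seen_by_fresh_interpreter(module_dir)
        finally:
            sys.path.remove(str(module_dir))
            sys.modules.pop("demo_module", None)
    return before, after, child


if __name__ == "__main__":
    print(run_demo())
```

The parent process keeps the version it imported at startup, while the new interpreter picks up the edited file. If the edit had been a syntax error, the new interpreter would fail at import time, which matches the trace above.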
The general recommendation is to keep two copies of the repo, one for development and one for running jobs, and to keep them in sync via GitHub or some other mechanism. Alternatively, it would make sense to copy the files before running the job. It might be worth looking into a dumbo-style setup for this.
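The "copy the files before running the job" option could look roughly like the sketch below. It is only an illustration: `snapshot_code` is a hypothetical helper, not part of MMF, and the ignore patterns are an assumption about what is safe to skip.

```python
import shutil
import tempfile
import time
from pathlib import Path


def snapshot_code(src_root, dst_parent) -> Path:
    """Copy the source tree into a fresh, uniquely named directory.

    Returns the path of the copied tree. dst_parent should live outside
    src_root, otherwise the copy would try to include itself.
    """
    src_root = Path(src_root).resolve()
    dst_parent = Path(dst_parent)
    dst_parent.mkdir(parents=True, exist_ok=True)

    # A timestamped prefix plus mkdtemp gives a unique directory per job.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    job_dir = Path(tempfile.mkdtemp(prefix=f"{src_root.name}-{stamp}-", dir=dst_parent))

    snapshot = job_dir / src_root.name
    shutil.copytree(
        src_root,
        snapshot,
        # Assumed to be unnecessary for running a job; adjust as needed.
        ignore=shutil.ignore_patterns(".git", "__pycache__", "*.pyc"),
    )
    return snapshot


if __name__ == "__main__":
    # Example paths only; replace with your own checkout and scratch space.
    print(snapshot_code("your_mmf_package_root", "/tmp/mmf_snapshots"))
```

The job would then be launched from the returned directory, so that edits in the development checkout no longer reach running workers. How the job is made to import from the snapshot (working directory, `PYTHONPATH`, or a separate install) depends on your setup.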
🐛 Bug
A code change affects other runs when num_workers > 0. If I change the code, a job that is already running appears to re-import the Python modules; the imported code is not fixed at launch time. You can test this with the following steps.
First, run any command. Then make a code change: add print("blahblah") to the text processor (e.g., BertTokenizer). At some point (right before validation or between training iterations), the running job will output "blahblah".
Command
To Reproduce
Steps to reproduce the behavior:
Expected behavior
If I make any code changes, they should not affect jobs that are already running.