A code change affects other runs.

woojeongjin commented 4 years ago

🐛 Bug

A code change affects other runs when num_workers > 0. If I change the code while a job is running, the running job appears to re-import Python modules, so the imported code is not frozen at launch time. You can test this with the following steps.

First, run any command. Then make a code change, for example add print("blahblah") to the text processor (e.g., BertTokenizer). At some point (right before validation or between training iterations), the running job will output "blahblah".

Command

To Reproduce

Steps to reproduce the behavior:

Any command. I used the Hateful Memes dataset and VisualBERT.

Expected behavior

Changes I make to the code should not affect jobs that are already running.


ronghanghu commented 4 years ago

I observe the same error. As a step to reproduce, run the following to train M4C (I tried on a machine w/ 2 GPUs):

mmf_run config=projects/m4c/configs/textvqa/defaults.yaml \
    datasets=textvqa \
    model=m4c \
    run_type=train_val \
    training.num_workers=4

and while it is running, change some code that is used by a worker process:

cd your_mmf_package_root
# add some invalid syntax in the dataset file to simulate editing the code while running
echo "blablabla*&^%$@$" >> mmf/datasets/builders/textvqa/dataset.py

and wait a while.

The crash does not happen immediately. It happens roughly at the end of an epoch, where the DataLoader workers are joined and re-spawned. For M4C with the config above, it happens at iteration 271.

Error trace (the error trace seems to be printed from multiple processes, hence the duplicated lines):

Traceback (most recent call last):
  File "<string>", line 1, in <module>
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/private/home/ronghanghu/.conda/envs/dev2/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
  File "/private/home/ronghanghu/.conda/envs/dev2/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
      File "/private/home/ronghanghu/.conda/envs/dev2/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
exitcode = _main(fd)
  File "/private/home/ronghanghu/.conda/envs/dev2/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
self = reduction.pickle.load(from_parent)  File "/private/home/ronghanghu/workspace/mmf/mmf/__init__.py", line 5, in <module>

  File "/private/home/ronghanghu/workspace/mmf/mmf/__init__.py", line 5, in <module>
    from mmf import utils, common, modules, datasets, models
  File "/private/home/ronghanghu/workspace/mmf/mmf/common/__init__.py", line 3, in <module>
    from mmf import utils, common, modules, datasets, models
  File "/private/home/ronghanghu/workspace/mmf/mmf/common/__init__.py", line 3, in <module>
    from .registry import registry
  File "/private/home/ronghanghu/workspace/mmf/mmf/common/registry.py", line 409, in <module>
    from .registry import registry
  File "/private/home/ronghanghu/workspace/mmf/mmf/common/registry.py", line 409, in <module>
    setup_imports()
  File "/private/home/ronghanghu/workspace/mmf/mmf/utils/env.py", line 146, in setup_imports
    setup_imports()
  File "/private/home/ronghanghu/workspace/mmf/mmf/utils/env.py", line 146, in setup_imports
    "mmf.datasets." + folder_name + "." + dataset_name + "." + module_name
  File "/private/home/ronghanghu/.conda/envs/dev2/lib/python3.7/importlib/__init__.py", line 127, in import_module
    "mmf.datasets." + folder_name + "." + dataset_name + "." + module_name
  File "/private/home/ronghanghu/.conda/envs/dev2/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/conceptual_captions/__init__.py", line 9, in <module>
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/conceptual_captions/__init__.py", line 9, in <module>
    from .builder import ConceptualCaptionsBuilder
  File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/conceptual_captions/builder.py", line 4, in <module>
    from .builder import ConceptualCaptionsBuilder
  File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/conceptual_captions/builder.py", line 4, in <module>
    from mmf.datasets.builders.coco import COCOBuilder
  File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/coco/__init__.py", line 4, in <module>
    from mmf.datasets.builders.coco import COCOBuilder
  File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/coco/__init__.py", line 4, in <module>
    from .builder import COCOBuilder
  File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/coco/builder.py", line 10, in <module>
    from .builder import COCOBuilder
  File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/coco/builder.py", line 10, in <module>
    from mmf.datasets.builders.textcaps.dataset import TextCapsDataset
  File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/textcaps/dataset.py", line 2, in <module>
    from mmf.datasets.builders.textcaps.dataset import TextCapsDataset
  File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/textcaps/dataset.py", line 2, in <module>
    from mmf.datasets.builders.textvqa.dataset import TextVQADataset
  File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/textvqa/dataset.py", line 230
    blablabla*&^%$
              ^
SyntaxError: invalid syntax
    from mmf.datasets.builders.textvqa.dataset import TextVQADataset
  File "/private/home/ronghanghu/workspace/mmf/mmf/datasets/builders/textvqa/dataset.py", line 230
    blablabla*&^%$
              ^
SyntaxError: invalid syntax

apsdehal commented 4 years ago

I have looked into this, and it seems there is a limit to how much we can fix. In the case of the DataLoader, spawned worker processes will always reload the code. So if something is broken at the top level (import level, i.e., something that throws an error as soon as the module is imported), you will always see an error, which is more of a multiprocessing issue than an MMF one. #374 aims to fix the issue on the MMF side, where we import all of the files.
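To illustrate why a spawned worker picks up edits, here is a minimal sketch that does not use MMF or PyTorch. It assumes only that the "spawn" start method launches a fresh Python interpreter, which is simulated here with subprocess. The module name toy_processor and the function demo_reimport are hypothetical and exist only for this example.

```python
import importlib
import os
import subprocess
import sys
import tempfile


def demo_reimport():
    """Return (value seen by the parent, value seen by a fresh interpreter)."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical module standing in for a dataset or processor file.
        module_path = os.path.join(tmp, "toy_processor.py")
        with open(module_path, "w") as f:
            f.write("VERSION = 'original'\n")

        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            # The "running job" imports the module once and keeps it in memory.
            toy = importlib.import_module("toy_processor")

            # Simulate editing the source file while the job is running.
            with open(module_path, "w") as f:
                f.write("VERSION = 'edited'\n")

            # A fresh interpreter (as with spawned workers) reads from disk again.
            env = dict(os.environ, PYTHONPATH=tmp)
            child = subprocess.run(
                [sys.executable, "-c",
                 "import toy_processor; print(toy_processor.VERSION)"],
                cwd=tmp, env=env, capture_output=True, text=True, check=True,
            )
            return toy.VERSION, child.stdout.strip()
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("toy_processor", None)


if __name__ == "__main__":
    print(demo_reimport())
```

The parent keeps the version it imported at startup, while the new interpreter sees the edited file. This matches the behavior described above, where the error only appears once workers are re-created.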

The general recommendation is to keep two copies of the repo, one for development and one for running jobs, and keep them in sync via GitHub or some other mechanism. Alternatively, it would make sense to copy the files before running the job. It might be worth looking into a dumbo kind of setup for this.
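The "copy the files before running the job" option could be sketched roughly as follows. This is not part of MMF; snapshot_code is a hypothetical helper, and the ignore patterns are assumptions about what is safe to skip.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot_code(src_root, snapshot_parent=None):
    """Copy src_root into a fresh, uniquely named directory and return the copy.

    snapshot_parent should live outside src_root; it defaults to the system
    temporary directory.
    """
    src_root = Path(src_root).resolve()
    parent = Path(snapshot_parent) if snapshot_parent else Path(tempfile.gettempdir())
    parent.mkdir(parents=True, exist_ok=True)

    # mkdtemp guarantees a unique directory, so concurrent jobs do not collide.
    unique_dir = Path(tempfile.mkdtemp(prefix=src_root.name + "_snapshot_", dir=parent))
    dst = unique_dir / src_root.name

    # Skip version-control data and bytecode caches (assumed not needed at run time).
    shutil.copytree(
        src_root, dst,
        ignore=shutil.ignore_patterns(".git", "__pycache__", "*.pyc"),
    )
    return dst
```

A job would then be launched from the returned directory, so later edits in the development copy are never seen by re-spawned workers. Note that this only helps if the job actually imports from the snapshot. With an editable install that points at the development copy, the import path would also need to point at the snapshot, for example by running from the snapshot directory or by setting PYTHONPATH.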
