facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

unpacking features get stuck / loading datasets get stuck #691

Closed cyang31 closed 3 years ago

cyang31 commented 3 years ago

Instructions To Reproduce the Issue:

Check https://stackoverflow.com/help/minimal-reproducible-example for how to ask good questions. Simplify the steps to reproduce the issue using suggestions from the above link, and provide them below:

  1. full code you wrote or full changes you made (git diff): I didn't change the code.
  2. what exact command you run: I ran the following command inside a singularity container:

     singularity exec --nv singularity.sif mmf_run config=/scratch/UserName/hateful_meme/mmf/projects/hateful_memes/configs/visual_bert/direct.yaml model=visual_bert dataset=hateful_memes env.data_dir=/scratch/UserName/hateful_meme/data training.num_workers=1 training.fast_read=True

  3. full logs you observed:

     WARNING: underlay of /usr/bin/nvidia-debugdump required more than 50 (375) bind mounts
     /usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py:252: UserWarning: Keys with dot (model.bert) are deprecated and will have different semantic meaning the next major version of OmegaConf (2.1)
     See the compact keys issue for more details: https://github.com/omry/omegaconf/issues/152
     You can disable this warning by setting the environment variable OC_DISABLE_DOT_ACCESS_WARNING=1
       warnings.warn(message=msg, category=UserWarning)
     2020-11-18T07:22:20 | mmf.utils.configuration: Overriding option config to /scratch/UserName/hateful_meme/mmf/projects/hateful_memes/configs/visual_bert/direct.yaml
     2020-11-18T07:22:20 | mmf.utils.configuration: Overriding option model to visual_bert
     2020-11-18T07:22:20 | mmf.utils.configuration: Overriding option datasets to hateful_memes
     2020-11-18T07:22:20 | mmf.utils.configuration: Overriding option env.data_dir to /scratch/UserName/hateful_meme/data
     2020-11-18T07:22:20 | mmf: Logging to: ./save/train.log
     2020-11-18T07:22:20 | mmf_cli.run: Namespace(config_override=None, local_rank=None, opts=['config=/scratch/UserName/hateful_meme/mmf/projects/hateful_memes/configs/visual_bert/direct.yaml', 'model=visual_bert', 'dataset=hateful_memes', 'env.data_dir=/scratch/UserName/hateful_meme/data'])
     2020-11-18T07:22:20 | mmf_cli.run: Torch version: 1.6.0+cu101
     2020-11-18T07:22:20 | mmf.utils.general: CUDA Device 0 is: Tesla V100-PCIE-32GB
     2020-11-18T07:22:20 | mmf_cli.run: Using seed 21259699
     2020-11-18T07:22:20 | mmf.trainers.mmf_trainer: Loading datasets
     [ Starting checksum for features.tar.gz]
     [ Checksum successful for features.tar.gz]
     Unpacking features.tar.gz

Expected behavior:

No error is raised, but unpacking the features.tar.gz file takes forever, which is unexpected. I manually downloaded and unpacked the archive locally to check whether slow unpacking was the problem, and it was not. However, when I reran the command afterwards, it bypassed the download stage but got stuck again at "mmf.trainers.mmf_trainer: Loading datasets". I waited overnight to make sure it was not just slow, but nothing changed.
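For reference, a minimal sketch of the kind of manual unpack described above, using Python's standard tarfile module. The archive and destination paths are assumptions and need to be pointed at wherever MMF downloaded the file under env.data_dir:

```python
import tarfile
import time

# Assumed paths: adjust to wherever MMF placed features.tar.gz under env.data_dir.
archive = "/scratch/UserName/hateful_meme/data/features.tar.gz"
dest = "/scratch/UserName/hateful_meme/data/manual_unpack"

start = time.time()
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(dest)  # full extraction, so the timing reflects a complete unpack
print(f"Extraction finished in {time.time() - start:.1f}s")
```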

Environment:

WARNING: underlay of /usr/bin/nvidia-debugdump required more than 50 (375) bind mounts
Collecting environment information...
PyTorch version: 1.6.0+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla K20Xm
Nvidia driver version: 418.39
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] torch==1.6.0+cu101
[pip3] torchtext==0.5.0
[pip3] torchvision==0.7.0+cu101
[conda] Could not collect

apsdehal commented 3 years ago

Hi, can you run your command as follows to see if it solves your issue:

singularity exec --nv singularity.sif CUDA_VISIBLE_DEVICES=0 mmf_run config=/scratch/UserName/hateful_meme/mmf/projects/hateful_memes/configs/visual_bert/direct.yaml model=visual_bert dataset=hateful_memes env.data_dir=/scratch/UserName/hateful_meme/data training.num_workers=0

You don't need training.fast_read; it is for something else. Specifically, note CUDA_VISIBLE_DEVICES=0 to run on a single GPU, and training.num_workers=0 to load data in the main process without extra dataloader workers.
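As a quick sanity check (not part of the suggested command) that the environment variable actually restricts the run to one GPU inside the container, a small sketch using PyTorch:

```python
import os
import torch

# CUDA_VISIBLE_DEVICES must be set before CUDA is initialized; here we assume
# it was exported in the shell that launched this script.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("Visible GPU count:", torch.cuda.device_count())  # expect 1
if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))
```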

cyang31 commented 3 years ago

Hi, I tried your suggested command, but it still gets stuck at the same place. If it is useful: I can launch the code smoothly only for the Image-Grid baseline. A similar issue always happens when it tries to unpack things, either extras.tar.gz or features.tar.gz. Those files are downloaded automatically, and during unpacking the total size of the extracted files keeps growing and then stabilizes at some point, but the main process stays stuck at "Unpacking X.tar.gz" and never moves on.
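One way to tell whether the extraction itself finished (and the hang happens afterwards) is to compare the archive's file list against what is on disk. A rough sketch; both paths are assumptions that need adjusting to the actual download and extraction locations:

```python
import tarfile
from pathlib import Path

# Assumed locations of the downloaded archive and MMF's extraction directory.
archive = "/scratch/UserName/hateful_meme/data/features.tar.gz"
extract_dir = Path("/scratch/UserName/hateful_meme/data")

with tarfile.open(archive, "r:gz") as tar:
    members = [m.name for m in tar.getmembers() if m.isfile()]

missing = [name for name in members if not (extract_dir / name).exists()]
print(f"{len(members) - len(missing)} of {len(members)} archived files found on disk")
if not missing:
    print("Archive looks fully extracted; the process is likely stuck after unpacking.")
```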

apsdehal commented 3 years ago

Something must be off in singularity because this works fine as it is. Can you try running the command outside of singularity?