Closed: feefee20 closed this issue 4 years ago

Hello,
Thanks for making this tool! While testing Tutorial 1 with your Docker image, I got the following error at step 7. I checked that all paths are set properly. Any idea or hint for fixing this error? Thanks!
Hi @wookyung,
Looks like it's a typo in our tutorial. Good catch. Can you change --train_files to --files_train? It should work fine. I will create a PR to fix this immediately. Thank you.
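For example, the training-file argument in the step 7 command would look like this after the rename (file names here are simply the ones used elsewhere in this thread; the matching flag for the validation files is sorted out further down):
$atacworks/scripts/main.py train --config train_config.yaml --config_mparams model_structure.yaml --files_train Mono.50.2400.train.h5 ...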
Thanks for trying to fix it, but I got the same error with --files_train/--files_val. Was the fix applied to the Docker version, too? Thanks!
$atacworks/scripts/main.py train --config train_config.yaml --config_mparams model_structure.yaml --files_train Mono.50.2400.train.h5 --files_val Mono.50.2400.val.h5
Traceback (most recent call last):
File "/AtacWorks/scripts/main.py", line 443, in <module>
main()
File "/AtacWorks/scripts/main.py", line 332, in main
args.val_files = gather_files_from_cmdline(args.val_files)
File "/usr/local/lib/python3.6/dist-packages/claragenomics/dl4atac/utils.py", line 234, in gather_files_from_cmdline
raise Exception("Invalid format for file paths provided.")
Exception: Invalid format for file paths provided.
Hi @ntadimeti,
I tried it again with --files_train and --val_files (my GPU setting: bsub -q foo-condo -gpu "num=4:gmodel=TeslaV100_SXM2_32GB" -a 'docker(alpine)' /bin/true) and got the more complicated error below:
$atacworks/scripts/main.py train --config train_config.yaml --config_mparams model_structure.yaml --files_train Mono.50.2400.train.h5 --val_files Mono.50.2400.val.h5
INFO:2020-05-11 19:29:04,996:AtacWorks-main] Running on GPU: 0
Building model: resnet ...
Finished building.
Saving config file to ./trained_models_2020.05.11_19.29/configs/model_structure.yaml...
Num_batches 500; rank 0, gpu 0
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 724, in _try_get_data
data = self.data_queue.get(timeout=timeout)
File "/usr/lib/python3.6/queue.py", line 173, in get
self.not_empty.wait(remaining)
File "/usr/lib/python3.6/threading.py", line 299, in wait
gotit = waiter.acquire(True, timeout)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 102) is killed by signal: Bus error.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/AtacWorks/scripts/main.py", line 443, in <module>
main()
File "/AtacWorks/scripts/main.py", line 359, in main
train_worker(args.gpu, ngpus_per_node, args, timers=Timers)
File "/AtacWorks/scripts/worker.py", line 221, in train_worker
transform=args.transform)
File "/usr/local/lib/python3.6/dist-packages/claragenomics/dl4atac/train.py", line 59, in train
for i, batch in enumerate(train_loader):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 804, in __next__
idx, data = self._get_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 761, in _get_data
success, data = self._try_get_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 737, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 102) exited unexpectedly
@wookyung Looks like PyTorch needs extra shared memory allocated in the Docker container. Here's some documentation on setting the shared memory size: https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#setincshmem.
You can use --shm-size 1GB or slightly higher when running the Docker container and you should be fine.
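For example, a minimal sketch of starting the container with a larger shared-memory segment (the image name and data mount are placeholders, --gpus all assumes a recent Docker with the NVIDIA container toolkit, and the --ulimit values are the ones the linked NVIDIA guide pairs with --shm-size):
docker run --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it -v /path/to/data:/data <atacworks-image> /bin/bash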
@wookyung closing this issue as the main problem has been resolved. Feel free to open new issues if you have problems in the future. Thanks for trying out AtacWorks.