Closed: feefee20 closed this issue 4 years ago

Hello,
Thanks for making this tool! While testing Tutorial 1 with your Docker image, I got the following error at step 7. I checked that all paths are set properly. Any idea or hint for fixing this error? Thanks!
Hi @wookyung,
Looks like it's a typo in our tutorial. Good catch. Can you change --train_files to --files_train? It should work fine. I will create a PR to fix this immediately. Thank you.
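For example, the training-file argument in the step 7 command would look like this after the rename (file names here are simply the ones used elsewhere in this thread; the matching flag for the validation files is sorted out further down):
$atacworks/scripts/main.py train --config train_config.yaml --config_mparams model_structure.yaml --files_train Mono.50.2400.train.h5 ...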
Thanks for trying to fix it, but I got the same error with --files_train/--files_val. Was the fix applied to the Docker version, too? Thanks!
$atacworks/scripts/main.py train --config train_config.yaml --config_mparams model_structure.yaml --files_train Mono.50.2400.train.h5 --files_val Mono.50.2400.val.h5
Traceback (most recent call last):
File "/AtacWorks/scripts/main.py", line 443, in <module>
main()
File "/AtacWorks/scripts/main.py", line 332, in main
args.val_files = gather_files_from_cmdline(args.val_files)
File "/usr/local/lib/python3.6/dist-packages/claragenomics/dl4atac/utils.py", line 234, in gather_files_from_cmdline
raise Exception("Invalid format for file paths provided.")
Exception: Invalid format for file paths provided.
Hi @ntadimeti,
I tried it again with --files_train and --val_files (my GPU setting: bsub -q foo-condo -gpu "num=4:gmodel=TeslaV100_SXM2_32GB" -a 'docker(alpine)' /bin/true) and got the more complicated error below:
$atacworks/scripts/main.py train --config train_config.yaml --config_mparams model_structure.yaml --files_train Mono.50.2400.train.h5 --val_files Mono.50.2400.val.h5
INFO:2020-05-11 19:29:04,996:AtacWorks-main] Running on GPU: 0
Building model: resnet ...
Finished building.
Saving config file to ./trained_models_2020.05.11_19.29/configs/model_structure.yaml...
Num_batches 500; rank 0, gpu 0
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 724, in _try_get_data
data = self.data_queue.get(timeout=timeout)
File "/usr/lib/python3.6/queue.py", line 173, in get
self.not_empty.wait(remaining)
File "/usr/lib/python3.6/threading.py", line 299, in wait
gotit = waiter.acquire(True, timeout)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 102) is killed by signal: Bus error.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/AtacWorks/scripts/main.py", line 443, in <module>
main()
File "/AtacWorks/scripts/main.py", line 359, in main
train_worker(args.gpu, ngpus_per_node, args, timers=Timers)
File "/AtacWorks/scripts/worker.py", line 221, in train_worker
transform=args.transform)
File "/usr/local/lib/python3.6/dist-packages/claragenomics/dl4atac/train.py", line 59, in train
for i, batch in enumerate(train_loader):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 804, in __next__
idx, data = self._get_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 761, in _get_data
success, data = self._try_get_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 737, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 102) exited unexpectedly
@wookyung Looks like PyTorch needs extra shared memory allocated in the Docker container. Here's some documentation on setting the shared memory size: https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#setincshmem.
You can use --shm-size 1GB or slightly higher when running the Docker container and you should be fine.
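For example, a minimal sketch of starting the container with a larger shared-memory segment (the image name and data mount are placeholders, --gpus all assumes a recent Docker with the NVIDIA container toolkit, and the --ulimit values are the ones the linked NVIDIA guide pairs with --shm-size):
docker run --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it -v /path/to/data:/data <atacworks-image> /bin/bash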
@wookyung closing this issue as the main problem has been resolved. Feel free to open new issues if you have problems in the future. Thanks for trying out AtacWorks.