Hi, do you run into any issues if you simply run the commands in https://github.com/Tobias-Fischer/rt_gene/blob/master/rt_gene/README.md? We don't use any particularly fancy techniques, and RT-GENE should run with both TensorFlow 1 and 2, as well as any reasonably recent version of PyTorch.
I am trying to train a PyTorch model using the following command:
python3 ./rt_gene/rt_bene_model_training/pytorch/train_model.py \
--gpu 3 \
--hdf5_file ./rt-bene/rtbene_dataset.hdf5 \
--save_dir ./<some-pytorch-models-dir> \
--k_fold_validation
The rt-bene directory contains all RT-BENE dataset subjects, and rtbene_dataset.hdf5 was generated as described in the README. Package versions:
h5py==2.10.0
pytorch-lightning==1.2.9
torch==1.8.0
I get the following error log:
Global seed set to 0
Loading class weights...: 100%|################################################################################| 4/4 [02:31<00:00, 37.85s/it]
/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:68: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. `Trainer(accelerator="dp"|"ddp"|"ddp2")`. Setting `accelerator="ddp_spawn"` for you.
warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
Global seed set to 0
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/3
Global seed set to 0
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/3
Global seed set to 0
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/3
| Name | Type | Params
---------------------------------------------------------------
0 | _model | BlinkEstimationModelDenseNet121 | 15.0 M
1 | _criterion | BCEWithLogitsLoss | 0
---------------------------------------------------------------
15.0 M Trainable params
0 Non-trainable params
15.0 M Total params
59.833 Total estimated model params size (MB)
/home/zidan/.local/lib/python3.6/site-packages/torchvision/transforms/transforms.py:281: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
"Argument interpolation should be of type InterpolationMode instead of int. "
/home/zidan/.local/lib/python3.6/site-packages/torchvision/transforms/transforms.py:281: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
"Argument interpolation should be of type InterpolationMode instead of int. "
/home/zidan/.local/lib/python3.6/site-packages/torchvision/transforms/transforms.py:281: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
"Argument interpolation should be of type InterpolationMode instead of int. "
Loading (valid) subject metadata...: 100%|#####################################################################| 4/4 [02:02<00:00, 30.67s/it]
Loading (valid) subject metadata...: 100%|#####################################################################| 4/4 [02:02<00:00, 30.67s/it]
/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:68: UserWarning: Dataloader(num_workers>0) and ddp_spawn do not mix well! Your performance might suffer dramatically. Please consider setting accelerator=ddp to use num_workers > 0 (this is a bottleneck of Python .spawn() and PyTorch
warnings.warn(*args, **kwargs)
Validation sanity check: 0%| | 0/2 [00:00<?, ?it/s]Traceback (most recent call last):
File "./rt_gene/rt_bene_model_training/pytorch/train_model.py", line 197, in <module>
trainer.fit(_model)
File "/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
self.dispatch()
File "/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
self.accelerator.start_training(self)
File "/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 107, in start_training
mp.spawn(self.new_process, **self.mp_spawn_kwargs)
File "/home/zidan/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/zidan/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/zidan/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/zidan/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 160, in new_process
results = trainer.train_or_test_or_predict()
File "/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 556, in train_or_test_or_predict
results = self.run_train()
File "/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 607, in run_train
self.run_sanity_check(self.lightning_module)
File "/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 864, in run_sanity_check
_, eval_results = self.run_evaluation(max_batches=self.num_sanity_val_batches)
File "/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 713, in run_evaluation
for batch_idx, batch in enumerate(dataloader):
File "/home/zidan/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 359, in __iter__
return self._get_iterator()
File "/home/zidan/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 305, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/home/zidan/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 918, in __init__
w.start()
File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/usr/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
File "/home/zidan/.local/lib/python3.6/site-packages/h5py/_hl/base.py", line 308, in __getnewargs__
raise TypeError("h5py objects cannot be pickled")
TypeError: h5py objects cannot be pickled
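For context, a "h5py objects cannot be pickled" error of this kind typically comes from a Dataset that holds an open h5py.File handle, which cannot be serialized to spawn-started DataLoader workers. Below is a minimal sketch of the usual workaround, assuming the dataset keeps such a handle (the class and attribute names are hypothetical, not the actual rt_bene_model_training code): store only the file path and open the file lazily inside each worker.

import h5py
from torch.utils.data import Dataset

class LazyH5Dataset(Dataset):
    """Illustrative dataset that defers opening the HDF5 file until first
    access, so the object stays picklable for spawn-based workers."""

    def __init__(self, h5_path, keys):
        self._h5_path = h5_path  # a plain path pickles fine; an open File does not
        self._keys = keys
        self._h5_file = None     # opened on first __getitem__ in each worker

    def __len__(self):
        return len(self._keys)

    def __getitem__(self, index):
        if self._h5_file is None:
            self._h5_file = h5py.File(self._h5_path, "r")
        return self._h5_file[self._keys[index]][()]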
Using pytorch-lightning >= 1.3.0 instead results in the following:
Global seed set to 0
Loading class weights...: 100%|################################################################################| 4/4 [01:24<00:00, 21.03s/it]
Traceback (most recent call last):
File "./rt_gene/rt_bene_model_training/pytorch/train_model.py", line 181, in <module>
class_weights=_class_weights)
File "./rt_gene/rt_bene_model_training/pytorch/train_model.py", line 43, in __init__
self.hparams = hparams
File "/home/zidan/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 826, in __setattr__
object.__setattr__(self, name, value)
AttributeError: can't set attribute
So for pytorch-lightning >= 1.3.0 it seems the way hyper-parameters are set has changed, which needs some updating in the code. I'm not sure what the issue with pytorch-lightning < 1.3.0 is though - any clue @ahmed-alhindawi?
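For illustration, a minimal sketch of that API change (the module below is hypothetical and heavily trimmed; in pytorch-lightning >= 1.3.0, hparams became a read-only property and save_hyperparameters is the supported way to store them):

import pytorch_lightning as pl

class BlinkEstimationModule(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        # pytorch-lightning < 1.3.0 allowed direct assignment:
        #     self.hparams = hparams   # raises "can't set attribute" on >= 1.3.0
        # On >= 1.3.0, hparams is read-only, so save them instead:
        self.save_hyperparameters(hparams)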
You're quite right - the newer versions of pytorch-lightning have a save_hyperparameters function that we should use. I've cleaned up the code and made a pull request.
Hi, can you add a requirements.txt with pinned versions, or mention them in the README? Best