Tobias-Fischer / rt_gene

RT-GENE: Real-Time Eye Gaze and Blink Estimation in Natural Environments
http://www.imperial.ac.uk/personal-robotics

requirements versions #108

Closed · SohilZidan closed this issue 3 years ago

SohilZidan commented 3 years ago

Hi, can you add a requirements.txt with pinned versions, or mention them in the README? Best

Tobias-Fischer commented 3 years ago

Hi, do you run into any issues if you simply run the commands in https://github.com/Tobias-Fischer/rt_gene/blob/master/rt_gene/README.md? We don't use any uber-fancy techniques, and RT-GENE should run with both TensorFlow 1 and 2 as well as any recent-ish version of PyTorch.
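
For reference, a pinned requirements.txt matching the versions the reporter lists below might look like the following. These are one user's working versions rather than an official pin, and the torchvision pin is an assumption based on its usual pairing with torch 1.8.0:

h5py==2.10.0
pytorch-lightning==1.2.9
torch==1.8.0
torchvision==0.9.0  # assumed: the torchvision release paired with torch 1.8.0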

SohilZidan commented 3 years ago

I am trying to train a PyTorch model using the following command:

python3 ./rt_gene/rt_bene_model_training/pytorch/train_model.py \
--gpu 3 \
--hdf5_file ./rt-bene/rtbene_dataset.hdf5 \
--save_dir ./<some-pytorch-models-dir> \
--k_fold_validation

The rt-bene directory contains all RT-BENE dataset subjects, and rtbene_dataset.hdf5 was generated as described in the README. Package versions:

h5py==2.10.0
pytorch-lightning==1.2.9
torch==1.8.0

I get the following error log:

Global seed set to 0
Loading class weights...: 100%|################################################################################| 4/4 [02:31<00:00, 37.85s/it]
/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:68: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. `Trainer(accelerator="dp"|"ddp"|"ddp2")`. Setting `accelerator="ddp_spawn"` for you.
  warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
Global seed set to 0
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/3
Global seed set to 0
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/3
Global seed set to 0
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/3

  | Name       | Type                            | Params
---------------------------------------------------------------
0 | _model     | BlinkEstimationModelDenseNet121 | 15.0 M
1 | _criterion | BCEWithLogitsLoss               | 0     
---------------------------------------------------------------
15.0 M    Trainable params
0         Non-trainable params
15.0 M    Total params
59.833    Total estimated model params size (MB)
/home/zidan/.local/lib/python3.6/site-packages/torchvision/transforms/transforms.py:281: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
  "Argument interpolation should be of type InterpolationMode instead of int. "
/home/zidan/.local/lib/python3.6/site-packages/torchvision/transforms/transforms.py:281: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
  "Argument interpolation should be of type InterpolationMode instead of int. "
/home/zidan/.local/lib/python3.6/site-packages/torchvision/transforms/transforms.py:281: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
  "Argument interpolation should be of type InterpolationMode instead of int. "
Loading (valid) subject metadata...: 100%|#####################################################################| 4/4 [02:02<00:00, 30.67s/it]
Loading (valid) subject metadata...: 100%|#####################################################################| 4/4 [02:02<00:00, 30.67s/it]

/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:68: UserWarning: Dataloader(num_workers>0) and ddp_spawn do not mix well! Your performance might suffer dramatically. Please consider setting accelerator=ddp to use num_workers > 0 (this is a bottleneck of Python .spawn() and PyTorch
  warnings.warn(*args, **kwargs)
Validation sanity check:   0%|          | 0/2 [00:00<?, ?it/s]Traceback (most recent call last):
  File "./rt_gene/rt_bene_model_training/pytorch/train_model.py", line 197, in <module>
    trainer.fit(_model)
  File "/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 107, in start_training
    mp.spawn(self.new_process, **self.mp_spawn_kwargs)
  File "/home/zidan/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/zidan/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/zidan/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/zidan/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 160, in new_process
    results = trainer.train_or_test_or_predict()
  File "/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 556, in train_or_test_or_predict
    results = self.run_train()
  File "/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 607, in run_train
    self.run_sanity_check(self.lightning_module)
  File "/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 864, in run_sanity_check
    _, eval_results = self.run_evaluation(max_batches=self.num_sanity_val_batches)
  File "/home/zidan/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 713, in run_evaluation
    for batch_idx, batch in enumerate(dataloader):
  File "/home/zidan/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 359, in __iter__
    return self._get_iterator()
  File "/home/zidan/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 305, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/zidan/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 918, in __init__
    w.start()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/home/zidan/.local/lib/python3.6/site-packages/h5py/_hl/base.py", line 308, in __getnewargs__
    raise TypeError("h5py objects cannot be pickled")
TypeError: h5py objects cannot be pickled
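
This pickling failure typically means the Dataset opens an h5py.File in __init__ and keeps the handle as an attribute: ddp_spawn starts the DataLoader workers with the spawn method, which must pickle the Dataset, and open h5py objects cannot be pickled. A common workaround, sketched below with hypothetical names (H5Dataset, an "images" dataset key — not the rt_gene code), is to open the file lazily so each worker creates its own handle:

import h5py
import torch
from torch.utils.data import Dataset

class H5Dataset(Dataset):
    """Sketch: defer opening the HDF5 file until first access so the
    Dataset object itself stays picklable for spawned workers."""

    def __init__(self, hdf5_path):
        self._path = hdf5_path
        self._file = None  # opened lazily; never pickled
        with h5py.File(hdf5_path, "r") as f:  # open briefly just to read the length
            self._length = len(f["images"])

    def __len__(self):
        return self._length

    def __getitem__(self, idx):
        if self._file is None:  # first access in this process/worker
            self._file = h5py.File(self._path, "r")
        return torch.from_numpy(self._file["images"][idx])

Alternatively, passing accelerator="ddp" to the Trainer instead of the auto-selected ddp_spawn avoids pickling the Dataset altogether, as the warning above already hints.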

Using pytorch-lightning>=1.3.0 results in the following instead:

Global seed set to 0
Loading class weights...: 100%|################################################################################| 4/4 [01:24<00:00, 21.03s/it]
Traceback (most recent call last):
  File "./rt_gene/rt_bene_model_training/pytorch/train_model.py", line 181, in <module>
    class_weights=_class_weights)
  File "./rt_gene/rt_bene_model_training/pytorch/train_model.py", line 43, in __init__
    self.hparams = hparams
  File "/home/zidan/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 826, in __setattr__
    object.__setattr__(self, name, value)
AttributeError: can't set attribute

Tobias-Fischer commented 3 years ago

So for pytorch-lightning >= 1.3.0 it seems like the way that hyper-parameters are set has changed... this needs some updating in the code. I'm not sure what the issue with pytorch-lightning < 1.3.0 is, though - any clue @ahmed-alhindawi?

ahmed-alhindawi commented 3 years ago

You're quite right - newer versions of pytorch-lightning have a save_hyperparameters function that we should use instead of assigning self.hparams directly - I've cleaned up the code and made a pull request.
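
For reference, the change looks roughly like this (a sketch with illustrative names, not the actual rt_gene code):

import pytorch_lightning as pl

class BlinkModel(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        # pytorch-lightning < 1.3 allowed assigning the attribute directly:
        #   self.hparams = hparams  # raises AttributeError on >= 1.3,
        #                           # where hparams is a read-only property
        # pytorch-lightning >= 1.3 registers hyper-parameters instead:
        self.save_hyperparameters(hparams)
        # values remain accessible as before, e.g. self.hparams.batch_size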