Closed tianxiangli1924 closed 2 years ago
BrokenPipeError: [Errno 32] Broken pipe. I guess this is related to the default setting of num_workers?
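For context on the num_workers guess: on Windows, worker processes are started with the spawn method, and a PyTorch DataLoader with num_workers=0 loads data in the main process instead of in workers. The sketch below picks a conservative worker count per platform. `pick_num_workers` is a hypothetical helper, not part of fastMRI, and nothing in this thread confirms that it resolves the error; it is only a way to rule worker processes in or out.

```python
import platform


def pick_num_workers(requested, system=None):
    """Return a worker count to hand to a data loader.

    Hypothetical helper (not from fastMRI): falls back to 0 workers on
    Windows so that data loading happens in the main process.
    """
    # Allow the platform name to be injected so the logic is easy to test.
    system = platform.system() if system is None else system
    if system == "Windows":
        return 0
    # Elsewhere, keep the requested value but never go below zero.
    return max(0, int(requested))


if __name__ == "__main__":
    print(f"num_workers to use on this machine: {pick_num_workers(4)}")
```

If the error disappears with zero workers, the problem is likely in how worker processes are started rather than in the model code.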
Hello @tianxiangli1924, we need more information to help you debug. Do you have a full stack trace?
Hi mmuckley, sorry, I accidentally closed the issue because I am not yet familiar with GitHub. Here is my comment: I am a CS student who wants to reproduce the U-Net task.
Regarding my environment:
torchvision==0.11.3 cu102
torch==1.10.2 cuda10.2_cudnn7_0
runstats==2.0.0
pytorch_lightning==1.4.7
h5py==3.1.0
PyYAML==6.0
torchmetrics==0.7.2
I downloaded the GitHub zip file to my local computer, then downloaded the knee_singlecoil data from NYU fastMRI and saved it locally at 'H:/dataset'.
To reproduce training, the command I used was:
python train_unet_demo.py --challenge singlecoil --data_path 'H:\dataset' --mask_type random
(FastMri) PS F:\机器学习总文件夹 台式机\OMSCS_第一学期课程\DL\project\fastMRI-main\fastmri_examples\unet> python train_unet_demo.py --challenge singlecoil --data_path 'H:\dataset' --mask_type random
Global seed set to 42
H:\anaconda3\envs\FastMRI\lib\site-packages\deprecate\deprecation.py:115: LightningDeprecationWarning: The Metric was deprecated since v1.3.0 in favor of torchmetrics.metric.Metric. It will be removed in v1.5.0.
  stream(template_mgs % msg_args)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Global seed set to 42
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
distributed_backend=gloo
All DDP processes registered. Starting ddp with 1 processes
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
0 | NMSE | DistributedMetricSum | 0
1 | SSIM | DistributedMetricSum | 0
2 | PSNR | DistributedMetricSum | 0
3 | ValLoss | DistributedMetricSum | 0
4 | TotExamples | DistributedMetricSum | 0
5 | TotSliceExamples | DistributedMetricSum | 0
6 | unet | Unet | 7.8 M
7.8 M Trainable params
0 Non-trainable params
7.8 M Total params
31.024 Total estimated model params size (MB)
Validation sanity check: 0%| | 0/2 [00:00<?, ?it/s]
Validation sanity check: 50%|████████████████████████████████████████████████████████████████████████████████████▌ | 1/2 [00:17<00:17, 17.08s/it]
Traceback (most recent call last):
  File "train_unet_demo.py", line 192, in <module>
    run_cli()
  File "train_unet_demo.py", line 188, in run_cli
    cli_main(args)
  File "train_unet_demo.py", line 74, in cli_main
    trainer.fit(model, datamodule=data_module)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 552, in fit
    self._run(model)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 917, in _run
    self._dispatch()
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 985, in _dispatch
    self.accelerator.start_training(self)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 995, in run_stage
    return self._run_train()
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1030, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1114, in _run_sanity_check
    self._evaluation_loop.run()
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\loops\base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 110, in advance
    dataloader_iter, self.current_dataloader_idx, dl_max_batches, self.num_dataloaders
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\loops\base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\loops\epoch\evaluation_epoch_loop.py", line 126, in advance
    output = recursive_detach(output, to_cpu=self.trainer.move_metrics_to_cpu)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\utilities\memory.py", line 44, in recursive_detach
    return apply_to_collection(in_dict, torch.Tensor, detach_and_move, to_cpu=to_cpu)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\utilities\apply_func.py", line 105, in apply_to_collection
    v, dtype, function, *args, wrong_dtype=wrong_dtype, include_none=include_none, **kwargs
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\utilities\apply_func.py", line 109, in apply_to_collection
    return elem_type(OrderedDict(out))
TypeError: first argument must be callable or None
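A note on the final frame of this trace: it calls `elem_type(OrderedDict(out))` and fails with `TypeError: first argument must be callable or None`. That is the message `collections.defaultdict` raises when its first positional argument is not a factory, so one plausible reading is that a `defaultdict` was among the step outputs being rebuilt. This is an assumption; the thread never confirms it. The sketch below reproduces the failure mode without any library code. `rebuild_mapping` and `rebuild_mapping_safely` are hypothetical stand-ins, not Lightning functions.

```python
from collections import OrderedDict, defaultdict


def rebuild_mapping(data):
    """Hypothetical stand-in for the failing frame: rebuild a mapping as its own type."""
    elem_type = type(data)
    out = [(key, value) for key, value in data.items()]
    # Fine for a plain dict. For a defaultdict, the OrderedDict lands in the
    # slot reserved for the default factory, which must be callable or None.
    return elem_type(OrderedDict(out))


def rebuild_mapping_safely(data):
    """Same idea, but forwards the default factory when given a defaultdict."""
    out = OrderedDict(data.items())
    if isinstance(data, defaultdict):
        return defaultdict(data.default_factory, out)
    return type(data)(out)


if __name__ == "__main__":
    counts = defaultdict(int, {"slice_0": 1})
    try:
        rebuild_mapping(counts)
    except TypeError as err:
        print(f"TypeError: {err}")
    print(rebuild_mapping_safely(counts))
```

If this reading is right, the error depends on the combination of installed package versions rather than on the data or the command line, which would be consistent with it not reproducing in a different environment.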
I also couldn't run the pretrained model. I didn't touch the code; I simply wanted to run it first and, if it works, start learning from it. The command I used was:
python run_pretrained_unet_inference.py --data_path 'H:\dataset' --output_path 'H:\dataset' --challenge unet_knee_sc
Then:
Traceback (most recent call last):
File "run_pretrained_unet_inference.py", line 164, in
Hello @tianxiangli1924, I can't reproduce your first error. The code with your commands runs okay for me. My versions are:
Linux 5.4.0-81-generic x86_64
torchvision 0.10.0
torch 1.9.0
runstats 2.0.0
pytorch_lightning 1.4.7
h5py 2.10.0
PyYAML 5.4.1
torchmetrics 0.5.1
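Since the two environments in this thread differ in several package versions, a small report script can make such comparisons easier to paste into an issue. The following is a minimal sketch using only the standard library (Python 3.8+). The package names are the ones quoted in this thread; `collect_versions` is a hypothetical helper, and a distribution name may need adjusting if it is registered differently on your system.

```python
import platform
from importlib import metadata

# Package names as quoted in this thread.
PACKAGES = (
    "torch",
    "torchvision",
    "pytorch_lightning",
    "torchmetrics",
    "h5py",
    "runstats",
    "PyYAML",
)


def collect_versions(names=PACKAGES):
    """Map each package name to its installed version, or 'not installed'."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    print(platform.platform())
    print(f"python {platform.python_version()}")
    for name, version in collect_versions().items():
        print(f"{name} {version}")
```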
As for the second command, you specified the data path incorrectly. I think for this script you need to point to a specific split, like singlecoil_test.
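A quick way to check which folder to pass is to list the sub-directories that actually contain data files. The sketch below assumes the dataset root holds one folder per split (such as singlecoil_test) with `.h5` files directly inside; both the layout and the extension are assumptions, and `find_split_dirs` is a hypothetical helper, not part of fastMRI.

```python
import sys
from pathlib import Path


def find_split_dirs(root, pattern="*.h5"):
    """Return sub-directories of root that directly contain files matching pattern."""
    root = Path(root)
    return sorted(
        child
        for child in root.iterdir()
        if child.is_dir() and any(child.glob(pattern))
    )


if __name__ == "__main__":
    # Pass the dataset root as the first argument; defaults to the current directory.
    dataset_root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    splits = find_split_dirs(dataset_root)
    if splits:
        for split in splits:
            print(f"candidate data path: {split}")
    else:
        print(f"no sub-directory of {dataset_root} contains .h5 files")
```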
I am running into the same problem when training with 8 A100 80GB GPUs:
python3 train_unet_demo.py --challenge singlecoil --data_path '/workspace/fastMRI' --mask_type random
Environment:
Error after validation check
Metric was deprecated since v1.3.0 in favor of torchmetrics.metric.Metric. It will be removed in v1.5.0.
stream(template_mgs % msg_args)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Global seed set to 42
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
1
Global seed set to 42
Global seed set to 42
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
distributed_backend=nccl
All DDP processes registered. Starting ddp with 2 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
0 | NMSE | DistributedMetricSum | 0
1 | SSIM | DistributedMetricSum | 0
2 | PSNR | DistributedMetricSum | 0
3 | ValLoss | DistributedMetricSum | 0
4 | TotExamples | DistributedMetricSum | 0
5 | TotSliceExamples | DistributedMetricSum | 0
6 | unet | Unet | 7.8 M
----------------------------------------------------------
7.8 M Trainable params
0 Non-trainable params
7.8 M Total params
31.024 Total estimated model params size (MB)
Validation sanity check: 50%|██████████ | 1/2 [00:16<00:16, 16.91s/it]
Traceback (most recent call last):
File "train_unet_demo.py", line 191, in
Hello @SupermicroML, I installed your packages with some edits:
and I was able to run your code without errors (i.e., I could not reproduce the problem). Based on the stack trace, it looks like the launch script may have been modified. Are you using the launch script exactly as it is in main? We can't help debug any modifications that you've made.
Hello @mmuckley
You were correct. I was able to run both training and inference successfully after I installed all packages and restarted my Jupyter notebook.
Thanks for the help!
Hello @tianxiangli1924, any progress on your issue from my last comment?
Closing due to inactivity. Reopen if needed.