I am trying to repeat the Unet but have [Errno 32] Broken pipe issue

tianxiangli1924 commented 2 years ago

Please:

[x] Check for duplicate issues.
[x] Provide a simple example for how to reproduce the bug.
[x] If applicable, include full error messages/tracebacks.

tianxiangli1924 commented 2 years ago

BrokenPipeError: [Errno 32] Broken pipe

I guess this is related to the default setting of num_workers?
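On Windows, data-loading worker processes are spawned rather than forked, and a Broken pipe error often means a worker exited before the parent finished sending it data. A common first debugging step is to drop to zero workers and see whether the error persists. The sketch below illustrates that idea only; `pick_num_workers` is a hypothetical helper, not part of fastMRI, and its result would be passed to whatever `num_workers` setting the launch script exposes.

```python
import sys


def pick_num_workers(requested: int, platform: str = sys.platform) -> int:
    """Return a worker count to try while debugging a Broken pipe error.

    Hypothetical helper: on Windows it falls back to 0 so data loading runs
    in the main process; on other platforms the requested value is kept.
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    if platform.startswith("win"):
        return 0
    return requested


if __name__ == "__main__":
    # The __main__ guard matters on Windows, where spawned workers re-import this module.
    print(pick_num_workers(4))
```

If the error disappears with zero workers, the problem is likely in worker startup rather than in the model or the data itself.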

mmuckley commented 2 years ago

Hello @tianxiangli1924, we need more information to help you debug. Do you have a full stack trace?

tianxiangli1924 commented 2 years ago

> Hello @tianxiangli1924, we need more information to help you debug. Do you have a full stack trace?

Hi mmuckley, sorry, I accidentally closed the issue because I am not familiar with GitHub yet. Here is my comment: I am a CS student who wants to reproduce the U-Net task.

Regarding my environment:

torchvision==0.11.3 cu102
torch==1.10.2 cuda10.2_cudnn7_0
runstats==2.0.0
pytorch_lightning==1.4.7
h5py==3.1.0
PyYAML==6.0
torchmetrics==0.7.2

I downloaded the GitHub zip file to my local computer, then downloaded the knee_singlecoil data from NYU fastMRI and saved it locally at 'H:/dataset'. a9403174142dff3c822c54fe81924c1

To repeat the training, the command I used was:

python train_unet_demo.py --challenge singlecoil --data_path 'H:\dataset' --mask_type random

(FastMri) PS F:\机器学习总文件夹台式机\OMSCS_第一学期课程\DL\project\fastMRI-main\fastmri_examples\unet> python train_unet_demo.py --challenge singlecoil --data_path 'H:\dataset' --mask_type random

Global seed set to 42
H:\anaconda3\envs\FastMRI\lib\site-packages\deprecate\deprecation.py:115: LightningDeprecationWarning: The Metric was deprecated since v1.3.0 in favor of torchmetrics.metric.Metric. It will be removed in v1.5.0.
  stream(template_mgs % msg_args)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Global seed set to 42
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
distributed_backend=gloo
All DDP processes registered. Starting ddp with 1 processes
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

Validation sanity check: 50%|████████████████████████████████████████████████████████████████████████████████████▌ | 1/2 [00:17<00:17, 17.08s/it]
Traceback (most recent call last):
  File "train_unet_demo.py", line 192, in <module>
    run_cli()
  File "train_unet_demo.py", line 188, in run_cli
    cli_main(args)
  File "train_unet_demo.py", line 74, in cli_main
    trainer.fit(model, datamodule=data_module)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 552, in fit
    self._run(model)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 917, in _run
    self._dispatch()
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 985, in _dispatch
    self.accelerator.start_training(self)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 995, in run_stage
    return self._run_train()
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1030, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1114, in _run_sanity_check
    self._evaluation_loop.run()
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\loops\base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 110, in advance
    dataloader_iter, self.current_dataloader_idx, dl_max_batches, self.num_dataloaders
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\loops\base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\loops\epoch\evaluation_epoch_loop.py", line 126, in advance
    output = recursive_detach(output, to_cpu=self.trainer.move_metrics_to_cpu)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\utilities\memory.py", line 44, in recursive_detach
    return apply_to_collection(in_dict, torch.Tensor, detach_and_move, to_cpu=to_cpu)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\utilities\apply_func.py", line 105, in apply_to_collection
    v, dtype, function, *args, wrong_dtype=wrong_dtype, include_none=include_none, **kwargs
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\pytorch_lightning\utilities\apply_func.py", line 109, in apply_to_collection
    return elem_type(OrderedDict(out))
TypeError: first argument must be callable or None

tianxiangli1924 commented 2 years ago

> Hello @tianxiangli1924, we need more information to help you debug. Do you have a full stack trace?

I also couldn't run the pretrained model. I didn't touch the code; I simply want to run it first and, if it works, start learning from it. The command I used was:

python run_pretrained_unet_inference.py --data_path 'H:\dataset' --output_path 'H:\dataset' --challenge unet_knee_sc

Then:

Traceback (most recent call last):
  File "run_pretrained_unet_inference.py", line 164, in <module>
    torch.device(args.device),
  File "run_pretrained_unet_inference.py", line 91, in run_inference
    challenge="singlecoil",
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\fastmri\data\mri_data.py", line 268, in __init__
    metadata, num_slices = self._retrieve_metadata(fname)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\fastmri\data\mri_data.py", line 305, in _retrieve_metadata
    with h5py.File(fname, "r") as hf:
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\h5py\_hl\files.py", line 427, in __init__
    swmr=swmr)
  File "H:\anaconda3\envs\FastMRI\lib\site-packages\h5py\_hl\files.py", line 190, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py\h5f.pyx", line 96, in h5py.h5f.open
OSError: Unable to open file (unable to open file: name = 'H:\dataset\singlecoil_challenge', errno = 13, error message = 'Permission denied', flags = 0, o_flags = 0)

mmuckley commented 2 years ago

Hello @tianxiangli1924, I can't reproduce your first error. The code with your commands runs okay for me. My versions are:

Linux 5.4.0-81-generic x86_6
torchvision 0.10.0
torch 1.9.0
runstats 2.0.0
pytorch_lightning 1.4.7
h5py 2.10.0
PyYAML 5.4.1
torchmetrics 0.5.1

As for the second command, you specified the data path incorrectly - I think for this script you need to specify a specific split, like singlecoil_test.
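The `Permission denied` error in the inference traceback is consistent with h5py being handed a directory (`H:\dataset\singlecoil_challenge`) where it expected an HDF5 file, which is what happens when the data path points one level too high. A minimal sketch of a pre-flight check follows, assuming the split folder holds `.h5` files directly; `check_split_dir` is a hypothetical helper, not part of fastMRI.

```python
from pathlib import Path
from typing import List


def check_split_dir(data_path: str) -> List[Path]:
    """Return the .h5 files directly inside data_path, or raise with a hint."""
    root = Path(data_path)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    # Only regular files count; subfolders are what trip up the dataset loader.
    h5_files = sorted(p for p in root.iterdir() if p.is_file() and p.suffix == ".h5")
    if not h5_files:
        subdirs = sorted(p.name for p in root.iterdir() if p.is_dir())
        raise FileNotFoundError(
            f"no .h5 files directly under {root}; "
            f"did you mean one of its subfolders? {subdirs}"
        )
    return h5_files
```

Calling this on the value intended for `--data_path` before launching the script would surface the wrong-level path with a readable message instead of an h5py error.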

SupermicroML commented 2 years ago

I am running into the same problem when training with 8 A100 80GB GPUs.

python3 train_unet_demo.py --challenge singlecoil --data_path '/workspace/fastMRI' --mask_type random

Environment:

NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4
torch==1.10.0
pytorch_lightning==1.4.7
h5py==3.6.0
runstats==2.0.0
torchmetrics==0.7.3
PyYAML==5.4.1
pandas==1.4.1
numpy==1.21.2
skimage==0.19.1

Error after validation check

Global seed set to 42
/opt/conda/lib/python3.8/site-packages/deprecate/deprecation.py:115: LightningDeprecationWarning: The `Metric` was deprecated since v1.3.0 in favor of `torchmetrics.metric.Metric`. It will be removed in v1.5.0.
  stream(template_mgs % msg_args)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Global seed set to 42
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
1
Global seed set to 42
Global seed set to 42
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
distributed_backend=nccl
All DDP processes registered. Starting ddp with 2 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]

| Name | Type | Params

mmuckley commented 2 years ago

Hello @SupermicroML, I installed your packages with some edits:

I'm using CUDA 11.3
I couldn't find PyYAML 5.4.3 - I had to use 5.4.1

and I was able to run your code without errors (i.e., not able to reproduce). Based on the stack trace it looks like the launch script may have been modified. Are you using the launch script exactly as it is in main? We can't help debug any modifications that you've made.

SupermicroML commented 2 years ago

Hello @mmuckley

You were correct. I was able to run both training and inference successfully after I installed all packages and restarted my Jupyter notebook.

Thanks for the help!
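Since the fix here was reinstalling packages and restarting the notebook kernel, it can help to confirm which versions the running interpreter actually sees. The sketch below uses the standard library's `importlib.metadata`; `installed_version` and `compare_versions` are hypothetical helpers, and the pins are copied from the maintainer's working environment earlier in this thread. Distribution names may need adjusting (for example hyphens versus underscores) depending on the Python version.

```python
from importlib import metadata
from typing import Callable, Dict, Optional


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, str]:
    """Map each package whose installed version differs from its pin to a message."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = f"wanted {wanted}, found {found or 'not installed'}"
    return mismatches


if __name__ == "__main__":
    # Pins taken from the maintainer's reported working setup in this thread.
    pins = {"torch": "1.9.0", "pytorch_lightning": "1.4.7", "h5py": "2.10.0"}
    for name, message in compare_versions(pins).items():
        print(f"{name}: {message}")
```

Running this inside the notebook after a restart shows whether the kernel picked up the newly installed packages.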

mmuckley commented 2 years ago

Hello @tianxiangli1924, any progress on your issue from my last comment?

mmuckley commented 2 years ago

Closing due to inactivity. Reopen if needed.

facebookresearch / fastMRI

I am trying to repeat the Unet but have [Errno 32] Broken pipe issue #228
