Project-MONAI / tutorials

MONAI Tutorials
https://monai.io/started.html
Apache License 2.0

fork vs spawn on MacOS Python 3.9 error #625

Open sporring opened 2 years ago

sporring commented 2 years ago

Hi all,

I'm running conda-forge on macOS (M1 architecture) and testing mednist_tutorial.ipynb and other Jupyter notebooks. I get the following error:

  File "/opt/homebrew/Caskroom/miniforge/base/envs/pytorch_env/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/pytorch_env/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'MedNISTDataset' on <module '__main__' (built-in)>

This seems to be caused by the spawn start method, which became the default on macOS in Python 3.8. It is fixed by adding the argument multiprocessing_context="fork" to calls of torch.utils.data.DataLoader; however, I suggest that a deeper fix be made for non-experts.
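For illustration, a minimal sketch of choosing the start method defensively, since "fork" is not available on every platform. The helper name pick_start_method is hypothetical, and the DataLoader call in the comment is shown only as an assumed usage, not as the tutorial's code:

```python
import multiprocessing as mp


def pick_start_method(preferred="fork"):
    """Return `preferred` if this platform supports it, else the platform default.

    Hypothetical helper for illustration; not part of MONAI or PyTorch.
    """
    # get_all_start_methods() lists the supported methods, default first.
    available = mp.get_all_start_methods()
    return preferred if preferred in available else available[0]


if __name__ == "__main__":
    print(pick_start_method())
    # Assumed usage (requires torch and a dataset, so it is not executed here):
    # loader = torch.utils.data.DataLoader(
    #     train_ds, batch_size=300, shuffle=True, num_workers=10,
    #     multiprocessing_context=pick_start_method())
```

On a platform that only offers spawn, this falls back to spawn rather than raising, so the original pickling error may still appear there.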

Steps to reproduce the behavior (on macOS, conda-forge):

  1. jupyter lab mednist_tutorial.ipynb
  2. press the clear kernel and run all button

Expected behavior: A demo network should be set up and trained.

Environment

% python --version
Python 3.9.10
% python -c 'import monai; monai.config.print_debug_info()'

================================
Printing MONAI config...
================================
MONAI version: 0.9.dev2210
Numpy version: 1.22.3
Pytorch version: 1.10.2
MONAI flags: HAS_EXT = False, USE_COMPILED = False
MONAI rev id: 1a660e6a7a50e985af5ff76b559baab44175438c
MONAI __file__: /opt/homebrew/Caskroom/miniforge/base/envs/pytorch_env/lib/python3.9/site-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: NOT INSTALLED or UNKNOWN VERSION.
scikit-image version: NOT INSTALLED or UNKNOWN VERSION.
Pillow version: 9.0.1
Tensorboard version: 2.8.0
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: NOT INSTALLED or UNKNOWN VERSION.
tqdm version: 4.63.0
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: NOT INSTALLED or UNKNOWN VERSION.
pandas version: NOT INSTALLED or UNKNOWN VERSION.
einops version: 0.4.1
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: NOT INSTALLED or UNKNOWN VERSION.

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

================================
Printing system config...
================================
`psutil` required for `print_system_info`

================================
Printing GPU config...
================================
Num GPUs: 0
Has CUDA: False
cuDNN enabled: False

Thank you for a nice package, Jon

Nic-Ma commented 2 years ago

Hi @sporring ,

Thanks for the investigation and suggestion. @yiheng-wang-nv @wyli I think maybe we can pass the argument multiprocessing_context="fork" to torch.utils.data.DataLoader in some example or tutorial to document this use case. What do you think?

Thanks in advance.

wyli commented 2 years ago

I don't think we'll change the core codebase default value, so I'm converting this to a feature request to the tutorials...

Johnz86 commented 2 years ago

I am a beginner and I have the same problem. Today I tried this:

train_ds = MedNISTDataset(train_x, train_y, train_transforms)
train_loader = torch.utils.data.DataLoader(
    train_ds, batch_size=300, shuffle=True, num_workers=10, multiprocessing_context="fork")

This resulted in the following error:

ValueError: multiprocessing_context option should specify a valid start method in ['spawn'], but got multiprocessing_context='fork'

I checked the available methods and got this:

In[2]: import torch.multiprocessing as multiprocessing
multiprocessing.get_all_start_methods()
Out[3]: ['spawn']

What are the steps needed to get this 'fork' working?

Nic-Ma commented 2 years ago

Maybe your OS doesn't support the fork method? The initial issue looks like a known PyTorch problem; you may find a solution or workaround here: https://github.com/pytorch/pytorch/issues/70344

Thanks.

Johnz86 commented 2 years ago

The linked bug is Mac-related, and I work on Windows. Here is the output of python -c 'import monai; monai.config.print_debug_info()':

================================
Printing MONAI config...
================================
MONAI version: 0.8.1
Numpy version: 1.22.3
Pytorch version: 1.9.0+cu111
MONAI flags: HAS_EXT = False, USE_COMPILED = False    
MONAI rev id: 71ff399a3ea07aef667b23653620a290364095b1

Optional dependencies:
Pytorch Ignite version: 0.4.8
Nibabel version: 3.2.2
scikit-image version: 0.19.2
Pillow version: 9.1.0
Tensorboard version: 2.8.0
gdown version: 4.4.0
TorchVision version: 0.10.0+cu111
tqdm version: 4.64.0
lmdb version: 1.3.0
psutil version: 5.9.0
pandas version: NOT INSTALLED or UNKNOWN VERSION.
einops version: 0.3.2
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: NOT INSTALLED or UNKNOWN VERSION.

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

================================
Printing system config...
================================
System: Windows
Win32 version: ('10', '10.0.18363', 'SP0', 'Multiprocessor Free')
Win32 edition: Enterprise
Platform: Windows-10-10.0.18363-SP0
Processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
Machine: AMD64
Python version: 3.9.5
Process name: python.exe
Command: ['C:\\Python39\\python.exe', '-c', 'import monai; monai.config.print_debug_info()']
Open files: [popenfile(path='C:\\WINDOWS\\System32\\en-US\\KernelBase.dll.mui', fd=-1), popenfile(path='C:\\WINDOWS\\System32\\en-US\\kernel32.dll.mui', fd=-1)]
Num physical CPUs: 6
Num logical CPUs: 12
Num usable CPUs: 12
CPU usage (%): [29.7, 8.5, 36.3, 9.9, 17.3, 55.1, 25.7, 8.1, 26.0, 11.3, 20.8, 52.7]
CPU freq. (MHz): 2592
Load avg. in last 1, 5, 15 mins (%): [0.0, 0.0, 0.0]
Disk usage (%): 92.3
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 31.8
Available memory (GB): 17.6
Used memory (GB): 14.2

================================
Printing GPU config...
================================
Num GPUs: 1
Has CUDA: True
CUDA version: 11.1
cuDNN enabled: True
cuDNN version: 8005
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'compute_37']
GPU 0 Name: Quadro P2000
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 6
GPU 0 Total memory (GB): 4.0
GPU 0 CUDA capability (maj.min): 6.1

Are there any specific configuration options related to Windows and MONAI?

ericspod commented 2 years ago

Windows doesn't support fork semantics natively. We've had issues with Windows before, and the solution we have advised is to use the local worker only, i.e. train_loader = torch.utils.data.DataLoader(train_ds, batch_size=300, shuffle=True, num_workers=0).
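A rough sketch of how that advice could be applied without hard-coding the platform: fall back to num_workers=0 whenever "fork" is not among the supported start methods. The helper safe_num_workers is a hypothetical name, and treating "no fork" as "no workers" is a deliberately conservative assumption rather than a PyTorch rule:

```python
import multiprocessing as mp


def safe_num_workers(requested):
    """Return `requested` where fork is available, otherwise 0.

    Hypothetical helper: with 0 workers the DataLoader loads data in the
    main process, so no worker process has to be started at all.
    """
    if "fork" in mp.get_all_start_methods():
        return requested
    return 0


if __name__ == "__main__":
    print(safe_num_workers(10))
    # Assumed usage (requires torch, so it is not executed here):
    # train_loader = torch.utils.data.DataLoader(
    #     train_ds, batch_size=300, shuffle=True,
    #     num_workers=safe_num_workers(10))
```

The trade-off is slower data loading on platforms without fork, in exchange for avoiding the worker start-up errors described above.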

Johnz86 commented 2 years ago

After I set num_workers=0, I got a step further with the basic 2D segmentation example. I then hit a problem at the second step of model training: RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 0; 4.00 GiB total capacity; 2.70 GiB already allocated; 7.80 MiB free; 2.78 GiB reserved in total by PyTorch). I worked around this by setting the device to torch.device("cpu"). On the first run it took 1 hour 32 minutes to train the simplest example on my CPU. It would be nice to mention the basic requirements for a local run, along with simple Windows/Mac/Linux platform recommendations, in the notebook.

Nic-Ma commented 2 years ago

Hi @ericspod @wyli ,

I think maybe we can add a description of platform support to the requirements section of the README: https://github.com/Project-MONAI/tutorials/blob/master/README.md#1-requirements What do you think?

Thanks in advance.

ericspod commented 2 years ago

We should add something there and a little warning about Windows behaviour.

Nic-Ma commented 2 years ago

Hi @ericspod ,

Would you like to contribute a PR for it? I think you know more details about the platforms.

Thanks.

ericspod commented 2 years ago

I'm not really that familiar with the runtime costs of the tutorials, and I'm not sure whether Richard is. For the Windows issue I'd just add "Windows users may need to set the num_workers argument of DataLoader to 0 if errors are encountered during training."

Nic-Ma commented 2 years ago

OK, maybe @wyli knows more details from the CI environment.

Thanks.

wyli commented 2 years ago

Yes, we currently only have basic unit tests for Windows. Most of the integration, multi-processing, and file system access tests are skipped on Windows. We should spend more effort on this if there is enough user interest...