fastai / fastai

The fastai deep learning library
http://docs.fast.ai
Apache License 2.0

fastai from windows script #3394

Closed nikky4D closed 3 years ago

nikky4D commented 3 years ago

Please confirm you have the latest versions of fastai, fastcore, and nbdev prior to reporting a bug (delete one): YES

Describe the bug

In my image dataloader, with num_workers > 0 (here num_workers = 2) and multiprocessing.set_start_method('spawn') set as in the Windows script example, I get the error "THCudaCheck". With num_workers = 0, the model builds and trains. The error occurs only when fit_one_cycle() is called.
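A possible reading of the stack trace in this report: the failing frame is `storage._share_cuda_()` inside `reduce_tensor`, reached while `multiprocessing` pickles the worker process object. That suggests some object reachable from the DataLoader holds a CUDA tensor, and PyTorch does not support CUDA IPC on Windows. The sketch below is one way to look for such objects before training starts. `find_cuda_tensors` is a hypothetical helper, not part of fastai or PyTorch; it only relies on `torch.Tensor` exposing an `is_cuda` attribute, and it walks plain containers and instance `__dict__`s, so it can miss tensors stored in other ways.

```python
def find_cuda_tensors(obj, path="obj", _seen=None, max_depth=6):
    """Return the access paths of nested objects whose `is_cuda` is True.

    Hypothetical diagnostic helper: it duck-types on `is_cuda`, so it
    needs no torch import and makes no assumptions about fastai internals.
    """
    if _seen is None:
        _seen = set()
    # Stop on cycles and on very deep object graphs.
    if max_depth < 0 or id(obj) in _seen:
        return []
    _seen.add(id(obj))

    try:
        on_gpu = getattr(obj, "is_cuda", False) is True
    except Exception:
        # Some objects raise from custom attribute lookup; treat as "not CUDA".
        on_gpu = False
    if on_gpu:
        return [path]

    # Collect (path, child) pairs for containers and plain instances.
    if isinstance(obj, dict):
        children = [(f"{path}[{k!r}]", v) for k, v in obj.items()]
    elif isinstance(obj, (list, tuple)):
        children = [(f"{path}[{i}]", v) for i, v in enumerate(obj)]
    elif hasattr(obj, "__dict__"):
        children = [(f"{path}.{k}", v) for k, v in vars(obj).items()]
    else:
        children = []

    found = []
    for child_path, child in children:
        found.extend(find_cuda_tensors(child, child_path, _seen, max_depth - 1))
    return found
```

With a real learner, a call such as `find_cuda_tensors(learn.dls.train, "dls.train")` just before `fit_one_cycle()` might point at the attribute that needs to stay on the CPU until the batch reaches the main process. Treat any hit as a lead to investigate, not as a confirmed cause.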

Error with full stack trace

THCudaCheck FAIL file=..\torch/csrc/generic/StorageSharing.cpp line=253 error=801 : operation not supported
Traceback (most recent call last):
  File "fastai_train_faces.py", line 354, in <module>
    main(args)
  File "fastai_train_faces.py", line 242, in main
    model_learner = train_fn(args)
  File "fastai_train_faces.py", line 229, in train_fn
    model_learner.fit_one_cycle(n_epoch= num_epochs,
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\site-packages\fastai\callback\schedule.py", line 112, in fit_one_cycle
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\site-packages\fastai\learner.py", line 218, in fit
    self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\site-packages\fastai\learner.py", line 160, in _with_events
    try: self(f'before_{event_type}');  f()
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\site-packages\fastai\learner.py", line 209, in _do_fit
    self._with_events(self._do_epoch, 'epoch', CancelEpochException)
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\site-packages\fastai\learner.py", line 160, in _with_events
    try: self(f'before_{event_type}');  f()
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\site-packages\fastai\learner.py", line 203, in _do_epoch
    self._do_epoch_train()
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\site-packages\fastai\learner.py", line 195, in _do_epoch_train
    self._with_events(self.all_batches, 'train', CancelTrainException)
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\site-packages\fastai\learner.py", line 160, in _with_events
    try: self(f'before_{event_type}');  f()
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\site-packages\fastai\learner.py", line 166, in all_batches
    for o in enumerate(self.dl): self.one_batch(*o)
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\site-packages\fastai\data\load.py", line 109, in __iter__
    for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\site-packages\torch\utils\data\dataloader.py", line 914, in __init__
    w.start()
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\site-packages\torch\multiprocessing\reductions.py", line 240, in reduce_tensor
    event_sync_required) = storage._share_cuda_()
RuntimeError: cuda runtime error (801) : operation not supported at ..\torch/csrc/generic/StorageSharing.cpp:253

(facesYr2_env) C:\Users\nuzuegbunam\Documents\Projects\2500-FACES\faces_emotion>
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\nuzuegbunam\Anaconda3\envs\facesYr2_env\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

coldfir3 commented 3 years ago

I came here to ask the exact same thing.

I can run fastai locally just fine with num_workers = 0 on Windows, but that is painfully slow. The GPU load keeps fluctuating, and training takes 3-4x longer than it should.

Is there a tutorial, guide, or set of best practices for running fastai on Windows? I am not asking about installation, CUDA, or anything like that. My only issue is with the dataloader.

On a side note, I managed to run almost the same pipeline using PyTorch Lightning by wrapping my call inside

if __name__ == '__main__':
    main_train_loop()

but this didn’t work with fastai for me. I keep getting pickling errors relating to my augmentation functions (I am not using any lambda functions).

Thanks in advance, any tip would be helpful.
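For the pickling errors mentioned above: with the `spawn` start method, which is what Windows uses, everything handed to a worker process has to survive `pickle`. Lambdas are one culprit, but functions defined inside another function fail the same way. The sketch below shows a script layout with the `__main__` guard plus a small pre-flight check. `unpicklable`, `flip_lr`, and `local_aug` are hypothetical names used for illustration only; nothing here is fastai API.

```python
import pickle


def flip_lr(x):
    """Hypothetical augmentation. Defined at module level, so pickle can
    reference it by name."""
    return x[::-1]


def unpicklable(named_objs):
    """Return {name: error text} for every object that pickle.dumps rejects."""
    failures = {}
    for name, obj in named_objs.items():
        try:
            pickle.dumps(obj)
        except Exception as exc:  # the exception type varies by Python version
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


def main():
    # A function defined inside another function cannot be pickled,
    # even though it is not a lambda.
    def local_aug(x):
        return x

    report = unpicklable({"flip_lr": flip_lr, "local_aug": local_aug})
    for name, err in report.items():
        print(f"{name} would fail in a spawned worker: {err}")
    # Build the DataLoaders and the Learner here, then call fit.


if __name__ == "__main__":
    # Spawned workers re-import this module; the guard keeps them from
    # re-running the training entry point.
    main()
```

Running the check on each augmentation before building the dataloaders narrows down which one triggers the error; moving that function to module level is usually the first thing to try.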

nikky4D commented 3 years ago

A tutorial or sample would be good. @coldfir3, can you elaborate on what you did with PyTorch Lightning?

coldfir3 commented 3 years ago

Of course. This is the code I used to train: https://colab.research.google.com/drive/1gJ0sT5wCBbJRLRU9htfyKSIW73y4jtEn?usp=sharing. For some reason it won't work in Jupyter, so I had to save it as a .py file and run it with 'python code.py'.

nikky4D commented 3 years ago

Thanks for sharing. This is really helpful for me.

@muellerzr Would you have any advice on speeding up fastai on Windows? The sample code does not work for me.

coldfir3 commented 3 years ago

Sadly I have not. I could not make fastai work fast (sorry for the joke) on Windows. In the end I installed Ubuntu, and whenever I need to run something locally I just switch OS. It is a pain, but better than using 5% of my GPU when training with a single worker. :)

nikky4D commented 3 years ago

I am going that route as well. Thanks for the links.