Open yayamamo opened 5 months ago
Hi, I am using Python 3.12.2 and torch 2.2.2 on macOS 12.7.4, with ReFinED version 1.0. When I try fine-tuning ( python src/refined/training/fine_tune/fine_tune.py --experiment_name test ), the following error happens:
TypeError: cannot pickle 'Environment' object
Could you tell me what can be done to work around this issue?
Thanks.
This happens when there are no GPUs, but I am not sure how to work around it.
(I'm not from Amazon.)
The Environment object is likely lmdb's Environment class: https://lmdb.readthedocs.io/en/release/#environment-class
Without the full stack trace, I can only guess. My guess is that something is trying to save the processor, which includes the preprocessor, which in turn includes the lookups into the lmdb tables. Perhaps the checkpointing portion.
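For what it's worth, lmdb Environment handles are simply not picklable, which is easy to reproduce in isolation (a minimal sketch, using a hypothetical scratch path):

import lmdb
import pickle

# Open any lmdb environment and try to pickle the handle:
env = lmdb.open("/tmp/example-lmdb", map_size=2 ** 20)
pickle.dumps(env)  # raises: TypeError: cannot pickle 'Environment' object

So anything that tries to serialize an object holding an open Environment (checkpointing, or handing it to another process) will fail with exactly this error.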
Nevertheless, I suggest you use a machine with a GPU, because this is research code and not 'battle-tested' in different environments (e.g. pure-CPU training). I have tested inference, and CPU-only works, but I didn't try training/fine-tuning on just a CPU.
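For reference, CPU-only inference worked for me along these lines (a minimal sketch based on the README's usage; I believe from_pretrained accepts a device argument, but verify the signature in your version):

from refined.inference.processor import Refined

# Load a pretrained model; device="cpu" should force CPU-only inference
# (an assumption about the signature -- check against your ReFinED version).
refined = Refined.from_pretrained(
    model_name="wikipedia_model",
    entity_set="wikipedia",
    device="cpu",
)

spans = refined.process_text("England won the FIFA World Cup in 1966.")
print(spans)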
Thanks, it may be reasonable to try a GPU machine.
Just for reference, I put the full stack trace below.
/Users/yayamamo/.pyenv/versions/3.12.2/lib/python3.12/site-packages/torch/cuda/amp/grad_scaler.py:126: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
warnings.warn(
14:35:33 - __main__ - INFO - Fine-tuning end-to-end EL
14:36:00 - __main__ - INFO - Fine-tuning end-to-end EL
INFO:__main__:Fine-tuning end-to-end EL
0%| | 0/10 [00:00<?, ?it/s]14:36:02 - __main__ - INFO - Starting epoch number 0
INFO:__main__:Starting epoch number 0
14:36:02 - __main__ - INFO - lr: 0.0
INFO:__main__:lr: 0.0
14:36:02 - __main__ - INFO - lr: 0.0
INFO:__main__:lr: 0.0
14:36:02 - __main__ - INFO - lr: 0.0
INFO:__main__:lr: 0.0
14:36:02 - __main__ - INFO - lr: 0.0
INFO:__main__:lr: 0.0
14:36:02 - __main__ - INFO - lr: 0.0
INFO:__main__:lr: 0.0
0%| | 0/10 [00:03<?, ?it/s]
Traceback (most recent call last):
File "/Users/yayamamo/git/ReFinED/src/refined/training/fine_tune/fine_tune.py", line 207, in <module>
main()
File "/Users/yayamamo/git/ReFinED/src/refined/training/fine_tune/fine_tune.py", line 44, in main
start_fine_tuning_task(refined=refined,
File "/Users/yayamamo/git/ReFinED/src/refined/training/fine_tune/fine_tune.py", line 95, in start_fine_tuning_task
run_fine_tuning_loops(refined=refined, fine_tuning_args=fine_tuning_args,
File "/Users/yayamamo/git/ReFinED/src/refined/training/fine_tune/fine_tune.py", line 114, in run_fine_tuning_loops
for step, batch in tqdm(enumerate(training_dataloader), total=len(training_dataloader)):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yayamamo/.pyenv/versions/3.12.2/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 439, in __iter__
return self._get_iterator()
^^^^^^^^^^^^^^^^^^^^
File "/Users/yayamamo/.pyenv/versions/3.12.2/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 387, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yayamamo/.pyenv/versions/3.12.2/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1040, in __init__
w.start()
File "/Users/yayamamo/.pyenv/versions/3.12.2/lib/python3.12/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
File "/Users/yayamamo/.pyenv/versions/3.12.2/lib/python3.12/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yayamamo/.pyenv/versions/3.12.2/lib/python3.12/multiprocessing/context.py", line 289, in _Popen
return Popen(process_obj)
^^^^^^^^^^^^^^^^^^
File "/Users/yayamamo/.pyenv/versions/3.12.2/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/Users/yayamamo/.pyenv/versions/3.12.2/lib/python3.12/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/Users/yayamamo/.pyenv/versions/3.12.2/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/Users/yayamamo/.pyenv/versions/3.12.2/lib/python3.12/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'Environment' object
My guess is that the torch data loader is trying to spin up a number of worker processes to prepare the batches of data. On macOS, multiprocessing defaults to the 'spawn' start method (you can see popen_spawn_posix in the trace), so each worker has to pickle the dataset, including any open lmdb Environment it holds. The problem is likely here: https://github.com/amazon-science/ReFinED/blob/main/src/refined/dataset_reading/entity_linking/wikipedia_dataset.py
You can look it up; I think others have encountered similar issues with lmdb pickling when using multiple workers: https://github.com/pytorch/vision/issues/689#issuecomment-787215916
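The two usual workarounds from that thread: pass num_workers=0 to the DataLoader so no workers are spawned at all, or store only the lmdb path in the dataset and open the Environment lazily inside each worker. A rough sketch of the lazy-open pattern (the class and parameters here are illustrative, not ReFinED's actual code):

import lmdb
from torch.utils.data import Dataset

class LazyLmdbDataset(Dataset):
    def __init__(self, lmdb_path: str, num_records: int):
        self.lmdb_path = lmdb_path    # a plain string pickles fine
        self.num_records = num_records
        self.env = None               # the Environment itself is never pickled

    def __getitem__(self, index):
        if self.env is None:
            # Opened lazily on first access, i.e. inside the worker process,
            # so the DataLoader never needs to pickle an open Environment.
            self.env = lmdb.open(self.lmdb_path, readonly=True, lock=False)
        with self.env.begin() as txn:
            return txn.get(str(index).encode("utf-8"))

    def __len__(self):
        return self.num_records

If you just want fine-tuning to run, setting num_workers=0 where the training DataLoader is constructed is probably the quickest change to try.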
Cheers