Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Import order with hf-datasets,pl and spacy leads to unpickling error #16897

Open zerogerc opened 1 year ago

zerogerc commented 1 year ago

Bug description

Hi, I've encountered a really strange problem occurring during data processing in our training pipelines.

I've managed to distill the problem to a single script:

import pytorch_lightning
from datasets import Dataset
import spacy

def main():
    dataset = Dataset.from_dict({
        "ids": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
    })

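    def _map_raw(examples_batch):
        # Loading the spaCy model inside the worker is needed to trigger the error.
        spacy.load("en_core_web_sm")
        return examples_batch

    # The module is never used; creating it is part of the reproduction.
    module = pytorch_lightning.LightningModule()
    # num_proc=4 runs the mapping in a multiprocess worker pool.
    dataset.map(_map_raw, batched=True, batch_size=2, num_proc=4)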

if __name__ == '__main__':
    main()

This fails with the following error:

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/sazanovich/.cache/bazel/_bazel_sazanovich/b6bfd90e9c0267baf464defccc50e727/external/python3_9_x86_64-unknown-linux-gnu/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/sazanovich/.cache/bazel/_bazel_sazanovich/b6bfd90e9c0267baf464defccc50e727/external/python3_9_x86_64-unknown-linux-gnu/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/sazanovich/.local/share/virtualenvs/grazie-ml-O5jEj6jC/lib/python3.9/site-packages/multiprocess/pool.py", line 576, in _handle_results
    task = get()
  File "/home/sazanovich/.local/share/virtualenvs/grazie-ml-O5jEj6jC/lib/python3.9/site-packages/multiprocess/connection.py", line 259, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/home/sazanovich/.local/share/virtualenvs/grazie-ml-O5jEj6jC/lib/python3.9/site-packages/dill/_dill.py", line 286, in loads
    return load(file, ignore, **kwds)
  File "/home/sazanovich/.local/share/virtualenvs/grazie-ml-O5jEj6jC/lib/python3.9/site-packages/dill/_dill.py", line 272, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/home/sazanovich/.local/share/virtualenvs/grazie-ml-O5jEj6jC/lib/python3.9/site-packages/dill/_dill.py", line 419, in load
    obj = StockUnpickler.load(self)
TypeError: __init__() takes 1 positional argument but 2 were given

What's interesting here is that this code can be fixed in several ways:

  1. Remove everything connected to PL
  2. Remove spacy.load from _map_raw
  3. Move the pl import after the spacy and datasets imports

I understand that this may not be a PL issue, but could you advise me on how this can happen? Where should I look? Is there a workaround?
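For anyone reading a similar traceback: the failure happens while the pool's result-handler thread unpickles something a worker sent back. One generic way to get `TypeError: __init__() takes 1 positional argument but 2 were given` at that point is a worker raising an exception whose `__init__` signature does not match its stored `args`. The sketch below only illustrates that general Python mechanism; `WorkerError` is a hypothetical class, and nothing here confirms that this is the cause in this issue.

```python
class WorkerError(Exception):
    """Hypothetical exception whose __init__ accepts no extra arguments."""

    def __init__(self):
        # The message goes to the base class, so self.args == ("worker failed",)
        super().__init__("worker failed")


def rebuild(exc):
    """Recreate an exception the way unpickling does: call cls(*args)."""
    cls, args = exc.__reduce__()[:2]
    return cls(*args)


def try_rebuild(exc):
    """Return the rebuilt exception, or the TypeError raised while rebuilding."""
    try:
        return rebuild(exc)
    except TypeError as err:
        return err


result = try_rebuild(WorkerError())
# Rebuilding calls WorkerError("worker failed"), which __init__ rejects.
print(type(result).__name__, result)
```

If this mechanism applies, the TypeError hides the real exception raised in the worker, so running the mapped function once without `num_proc` may surface the underlying error.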

Environment

python==3.9
torch==1.13.1
pytorch_lightning==1.9.3
datasets==2.9.0
spacy==3.4.4

I use the Python provided by bazel-rules; all requirements are installed with pip.
OS: Ubuntu 18.04.6 LTS (GNU/Linux 4.15.0-197-generic x86_64)
NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4

zerogerc commented 1 year ago

More info: I remembered that I recently updated our PL version. I installed pytorch_lightning==1.6.4 and everything works fine.

zerogerc commented 1 year ago

Hi, any updates on this? I still cannot update PL.

carmocca commented 1 year ago

We would need to reproduce it to be able to help. The script above gives me

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
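The E050 error indicates that the `en_core_web_sm` model package is not installed in the environment running the script; it is usually installed with `python -m spacy download en_core_web_sm`. As a minimal sketch, the reproduction could check for the package before calling `spacy.load`. The helper name `is_installed` is an assumption made for illustration, not part of spaCy.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None


if not is_installed("en_core_web_sm"):
    # spaCy models ship as regular Python packages, so a missing package
    # means spacy.load("en_core_web_sm") would fail with E050.
    print("en_core_web_sm is missing; run: python -m spacy download en_core_web_sm")
```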