TypeError: can't pickle Tokenizer objects when num_workers > 0 and lazy = true

JohnGiorgi commented 4 years ago

Checklist

[X] I have verified that the issue exists against the master branch of AllenNLP.
[X] I have read the relevant section in the contribution guide on reporting bugs.
[X] I have checked the issues list for similar or identical bug reports.
[X] I have checked the pull requests list for existing proposed fixes.
[X] I have checked the CHANGELOG and the commit log to find out if the bug was already fixed in the master branch.
[X] I have included in the "Description" section below a traceback from any exceptions related to this bug.
[X] I have included in the "Related issues or possible duplicates" section below all related issues and possible duplicate issues (If there are none, check this box anyway).
[X] I have included in the "Environment" section below the name of the operating system and Python version that I was using when I discovered this bug.
[X] I have included in the "Environment" section below the output of pip freeze.
[X] I have included in the "Steps to reproduce" section below a minimally reproducible example.

Description

I get a TypeError: can't pickle Tokenizer objects when trying to train a model that uses a PretrainedTransformerTokenizer tokenizer when "dataset_reader.lazy": true and "data_loader.num_workers" > 0. This appears to happen for every version of AllenNLP after 1.0.0rc3 (specifically this commit) including the current master branch. The 1.0.0rc3 release and earlier releases do not have this issue.

The notes in #4344 seem to suggest it has been solved, but I can still trigger it with a minimal example (see below).

Python traceback:

``` Traceback (most recent call last): File "/home/johnmg/t2t/bin/allennlp", line 33, in sys.exit(load_entry_point('allennlp', 'console_scripts', 'allennlp')()) File "/scratch/johnmg/allennlp/allennlp/__main__.py", line 24, in run main(prog="allennlp") File "/scratch/johnmg/allennlp/allennlp/commands/__init__.py", line 92, in main args.func(args) File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 112, in train_model_from_args dry_run=args.dry_run, File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 171, in train_model_from_file dry_run=dry_run, File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 295, in train_model nprocs=num_procs, File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn return start_processes(fn, args, nprocs, join, daemon, start_method='spawn') File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes while not context.join(): File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join raise Exception(msg) Exception: -- Process 0 terminated with the following error: Traceback (most recent call last): File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap fn(i, *args) File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 418, in _train_worker params=params, serialization_dir=serialization_dir, local_rank=process_rank, File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 580, in from_params **extras, File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 611, in from_params return constructor_to_call(**kwargs) # type: ignore File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 647, in from_partial_objects data_loader_ = data_loader.construct(dataset=datasets["train"]) File "/scratch/johnmg/allennlp/allennlp/common/lazy.py", line 46, in construct return self._constructor(**kwargs) File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 446, in constructor return value_cls.from_params(params=deepcopy(popped_params), **constructor_extras) File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 580, in from_params **extras, File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 611, in from_params return constructor_to_call(**kwargs) # type: ignore File "/scratch/johnmg/allennlp/allennlp/data/dataloader.py", line 151, in from_partial_objects batches_per_epoch=batches_per_epoch, File "/scratch/johnmg/allennlp/allennlp/data/dataloader.py", line 90, in __init__ self._data_generator = super().__iter__() File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__ return _MultiProcessingDataLoaderIter(self) File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 719, in __init__ w.start() File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/process.py", line 112, in start self._popen = self._Popen(self) File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/context.py", line 223, in _Popen return _default_context.get_context().Process._Popen(process_obj) File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/context.py", line 284, in _Popen return Popen(process_obj) File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__ super().__init__(process_obj) File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__ self._launch(process_obj) File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch reduction.dump(process_obj, fp) File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/reduction.py", line 60, in dump ForkingPickler(file, protocol).dump(obj) TypeError: can't pickle Tokenizer objects ```

Related issues or possible duplicates

4344

Environment

OS:

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Python version: 3.7.4

Output of pip freeze:

``` absl-py==0.7.1 aiohttp==3.6.2 alabaster==0.7.12 -e git+https://github.com/allenai/allennlp.git@b6fd6978b507ce6118023e23f3e4dbfa334d39b5#egg=allennlp apex==0.1 appdirs==1.4.3 aspy.yaml==1.3.0 astor==0.8.1 async-timeout==3.0.1 atomicwrites==1.3.0 attrs==19.3.0 Babel==2.7.0 backcall==0.1.0 beautifulsoup4==4.8.2 black==19.10b0 bleach==3.1.0 blis==0.2.4 boto==2.49.0 boto3==1.10.9 botocore==1.13.9 cachetools==3.1.1 cc-net==0.1.0 certifi==2019.9.11 cffi==1.13.2 cfgv==2.0.1 chardet==3.0.4 click==7.1.1 codecov==2.0.15 conllu==2.3.2 coverage==4.5.4 cryptography==2.8 cycler==0.10.0 cymem==2.0.2 -e git+https://github.com/JohnGiorgi/t2t.git@5cc03ed58253e12bd1060f1fea2b89bae3acdb84#egg=declutr decorator==4.4.1 dill==0.3.1.1 docutils==0.15.2 editdistance==0.5.2 en-core-web-sm==2.1.0 entrypoints==0.3 fastapi==0.58.0 fasttext==0.9.1 filelock==3.0.12 fire==0.2.1 flake8==3.7.9 flaky==3.6.1 Flask==1.1.1 Flask-Cors==3.0.8 ftfy==5.5.1 func-argparse==1.1.1 future==0.17.1 gast==0.2.2 gensim==3.8.1 getpy==0.9.9 gevent==1.4.0 google-auth==1.11.0 google-auth-oauthlib==0.4.1 google-pasta==0.1.8 greenlet==0.4.15 grpcio==1.25.0 h11==0.9.0 h5py==2.9.0 htmlmin==0.1.12 httptools==0.1.1 hypothesis==5.16.0 identify==1.4.10 idna==2.8 imagesize==1.1.0 importlib-metadata==0.23 ipython==7.10.1 ipython-genutils==0.2.0 isort==4.3.21 itsdangerous==1.1.0 jedi==0.15.1 jeepney==0.4.2 Jinja2==2.10.3 jmespath==0.9.4 joblib==0.14.0 jsmin==2.2.2 jsonnet==0.10.0 jsonpickle==1.2 jsonschema==3.0.2 kenlm==0.0.0 Keras-Applications==1.0.8 Keras-Preprocessing==1.1.0 keyring==21.1.0 kiwisolver==1.1.0 livereload==2.6.1 lxml==4.4.1 Markdown==3.1.1 markdown-include==0.5.1 MarkupSafe==1.1.1 mathy-pydoc==0.6.7 matplotlib==3.0.3 maturin==0.8.1 mccabe==0.6.1 mkdocs==1.0.4 mkdocs-material==4.6.3 mkdocs-minify-plugin==0.2.1 more-itertools==7.2.0 multidict==4.5.2 murmurhash==0.28.0 mypy==0.770 mypy-extensions==0.4.3 nltk==3.4 nodeenv==1.3.4 numpy==1.16.3 numpydoc==0.8.0 oauthlib==3.1.0 opt-einsum==2.3.2 overrides==3.1.0 packaging==19.2 pandas==0.25.3 parsimonious==0.8.0 parso==0.5.1 pathspec==0.7.0 pep562==1.0 pexpect==4.7.0 pickleshare==0.7.5 Pillow==6.2.1 Pillow-SIMD==7.0.0.post3 pkginfo==1.5.0.1 plac==0.9.6 pluggy==0.13.0 pre-commit==2.2.0 preshed==2.0.1 prompt-toolkit==3.0.2 protobuf==3.10.0 ptyprocess==0.6.0 py==1.8.0 pyasn1==0.4.8 pyasn1-modules==0.2.8 pybind11==2.4.3 pycodestyle==2.5.0 pycparser==2.19 pydantic==1.5.1 pydoc-markdown==2.0.5 pyflakes==2.1.1 Pygments==2.4.2 pymdown-extensions==6.3 pyparsing==2.4.3 pyrsistent==0.15.3 pytest==5.2.2 pytest-cov==2.8.1 python-dateutil==2.8.0 -e git+https://github.com/KevinMusgrave/pytorch-metric-learning.git@48de2dd9c4d78873d675f19187c5205075a6a9de#egg=pytorch_metric_learning pytz==2019.3 PyYAML==5.1.2 -e git+https://github.com/JohnGiorgi/QuickThought.git@397b8b18f3cc50a3471fe26f9725401fb2297816#egg=quickthought readme-renderer==24.0 regex==2018.1.10 requests==2.22.0 requests-oauthlib==1.3.0 requests-toolbelt==0.9.1 responses==0.10.6 rsa==4.0 ruamel.yaml==0.16.5 ruamel.yaml.clib==0.2.0 s3transfer==0.2.1 sacremoses==0.0.35 scikit-learn==0.21.2 scipy==1.4.1 SecretStorage==3.1.2 semantic-version==2.8.4 sentence-splitter==1.4 sentence-transformers==0.2.6.1 sentencepiece==0.1.82 setuptools-rust==0.10.6 singledispatch==3.4.0.3 six==1.12.0 smart-open==1.8.4 snowballstemmer==2.0.0 sortedcontainers==2.2.2 soupsieve==2.0 spacy==2.1.4 Sphinx==2.2.1 sphinxcontrib-applehelp==1.0.1 sphinxcontrib-devhelp==1.0.1 sphinxcontrib-htmlhelp==1.0.2 sphinxcontrib-jsmath==1.0.1 sphinxcontrib-qthelp==1.0.2 sphinxcontrib-serializinghtml==1.1.3 sqlparse==0.3.0 srsly==0.0.5 starlette==0.13.4 tensorboard==1.15.0 tensorboardX==1.9 tensorflow-estimator==1.15.1 tensorflow-gpu==1.15.0 tensorflow-hub==0.8.0 termcolor==1.1.0 Theano==1.0.1 thinc==7.0.4 tokenizers==0.7.0 toml==0.10.0 torch==1.5.0 torchvision==0.6.0+cu101 tornado==6.0.3 tqdm==4.37.0 traitlets==4.3.3 transformers==2.11.0 twine==3.1.1 typed-ast==1.4.1 typer==0.2.1 typing-extensions==3.7.4.1 Unidecode==1.1.1 urllib3==1.25.6 uvicorn==0.11.5 uvloop==0.14.0 virtualenv==16.7.9 wasabi==0.4.0 wcwidth==0.1.7 webencodings==0.5.1 websockets==8.1 Werkzeug==0.16.0 word2number==1.1 wrapt==1.11.2 yarl==1.4.2 zipp==0.6.0 ```

Steps to reproduce

Install a version of AllenNLP and AllenNLP-Models newer than 1.0.0rc3.
Train a model which uses a PretrainedTransformerTokenizer with "dataset_reader.lazy": true and "data_loader.num_workers" > 0. E.g. I used this config with some overrides (see below).

Example source:

```bash allennlp train mnli_roberta.jsonnet \ --serialization-dir ./debug \ --overrides "{'dataset_reader.lazy': true, 'data_loader.batch_sampler': null, 'data_loader.num_workers': 1}" \ -f ```

epwalsh commented 4 years ago

Hi @JohnGiorgi, can you share your config? Are you using the num_workers option with your data loader?

JohnGiorgi commented 4 years ago

Hi @epwalsh, yes, it looks like num_workers > 0 was the culprit here. I just noticed that the logger prints:

UserWarning: Using multi-process data loading without setting DatasetReader.manual_multi_process_sharding to True.
Did you forget to set this?
If you're not handling the multi-process sharding logic within your _read() method, there is probably no benefit to using more than one worker.

so maybe my issue is unnecessary and I should leave num_workers at its default? (I confirmed the error does not happen when num_workers is unset).

In any case, I have updated my original issue with a minimal example that triggers the error.

epwalsh commented 4 years ago

Gotcha. Yea, like the warning says there is probably no benefit to using num_workers > 0 unless you implement some custom logic within _read() to handle that.

But even then, you'll probably still see his exception, which arises because each TextField within each of your data Instances includes a PretrainedTransformerIndexer, which itself wraps a HuggingFace Tokenizer object.

Now when the main process loading data needs to gather the Instances from the data loading workers, it uses pickle to communicate. But since HuggingFace Tokenizers currently can't be pickled, this error is raised.

epwalsh commented 4 years ago

That said, we are planning on making some changes to our data loading story soon. One of the proposed changes is to make Instances / Fields pure data objects - i.e. with no references to tokenizers, token indexers, or anything else - which would solve this particular issue without requiring the HuggingFace tokenizers to be pickle-able.

JohnGiorgi commented 4 years ago

@epwalsh Gotcha, thanks for the detailed response.

For now, I will leave num_workers unset (I think I only set it to 1 in the first place as it gave me a small reduction in train time, but I don't actually remember).

I will lookout for the proposed changes to the Instance/Field objects :)

allenai / allennlp