Closed JohnGiorgi closed 4 years ago
Hi @JohnGiorgi, can you share your config? Are you using the num_workers
option with your data loader?
Hi @epwalsh, yes, it looks like num_workers > 0
was the culprit here. I just noticed that the logger prints:
UserWarning: Using multi-process data loading without setting DatasetReader.manual_multi_process_sharding to True.
Did you forget to set this?
If you're not handling the multi-process sharding logic within your _read() method, there is probably no benefit to using more than one worker.
so maybe my issue is unnecessary and I should leave num_workers
at its default? (I confirmed the error does not happen when num_workers
is unset).
In any case, I have updated my original issue with a minimal example that triggers the error.
Gotcha. Yea, like the warning says there is probably no benefit to using num_workers > 0
unless you implement some custom logic within _read()
to handle that.
But even then, you'll probably still see his exception, which arises because each TextField
within each of your data Instance
s includes a PretrainedTransformerIndexer
, which itself wraps a HuggingFace Tokenizer
object.
Now when the main process loading data needs to gather the Instance
s from the data loading workers, it uses pickle
to communicate. But since HuggingFace Tokenizer
s currently can't be pickled, this error is raised.
That said, we are planning on making some changes to our data loading story soon. One of the proposed changes is to make Instance
s / Field
s pure data objects - i.e. with no references to tokenizers, token indexers, or anything else - which would solve this particular issue without requiring the HuggingFace tokenizers to be pickle-able.
@epwalsh Gotcha, thanks for the detailed response.
For now, I will leave num_workers
unset (I think I only set it to 1 in the first place as it gave me a small reduction in train time, but I don't actually remember).
I will lookout for the proposed changes to the Instance
/Field
objects :)
Checklist
master
branch of AllenNLP.pip freeze
.Description
I get a
TypeError: can't pickle Tokenizer objects
when trying to train a model that uses aPretrainedTransformerTokenizer
tokenizer when"dataset_reader.lazy": true
and"data_loader.num_workers" > 0
. This appears to happen for every version of AllenNLP after 1.0.0rc3 (specifically this commit) including the current master branch. The 1.0.0rc3 release and earlier releases do not have this issue.The notes in #4344 seem to suggest it has been solved, but I can still trigger it with a minimal example (see below).
Python traceback:
``` Traceback (most recent call last): File "/home/johnmg/t2t/bin/allennlp", line 33, in
sys.exit(load_entry_point('allennlp', 'console_scripts', 'allennlp')())
File "/scratch/johnmg/allennlp/allennlp/__main__.py", line 24, in run
main(prog="allennlp")
File "/scratch/johnmg/allennlp/allennlp/commands/__init__.py", line 92, in main
args.func(args)
File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 112, in train_model_from_args
dry_run=args.dry_run,
File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 171, in train_model_from_file
dry_run=dry_run,
File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 295, in train_model
nprocs=num_procs,
File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 418, in _train_worker
params=params, serialization_dir=serialization_dir, local_rank=process_rank,
File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 580, in from_params
**extras,
File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 611, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 647, in from_partial_objects
data_loader_ = data_loader.construct(dataset=datasets["train"])
File "/scratch/johnmg/allennlp/allennlp/common/lazy.py", line 46, in construct
return self._constructor(**kwargs)
File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 446, in constructor
return value_cls.from_params(params=deepcopy(popped_params), **constructor_extras)
File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 580, in from_params
**extras,
File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 611, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/scratch/johnmg/allennlp/allennlp/data/dataloader.py", line 151, in from_partial_objects
batches_per_epoch=batches_per_epoch,
File "/scratch/johnmg/allennlp/allennlp/data/dataloader.py", line 90, in __init__
self._data_generator = super().__iter__()
File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__
return _MultiProcessingDataLoaderIter(self)
File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 719, in __init__
w.start()
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/process.py", line 112, in start
self._popen = self._Popen(self)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle Tokenizer objects
```
Related issues or possible duplicates
4344
Environment
OS:
Python version: 3.7.4
Output of
pip freeze
:``` absl-py==0.7.1 aiohttp==3.6.2 alabaster==0.7.12 -e git+https://github.com/allenai/allennlp.git@b6fd6978b507ce6118023e23f3e4dbfa334d39b5#egg=allennlp apex==0.1 appdirs==1.4.3 aspy.yaml==1.3.0 astor==0.8.1 async-timeout==3.0.1 atomicwrites==1.3.0 attrs==19.3.0 Babel==2.7.0 backcall==0.1.0 beautifulsoup4==4.8.2 black==19.10b0 bleach==3.1.0 blis==0.2.4 boto==2.49.0 boto3==1.10.9 botocore==1.13.9 cachetools==3.1.1 cc-net==0.1.0 certifi==2019.9.11 cffi==1.13.2 cfgv==2.0.1 chardet==3.0.4 click==7.1.1 codecov==2.0.15 conllu==2.3.2 coverage==4.5.4 cryptography==2.8 cycler==0.10.0 cymem==2.0.2 -e git+https://github.com/JohnGiorgi/t2t.git@5cc03ed58253e12bd1060f1fea2b89bae3acdb84#egg=declutr decorator==4.4.1 dill==0.3.1.1 docutils==0.15.2 editdistance==0.5.2 en-core-web-sm==2.1.0 entrypoints==0.3 fastapi==0.58.0 fasttext==0.9.1 filelock==3.0.12 fire==0.2.1 flake8==3.7.9 flaky==3.6.1 Flask==1.1.1 Flask-Cors==3.0.8 ftfy==5.5.1 func-argparse==1.1.1 future==0.17.1 gast==0.2.2 gensim==3.8.1 getpy==0.9.9 gevent==1.4.0 google-auth==1.11.0 google-auth-oauthlib==0.4.1 google-pasta==0.1.8 greenlet==0.4.15 grpcio==1.25.0 h11==0.9.0 h5py==2.9.0 htmlmin==0.1.12 httptools==0.1.1 hypothesis==5.16.0 identify==1.4.10 idna==2.8 imagesize==1.1.0 importlib-metadata==0.23 ipython==7.10.1 ipython-genutils==0.2.0 isort==4.3.21 itsdangerous==1.1.0 jedi==0.15.1 jeepney==0.4.2 Jinja2==2.10.3 jmespath==0.9.4 joblib==0.14.0 jsmin==2.2.2 jsonnet==0.10.0 jsonpickle==1.2 jsonschema==3.0.2 kenlm==0.0.0 Keras-Applications==1.0.8 Keras-Preprocessing==1.1.0 keyring==21.1.0 kiwisolver==1.1.0 livereload==2.6.1 lxml==4.4.1 Markdown==3.1.1 markdown-include==0.5.1 MarkupSafe==1.1.1 mathy-pydoc==0.6.7 matplotlib==3.0.3 maturin==0.8.1 mccabe==0.6.1 mkdocs==1.0.4 mkdocs-material==4.6.3 mkdocs-minify-plugin==0.2.1 more-itertools==7.2.0 multidict==4.5.2 murmurhash==0.28.0 mypy==0.770 mypy-extensions==0.4.3 nltk==3.4 nodeenv==1.3.4 numpy==1.16.3 numpydoc==0.8.0 oauthlib==3.1.0 opt-einsum==2.3.2 overrides==3.1.0 packaging==19.2 pandas==0.25.3 parsimonious==0.8.0 parso==0.5.1 pathspec==0.7.0 pep562==1.0 pexpect==4.7.0 pickleshare==0.7.5 Pillow==6.2.1 Pillow-SIMD==7.0.0.post3 pkginfo==1.5.0.1 plac==0.9.6 pluggy==0.13.0 pre-commit==2.2.0 preshed==2.0.1 prompt-toolkit==3.0.2 protobuf==3.10.0 ptyprocess==0.6.0 py==1.8.0 pyasn1==0.4.8 pyasn1-modules==0.2.8 pybind11==2.4.3 pycodestyle==2.5.0 pycparser==2.19 pydantic==1.5.1 pydoc-markdown==2.0.5 pyflakes==2.1.1 Pygments==2.4.2 pymdown-extensions==6.3 pyparsing==2.4.3 pyrsistent==0.15.3 pytest==5.2.2 pytest-cov==2.8.1 python-dateutil==2.8.0 -e git+https://github.com/KevinMusgrave/pytorch-metric-learning.git@48de2dd9c4d78873d675f19187c5205075a6a9de#egg=pytorch_metric_learning pytz==2019.3 PyYAML==5.1.2 -e git+https://github.com/JohnGiorgi/QuickThought.git@397b8b18f3cc50a3471fe26f9725401fb2297816#egg=quickthought readme-renderer==24.0 regex==2018.1.10 requests==2.22.0 requests-oauthlib==1.3.0 requests-toolbelt==0.9.1 responses==0.10.6 rsa==4.0 ruamel.yaml==0.16.5 ruamel.yaml.clib==0.2.0 s3transfer==0.2.1 sacremoses==0.0.35 scikit-learn==0.21.2 scipy==1.4.1 SecretStorage==3.1.2 semantic-version==2.8.4 sentence-splitter==1.4 sentence-transformers==0.2.6.1 sentencepiece==0.1.82 setuptools-rust==0.10.6 singledispatch==3.4.0.3 six==1.12.0 smart-open==1.8.4 snowballstemmer==2.0.0 sortedcontainers==2.2.2 soupsieve==2.0 spacy==2.1.4 Sphinx==2.2.1 sphinxcontrib-applehelp==1.0.1 sphinxcontrib-devhelp==1.0.1 sphinxcontrib-htmlhelp==1.0.2 sphinxcontrib-jsmath==1.0.1 sphinxcontrib-qthelp==1.0.2 sphinxcontrib-serializinghtml==1.1.3 sqlparse==0.3.0 srsly==0.0.5 starlette==0.13.4 tensorboard==1.15.0 tensorboardX==1.9 tensorflow-estimator==1.15.1 tensorflow-gpu==1.15.0 tensorflow-hub==0.8.0 termcolor==1.1.0 Theano==1.0.1 thinc==7.0.4 tokenizers==0.7.0 toml==0.10.0 torch==1.5.0 torchvision==0.6.0+cu101 tornado==6.0.3 tqdm==4.37.0 traitlets==4.3.3 transformers==2.11.0 twine==3.1.1 typed-ast==1.4.1 typer==0.2.1 typing-extensions==3.7.4.1 Unidecode==1.1.1 urllib3==1.25.6 uvicorn==0.11.5 uvloop==0.14.0 virtualenv==16.7.9 wasabi==0.4.0 wcwidth==0.1.7 webencodings==0.5.1 websockets==8.1 Werkzeug==0.16.0 word2number==1.1 wrapt==1.11.2 yarl==1.4.2 zipp==0.6.0 ```
Steps to reproduce
PretrainedTransformerTokenizer
with"dataset_reader.lazy": true
and"data_loader.num_workers" > 0
. E.g. I used this config with some overrides (see below).Example source:
```bash allennlp train mnli_roberta.jsonnet \ --serialization-dir ./debug \ --overrides "{'dataset_reader.lazy': true, 'data_loader.batch_sampler': null, 'data_loader.num_workers': 1}" \ -f ```