Closed: yunzhongOvO closed this issue 4 years ago
What is your tokenizers version?
what does "tokenizers" mean?Do I need to install it specifically?
What's more, I also got errors when converting wikibooks and wikipedia to TFRecords:
$ python3 -m preprocess --dataset=wikibooks --shards=2048 --processes=64 --cache_dir=/data/bert_train/wikibooks --tfrecords_dir=/data/bert_train/wikibooks_512seq/
2020-08-08 15:21:29.765348: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Loading dataset: wikibooks
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/huzhongzhe/deep-learning-models/models/nlp/common/preprocess.py", line 113, in <module>
dset_wikipedia._data = dset_wikipedia.data.cast(dset_books.schema)
AttributeError: 'Dataset' object has no attribute 'schema'
$ python3 -m preprocess --dataset=wikipedia --shards=2048 --processes=64 --cache_dir=/data/bert_train/wikipedia --tfrecords_dir=/data/bert_train/wikipedia_512seq/
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/huzhongzhe/deep-learning-models/models/nlp/common/preprocess.py", line 127, in <module>
load_from_cache_file=load_from_cache_file,
File "/home/huzhongzhe/.local/lib/python3.6/site-packages/nlp/arrow_dataset.py", line 1064, in filter
verbose=verbose,
File "/home/huzhongzhe/.local/lib/python3.6/site-packages/nlp/arrow_dataset.py", line 960, in map
writer.write_batch(batch)
File "/home/huzhongzhe/.local/lib/python3.6/site-packages/nlp/arrow_writer.py", line 190, in write_batch
pa_table: pa.Table = pa.Table.from_pydict(batch_examples, schema=self._schema)
File "pyarrow/types.pxi", line 933, in __iter__
KeyError: "The passed mapping doesn't contain the following field(s) of the schema: title"
I figured out that my tokenizers version is 0.8.1rc1.
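For anyone checking their own install, a quick way to print the version (a standard check, not specific to this repo):

$ python3 -c "import tokenizers; print(tokenizers.__version__)"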
Related to your first issue, there's an unreleased fix in the master branch of tokenizers. Running these commands to install tokenizers from source will fix that.
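A typical from-source install of the Python bindings looks roughly like this (a sketch, assuming a Rust toolchain is available, e.g. via rustup; not necessarily the exact commands from this thread):

$ git clone https://github.com/huggingface/tokenizers
$ cd tokenizers/bindings/python
$ pip install setuptools_rust
$ pip install -e .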
The schema attribute was recently removed in nlp==0.4.0 via https://github.com/huggingface/nlp/pull/423, so I've updated the script to reflect that.
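Judging from the traceback, the change is presumably along these lines (a sketch inferred from the error, not necessarily the exact commit): read the schema from the underlying Arrow table, which Dataset.data exposes, rather than from the Dataset itself.

# before (fails on nlp==0.4.0, where Dataset no longer exposes .schema):
dset_wikipedia._data = dset_wikipedia.data.cast(dset_books.schema)
# after (hypothetical fix: take the schema from the underlying Arrow table):
dset_wikipedia._data = dset_wikipedia.data.cast(dset_books.data.schema)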
Errors occur when running preprocess.py. The data is bookcorpus. I run it with TensorFlow, and have already installed nlp 0.4.0 and transformers 3.0.2.
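A quick way to confirm the exact installed versions when reproducing this (a standard check, not from the thread):

$ python3 -c "import nlp, transformers; print(nlp.__version__, transformers.__version__)"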