aws-samples / deep-learning-models

Natural language processing & computer vision models optimized for AWS

Export to TFRecords Error #30

Closed: yunzhongOvO closed this issue 4 years ago

yunzhongOvO commented 4 years ago

Errors occur when running preprocess.py. The dataset is bookcorpus.

$ python3 -m preprocess --dataset=bookcorpus --shards=2048 --processes=64 --cache_dir=/data/bert_train/bookcorpus --tfrecords_dir=/data/bert_train/bookcorpus_512seq

 File "/home/huzhongzhe/deep-learning-models/models/nlp/common/preprocess.py",
line 170, in <module>
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
  File "/home/huzhongzhe/.local/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1140, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/home/huzhongzhe/.local/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1287, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/huzhongzhe/.local/lib/python3.6/site-packages/transformers/tokenization_bert.py", line 623, in __init__
    wordpieces_prefix=wordpieces_prefix,
  File "/home/huzhongzhe/.local/lib/python3.6/site-packages/tokenizers/implementations/bert_wordpiece.py", line 57, in __init__
    raise TypeError("sep_token not found in the vocabulary")
TypeError: sep_token not found in the vocabulary

I am running TensorFlow and have already installed nlp==0.4.0 and transformers==3.0.2.

jarednielsen commented 4 years ago

What is your tokenizers version?
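
For reference, a couple of quick ways to check which tokenizers build is installed (transformers pulls tokenizers in as a dependency, so it is installed even if you never installed it explicitly):

$ python3 -c "import tokenizers; print(tokenizers.__version__)"
$ pip3 show tokenizers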

yunzhongOvO commented 4 years ago

What is your tokenizers version?

what does "tokenizers" mean?Do I need to install it specifically?

What's more, I also got errors when converting wikibooks and wikipedia to TFRecords:

$ python3 -m preprocess --dataset=wikibooks --shards=2048 --processes=64 --cache_dir=/data/bert_train/wikibooks --tfrecords_dir=/data/bert_train/wikibooks_512seq/

2020-08-08 15:21:29.765348: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Loading dataset: wikibooks
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/huzhongzhe/deep-learning-models/models/nlp/common/preprocess.py", line 113, in <module>
    dset_wikipedia._data = dset_wikipedia.data.cast(dset_books.schema)
AttributeError: 'Dataset' object has no attribute 'schema'

$ python3 -m preprocess --dataset=wikipedia --shards=2048 --processes=64 --cache_dir=/data/bert_train/wikipedia --tfrecords_dir=/data/bert_train/wikipedia_512seq/

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/huzhongzhe/deep-learning-models/models/nlp/common/preprocess.py", line 127, in <module>
    load_from_cache_file=load_from_cache_file,
  File "/home/huzhongzhe/.local/lib/python3.6/site-packages/nlp/arrow_dataset.py", line 1064, in filter
    verbose=verbose,
  File "/home/huzhongzhe/.local/lib/python3.6/site-packages/nlp/arrow_dataset.py", line 960, in map
    writer.write_batch(batch)
  File "/home/huzhongzhe/.local/lib/python3.6/site-packages/nlp/arrow_writer.py", line 190, in write_batch
    pa_table: pa.Table = pa.Table.from_pydict(batch_examples, schema=self._schema)
  File "pyarrow/types.pxi", line 933, in __iter__
KeyError: "The passed mapping doesn't contain the following field(s) of the schema: title"
yunzhongOvO commented 4 years ago

I figured out that the tokenizers version is 0.8.1rc1.

jarednielsen commented 4 years ago

Regarding your first issue: there is an unreleased fix on the master branch of tokenizers. Running the commands below to install tokenizers from source will fix it.

https://github.com/aws-samples/deep-learning-models/blob/5b1194a862c026a30d5dfff46979cb7e80869e81/models/nlp/Dockerfile#L122-L128
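
For readers without the Dockerfile handy, here is a rough sketch of what a from-source install of tokenizers generally involves. The linked Dockerfile lines are authoritative; the exact commands below are an assumption, not a copy of them.

# tokenizers is a Rust extension, so a Rust toolchain is needed for a source build
$ curl https://sh.rustup.rs -sSf | sh -s -- -y
$ source $HOME/.cargo/env
# build and install the Python bindings from the master branch
$ pip3 install "git+https://github.com/huggingface/tokenizers.git#egg=tokenizers&subdirectory=bindings/python"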

The schema attribute was recently removed in nlp==0.4.0 via https://github.com/huggingface/nlp/pull/423, so I've updated the script to reflect that.
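
For anyone hitting the same AttributeError on an older checkout, a minimal sketch of the idea behind that change, assuming the underlying pyarrow Table is still exposed as Dataset.data (dset_wikipedia and dset_books are the variable names used in preprocess.py):

# cast against the Arrow table's schema rather than the removed Dataset.schema attribute
dset_wikipedia._data = dset_wikipedia.data.cast(dset_books.data.schema)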