AI4Bharat / Indic-BERT-v1

Indic-BERT-v1: BERT-based Multilingual Model for 11 Indic Languages and Indian-English. For latest Indic-BERT v2, check: https://github.com/AI4Bharat/IndicBERT
https://indicnlp.ai4bharat.org
MIT License

KeyError: '[UNK]' [[{{node PyFunc}}]] [[IteratorGetNext]] #37

Closed · kusumlata123 closed this issue 2 years ago

kusumlata123 commented 2 years ago

```
Traceback (most recent call last):
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in __call__
    ret = func(*args)
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 594, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))
  File "extract_features.py", line 244, in convert_examples_to_features
    window_size)
  File "extract_features.py", line 188, in _convert_example_to_features
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
  File "/home/dr/Desktop/Hindi-coref/extract_bert_features/tokenization.py", line 242, in convert_tokens_to_ids
    return convert_by_vocab(self.vocab, tokens)
  File "/home/dr/Desktop/Hindi-coref/extract_bert_features/tokenization.py", line 160, in convert_by_vocab
    output.append(vocab[item])
KeyError: '[UNK]'
```

```
ERROR:tensorflow:Error recorded from prediction_loop: exceptions.KeyError: '[UNK]'
E1223 17:05:18.867840 140097924953920 error_handling.py:75] Error recorded from prediction_loop: exceptions.KeyError: '[UNK]'
Traceback (most recent call last):
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in __call__
    ret = func(*args)
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 594, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))
  File "extract_features.py", line 244, in convert_examples_to_features
    window_size)
  File "extract_features.py", line 188, in _convert_example_to_features
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
  File "/home/dr/Desktop/Hindi-coref/extract_bert_features/tokenization.py", line 242, in convert_tokens_to_ids
    return convert_by_vocab(self.vocab, tokens)
  File "/home/dr/Desktop/Hindi-coref/extract_bert_features/tokenization.py", line 160, in convert_by_vocab
    output.append(vocab[item])
KeyError: '[UNK]'
	 [[{{node PyFunc}}]]
	 [[IteratorGetNext]]
```

```
INFO:tensorflow:prediction_loop marked as finished
I1223 17:05:18.869065 140097924953920 error_handling.py:101] prediction_loop marked as finished
WARNING:tensorflow:Reraising captured error
W1223 17:05:18.869143 140097924953920 error_handling.py:135] Reraising captured error
  0%|          | 0/2451534 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "extract_features.py", line 338, in <module>
    tf.compat.v1.app.run()
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "extract_features.py", line 304, in main
    for result in estimator.predict(input_fn, yield_single_examples=True):
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3078, in predict
    rendezvous.raise_errors()
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3072, in predict
    yield_single_examples=yield_single_examples):
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 640, in predict
    preds_evaluated = mon_sess.run(predictions)
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
    run_metadata=run_metadata)
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: exceptions.KeyError: '[UNK]'
Traceback (most recent call last):
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in __call__
    ret = func(*args)
  File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 594, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))
  File "extract_features.py", line 244, in convert_examples_to_features
    window_size)
  File "extract_features.py", line 188, in _convert_example_to_features
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
  File "/home/dr/Desktop/Hindi-coref/extract_bert_features/tokenization.py", line 242, in convert_tokens_to_ids
    return convert_by_vocab(self.vocab, tokens)
  File "/home/dr/Desktop/Hindi-coref/extract_bert_features/tokenization.py", line 160, in convert_by_vocab
    output.append(vocab[item])
KeyError: '[UNK]'
	 [[{{node PyFunc}}]]
	 [[IteratorGetNext]]
```

Why is this error occurring?

kusumlata123 commented 2 years ago

This happens when I run it on my own dataset.

gowtham1997 commented 2 years ago

Again, it's very hard to debug these things without a reproducible Colab notebook.

From the files you shared in the email, I would try to check the tokenizer, as the error seems to be related to [UNK] (unknown token).
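To see why the lookup fails, note that the traceback dies on a plain dict lookup in `convert_by_vocab`. Indic-BERT's vocabulary comes from a SentencePiece model, whose unknown token is conventionally spelled `<unk>`, so a WordPiece-style `'[UNK]'` key can simply be absent. A minimal sketch (the vocab contents below are illustrative, not the actual model vocabulary):

```python
# Illustrative vocabularies (not the real model files): WordPiece-style
# vocabs contain "[UNK]", SentencePiece-style vocabs contain "<unk>".
wordpiece_style = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2}
sentencepiece_style = {"<unk>": 0, "<s>": 1, "</s>": 2}

def convert_by_vocab(vocab, items):
    # Mirrors the failing loop in tokenization.py: a bare dict lookup
    # raises KeyError for any token missing from the vocab.
    output = []
    for item in items:
        output.append(vocab[item])
    return output

print(convert_by_vocab(wordpiece_style, ["[UNK]"]))   # [0]
# convert_by_vocab(sentencepiece_style, ["[UNK]"])    # raises KeyError: '[UNK]'
```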

Can you try changing line 255 in extract_bert_features/extract_features.py from

```python
tokenizer = tokenization.FullTokenizer(
    vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
```

to

```python
tokenizer = tokenization.FullTokenizer(
    vocab_file=FLAGS.vocab_file,
    spm_model_file=<path to your sentencepiece tokenizer model>,
    do_lower_case=FLAGS.do_lower_case)
```

^ i.e., pass the path to your SentencePiece tokenizer model when initializing the tokenizer.
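Passing `spm_model_file` is the proper fix. As a stop-gap only (this is not the repository's code; `safe_convert_by_vocab` and the sample vocab are hypothetical), the lookup could also be made defensive by falling back to the vocabulary's own unknown-token id instead of raising:

```python
def safe_convert_by_vocab(vocab, items, unk_token="<unk>"):
    """Map tokens to ids, substituting the vocab's own unknown-token id
    for anything (including a literal '[UNK]') missing from the vocab."""
    unk_id = vocab.get(unk_token)
    return [vocab.get(item, unk_id) for item in items]

# Hypothetical SentencePiece-style vocabulary for illustration:
vocab = {"<unk>": 0, "▁नमस्ते": 7}
print(safe_convert_by_vocab(vocab, ["▁नमस्ते", "[UNK]"]))  # [7, 0]
```

This masks out-of-vocab tokens rather than fixing tokenization, so it should only be used to confirm the diagnosis.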

kusumlata123 commented 2 years ago

Solved the problem.
