test_ernie_token_cls gives UnboundLocalError: local variable 'eop_idx' referenced before assignment

maxjeblick commented 1 year ago

Running https://github.com/NormXU/ERNIE-Layout-Pytorch/blob/main/examples/test_ernie_token_cls.py gives the traceback below (using pretrain_torch_model_or_path = 'Norm/ERNIE-Layout-Pytorch'). I also had to run pip install sentencepiece, as it was missing from requirements.txt.

It looks as if the sentencepiece tokenization may be broken; it returns an empty list for me.

sentencepiece_processor.cc(922) LOG(ERROR) src/sentencepiece_processor.cc(289) [model_] Model is not initialized.
Returns default value 0
Traceback (most recent call last):
  File "/media/max/3tb drive/PycharmProjects/ERNIE-Layout-Pytorch/examples/test_ernie_token_cls.py", line 48, in <module>
    main()
  File "/media/max/3tb drive/PycharmProjects/ERNIE-Layout-Pytorch/examples/test_ernie_token_cls.py", line 25, in main
    tokenized_res = ernie_tokenize_layout(tokenizer, context, layout, labels)
  File "/media/max/3tb drive/PycharmProjects/ERNIE-Layout-Pytorch/networks/model_util.py", line 127, in ernie_tokenize_layout
    context_encodings = prepare_context_info(tokenizer, context, layout)
  File "/media/max/3tb drive/PycharmProjects/ERNIE-Layout-Pytorch/networks/model_util.py", line 81, in prepare_context_info
    missing_tail_blank = len(ctx) - eop_idx
UnboundLocalError: local variable 'eop_idx' referenced before assignment

Process finished with exit code 1

NormXU commented 1 year ago

Could you share your context list used in ernie_tokenize_layout(tokenizer, context, layout, labels)? I want to reproduce the bug.

maxjeblick commented 1 year ago

context is the same as in the example, i.e. context = ['This is an example document', 'All ocr boxes are inserted into this list']. I think the issue is that tokenizer.encode(text, add_special_tokens=False) and tokenizer.tokenize(text) always return an empty list [], together with the following warning:

sentencepiece_processor.cc(922) LOG(ERROR) src/sentencepiece_processor.cc(289) [model_] Model is not initialized.
Returns default value 0

I'm able to reproduce this issue on both an Ubuntu and a Mac setup. Maybe it is an issue with the Norm/ERNIE-Layout-Pytorch files?
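For readers wondering how an empty token list turns into this UnboundLocalError, here is a minimal sketch of the general pattern. It is not the code in networks/model_util.py; last_token_end and last_token_end_checked are hypothetical names, and the assumption is only that eop_idx gets assigned inside a loop over the tokens.

```python
def last_token_end(tokens):
    # Hypothetical stand-in, NOT the repository's prepare_context_info.
    # eop_idx is only bound if the loop body runs at least once.
    for idx, _ in enumerate(tokens):
        eop_idx = idx + 1
    return eop_idx  # raises UnboundLocalError when tokens is empty


def last_token_end_checked(tokens):
    # Same logic, but fails early with a message that points at the tokenizer.
    if not tokens:
        raise ValueError(
            "tokenizer returned no tokens; check that the sentencepiece "
            "model file was actually loaded"
        )
    return last_token_end(tokens)
```

With a guard like this, the failure would point at the empty tokenizer output rather than at a variable further down the call chain.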

NormXU commented 1 year ago

That's weird; I failed to reproduce this bug in my environment. It should not be an issue with the Norm/ERNIE-Layout-Pytorch files: I have distributed these files to several developers, and none of them reported any problems.

In my environment, sentencepiece==0.1.97. Perhaps a version mismatch leads to the warning?

Or could you do me a favor and set a breakpoint at this line?

It is used to load the sentencepiece file. Check whether the file is loaded as expected.

It looks like you are using the tokenizer with a broken sentencepiece vocab. Let's re-download the sentencepiece file and see what is going on.

maxjeblick commented 1 year ago

Debugging yields: name_or_path=Norm/ERNIE-Layout-Pytorch sentencepiece_model_file=sentencepiece.bpe.model

>>> os.path.isfile(self.sentencepiece_model_file)
False

I was able to run the example code (tokenization) after manually downloading sentencepiece.bpe.model and changing name_or_path to the folder containing the corresponding files. It seems that ErnieLayoutTokenizer.from_pretrained('Norm/ERNIE-Layout-Pytorch') does not download the sentencepiece model, only config.json and tokenizer_config.json.
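A small pre-flight check along these lines would catch the missing file before the tokenizer is built. This is only a sketch: missing_tokenizer_files is a hypothetical helper, and the list of required files is assumed from the three file names mentioned in this thread.

```python
import os

# Assumption: these are the files a local model folder needs, taken from
# the names mentioned in this thread. Adjust if the repository needs more.
REQUIRED_FILES = (
    "config.json",
    "tokenizer_config.json",
    "sentencepiece.bpe.model",
)


def missing_tokenizer_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in model_dir."""
    return [
        name
        for name in required
        if not os.path.isfile(os.path.join(model_dir, name))
    ]
```

Calling missing_tokenizer_files on the local folder before passing it as name_or_path returns an empty list when everything is in place, and otherwise names the files that still need to be downloaded.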

NormXU commented 1 year ago

That is good to know. I will try to figure out why the download script misses sentencepiece.bpe.model. Thank you for your feedback.

NormXU / ERNIE-Layout-Pytorch

test_ernie_token_cls gives UnboundLocalError: local variable 'eop_idx' referenced before assignment #6