Closed maxjeblick closed 1 year ago
Could you share the context list used in ernie_tokenize_layout(tokenizer, context, layout, labels)? I want to reproduce the bug.
context is the same as in the example, i.e. context = ['This is an example document', 'All ocr boxes are inserted into this list'].
I think the issue is that tokenizer.encode(text, add_special_tokens=False) and tokenizer.tokenize(text) always return an empty list [], together with the following warning:
sentencepiece_processor.cc(922) LOG(ERROR) src/sentencepiece_processor.cc(289) [model_] Model is not initialized.
Returns default value 0
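A quick way to surface this symptom early is a small sanity check around the tokenizer. This is only a sketch: check_tokenizer_output is a hypothetical helper, not part of the repository, and it assumes nothing beyond the tokenizer exposing a tokenize(text) method as used above.

```python
def check_tokenizer_output(tokenizer, text):
    """Return the tokens for `text`, raising if non-empty text yields no tokens.

    Hypothetical helper: it only assumes `tokenizer.tokenize(text)` exists.
    """
    tokens = tokenizer.tokenize(text)
    if text.strip() and not tokens:
        # An empty result for real text suggests the underlying
        # sentencepiece model was never loaded.
        raise RuntimeError(
            f"tokenizer returned no tokens for {text!r}; "
            "the sentencepiece model may not be initialized"
        )
    return tokens
```

Calling it once on a sample string such as 'This is an example document' right after loading the tokenizer would turn the silent empty list into an explicit error.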
I'm able to reproduce this issue on both an Ubuntu and a Mac setup.
Maybe there is some issue with the Norm/ERNIE-Layout-Pytorch files?
That's weird, I failed to reproduce this bug in my environment. It should not be an issue with the Norm/ERNIE-Layout-Pytorch files. I have distributed these files to several developers, and none of them reported any problems to me.
In my environment, sentencepiece==0.1.97. Perhaps a version mismatch leads to the warning?
Or could you do me a favor and set a breakpoint at this line?
This is used to load the sentencepiece file. Check whether this file can be loaded as expected.
It looks like you are using the tokenizer with a broken sentencepiece vocab. Let's re-download the sentencepiece file and see what is going on.
Debugging yields:
name_or_path=Norm/ERNIE-Layout-Pytorch
sentencepiece_model_file=sentencepiece.bpe.model
>>> os.path.isfile(self.sentencepiece_model_file)
False
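The debugging output suggests the loader is handed a path that does not exist, and sentencepiece then only logs an error instead of failing. A minimal sketch of a fail-fast guard is shown here. load_sentencepiece_checked is a hypothetical helper, not the repository's code; it assumes the sentencepiece package is installed once the file check passes.

```python
import os


def load_sentencepiece_checked(model_file):
    """Load a sentencepiece model, failing loudly if the file is missing.

    Hypothetical helper, not part of ERNIE-Layout-Pytorch.
    """
    if not os.path.isfile(model_file):
        raise FileNotFoundError(
            f"sentencepiece model file not found: {model_file!r}"
        )
    # Imported lazily so the file check works even without the package.
    import sentencepiece as spm

    processor = spm.SentencePieceProcessor()
    processor.Load(model_file)
    return processor
```

With a guard like this, a missing sentencepiece.bpe.model raises immediately rather than producing empty token lists later.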
I was able to run the example code (tokenization) after manually downloading sentencepiece.bpe.model and changing name_or_path to the folder containing the corresponding files. It seems that ErnieLayoutTokenizer.from_pretrained('Norm/ERNIE-Layout-Pytorch') does not download the sentencepiece model, only config.json and tokenizer_config.json.
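Before pointing name_or_path at a local folder, it can help to confirm that the folder really holds all three files mentioned in this thread. A small sketch, assuming those three file names are the complete set the tokenizer needs (the thread does not confirm this); missing_tokenizer_files is a hypothetical helper.

```python
import os

# File names taken from this thread; assumed, not verified, to be the full set.
REQUIRED_FILES = (
    "config.json",
    "tokenizer_config.json",
    "sentencepiece.bpe.model",
)


def missing_tokenizer_files(folder, required=REQUIRED_FILES):
    """Return the required file names that are absent from `folder`."""
    return [
        name
        for name in required
        if not os.path.isfile(os.path.join(folder, name))
    ]
```

An empty list means the folder is complete; any returned name is a file that still has to be downloaded manually.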
That is good to know. I will try to figure out why the downloading script misses sentencepiece.bpe.model. Thank you for your feedback.
Running https://github.com/NormXU/ERNIE-Layout-Pytorch/blob/main/examples/test_ernie_token_cls.py gives the traceback attached (using pretrain_torch_model_or_path = 'Norm/ERNIE-Layout-Pytorch'). I had to run pip install sentencepiece, as it was missing in requirements.txt. It looks as if the sentencepiece tokenization may be broken; it returns an empty list for me.