Error when non-UTF-8 encoding char occurs in a text during inference

ahmedbr commented 1 year ago

Hi,

I got the following error while running inference on some text samples. After some investigation, the error seems to occur whenever the input text contains a non-UTF-8 character. In that case, the sizes of the pred and segment arrays in "arabiner/trainers/BertNestedTrainer.py" (lines 187-188) differ by more than 1 because of the non-UTF-8 character(s) in the input text. (To be confirmed?)

Traceback (most recent call last):
  File "ner_tester.py", line 35, in <module>
    run_inference_for_file(file_path)
  File "ner_tester.py", line 23, in run_inference_for_file
    batch_list = inference_model.predict(ner_inputs, lang)
  File "/app/model/ner_inference.py", line 148, in predict
    segments = self.tagger.infer(dataloader)
  File "/app/arabiner/trainers/BertNestedTrainer.py", line 174, in infer
    segments = self.to_segments(segments, preds, valid_lens, dataloader.dataset.vocab)
  File "/app/arabiner/trainers/BertNestedTrainer.py", line 193, in to_segments
    for tag_id, vocab in zip(pred[i, :].int().tolist(), vocab.tags[1:])]
IndexError: index 146 is out of bounds for dimension 0 with size 146

You can reproduce the error by running the inference code on the following text sample:

text_sampel = "يبدو أن فكر التنظيم الداعشيّ -الذي ينتشر بصورةٍ واسعة عبر وسائل التواصل الاجتماعي، ومقاطع فيديو دعائية بارعة- قد نجح في إلهام موجةٍ من العنف على مدار ما يزيد عن عامٍ: تتضمن إطلاق النار في سان بيرناردينيو بكاليفورنيا، وقتل العديد من رواد مقهى للمثليين بأورلاندو في شهر ‏ ‏‏يونيو/‏‏حزيران، والهجمة القاتلة في أول شهر ‏يوليو/‏تموز 2016 على مقهى آخر ببنغلاديش. يُضاف إليها الهجمات التي يُرجح أن واضعي خططها هم أكبر مُهندسي العمليات في الدولة الإسلامية، مثل هجمات باريس في نوفمبر /‏تشرين الثاني 2015، وتفجيرات بروكسل في مارس/‏آذار ‏2016. ‏"
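One way to check the hypothesis above is to list the invisible characters in the input. This is only a sketch, and it rests on an assumption: that the problematic characters are Unicode format characters (category Cf, for example U+200F RIGHT-TO-LEFT MARK) rather than bytes that are literally invalid UTF-8, since a Python str cannot hold those. The helper names `find_format_chars` and `strip_format_chars` are hypothetical and are not part of ArabicNER.

```python
import unicodedata


def find_format_chars(text):
    """Return (index, code point, name) for each Unicode format (Cf) character."""
    hits = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return hits


def strip_format_chars(text):
    """Return text with all Unicode format (Cf) characters removed."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


if __name__ == "__main__":
    # Short stand-in for the full sample; the escapes make the hidden marks visible.
    sample = "شهر \u200f\u200fيونيو/\u200fحزيران"
    for hit in find_format_chars(sample):
        print(hit)
    print(repr(strip_format_chars(sample)))
```

If the full sample reports such characters, running inference on the stripped text would show whether they are what triggers the IndexError. Note that category Cf also covers characters that carry meaning in some scripts (such as U+200C ZERO WIDTH NON-JOINER), so stripping them is a diagnostic step here, not necessarily the right preprocessing.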

mohammedkhalilia commented 1 year ago

Hi Ahmed,

I tested inference on the text you provided using the command below, and it worked without errors. It seems you are wrapping the inference in other scripts (ner_inference.py and ner_tester.py) that are not part of the ArabicNER package. To isolate the problem to ArabicNER, can you please run the command below and see if it works?

arabicner/bin/python -u arabiner/bin/infer.py \
      --model_path path/to/model \
      --batch_size 16 \
       --text "يبدو أن فكر التنظيم الداعشيّ -الذي ينتشر بصورةٍ واسعة عبر وسائل التواصل الاجتماعي، ومقاطع فيديو دعائية بارعة- قد نجح في إلهام موجةٍ من العنف على مدار ما يزيد عن عامٍ: تتضمن إطلاق النار في سان بيرناردينيو بكاليفورنيا، وقتل العديد من رواد مقهى للمثليين بأورلاندو في شهر ‏ ‏‏يونيو/‏‏حزيران، والهجمة القاتلة في أول شهر ‏يوليو/‏تموز 2016 على مقهى آخر ببنغلاديش. يُضاف إليها الهجمات التي يُرجح أن واضعي خططها هم أكبر مُهندسي العمليات في الدولة الإسلامية، مثل هجمات باريس في نوفمبر /‏تشرين الثاني 2015، وتفجيرات بروكسل في مارس/‏آذار ‏2016. ‏"

mohammedkhalilia commented 1 year ago

I will close this issue due to lack of response and inactivity.

SinaLab / ArabicNER

Error when non-UTF-8 encoding char occurs in a text during inference #1