NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.
MIT License

LiLt segment positions and Tokenizer #311

Open Altimis opened 1 year ago

Altimis commented 1 year ago

Hi @NielsRogge. Thank you so much for your notebook, it's been helpful to the huggingface community.

I'll start by giving some context about my use case. I'm currently fine-tuning the LiLT model on a custom dataset (invoices), using the FUNSD format. I don't know if this is the correct approach, but in order to capture segments I use the B-/I-/E-/S- tagging scheme: for a class A whose text is "ABC 123 DE", 'ABC' is labeled B-A, '123' is labeled I-A, and 'DE' is labeled E-A (a single word would be labeled S-A). I used this approach to solve the segment issue, but I saw that you mentioned: "Please always use an OCR engine that can recognize segments, and use the same bounding boxes for all words that make up a segment. This will greatly improve performance." Does that mean I could consider the whole "ABC 123 DE" as one segment, giving all three words the same bounding box (the union of the three word bounding boxes)? I'm still talking about the fine-tuning (training) part.
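To make my question concrete, this is what I have in mind for the segment-level box: take the union of the word boxes and assign it to every word of the segment (the helper names here are mine, just for illustration):

```python
def union_box(boxes):
    """Smallest box [x0, y0, x1, y1] covering all the given word boxes."""
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[2] for b in boxes)
    y1 = max(b[3] for b in boxes)
    return [x0, y0, x1, y1]

# Example segment: "ABC 123 DE", with one box per word.
words = ["ABC", "123", "DE"]
word_boxes = [[10, 20, 40, 35], [45, 20, 70, 35], [75, 20, 95, 35]]

segment_box = union_box(word_boxes)   # [10, 20, 95, 35]
boxes = [segment_box] * len(words)    # same box for every word in the segment
labels = ["B-A", "I-A", "E-A"]        # my current tagging of the same segment
```

So the words and labels would stay word-level, but all three words would share `segment_box` instead of their individual boxes.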

You also mentioned that, for inference (I guess), we need to use an OCR engine that returns segments. What are some suggestions? Can I use the Google Cloud Vision OCR for that?
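Whatever OCR engine I end up using, my understanding is that LiLT (like the LayoutLM models) expects bounding boxes normalized to a 0-1000 scale relative to the page size, so I would convert the engine's pixel coordinates like this (a sketch under that assumption):

```python
def normalize_box(box, page_width, page_height):
    """Convert a pixel box [x0, y0, x1, y1] to the 0-1000 scale LiLT expects."""
    return [
        int(1000 * box[0] / page_width),
        int(1000 * box[1] / page_height),
        int(1000 * box[2] / page_width),
        int(1000 * box[3] / page_height),
    ]

# e.g. a segment box in pixel coordinates on a 1240x1754 px page
seg_box = [124, 350, 620, 438]
print(normalize_box(seg_box, 1240, 1754))  # [100, 199, 500, 249]
```

Please correct me if the normalization step should be done differently.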

Also, regarding the RoBERTa model: which model would you suggest for the French language, instead of Roberta-en-base?

Thank you again for your work. I would appreciate your response.

Best