How do you train with those NOT-TEXT elements.

doc-analysis / DocBank

DocBank: A Benchmark Dataset for Document Layout Analysis

Apache License 2.0


How do you train with those NOT-TEXT elements. #21

Open linan142857 opened 4 years ago

linan142857 commented 4 years ago

Dear author, some documents contain a massive number of non-text elements, such as hundreds of thousands of "##LTLine##" entries. How do you actually deal with them? For example, do you train and predict on all those elements with the text '##LTLine##'?

Thank you!

liminghao1630 commented 4 years ago

Yes. We regard '##LTLine##' as a special token during training and prediction.

NandreyN commented 3 years ago

> Yes. We regard '##LTLine##' as a special token during training and prediction.

Hi! Could you please tell me the integer identifiers of the ##LTLine## and ##LTFigure## tokens within LayoutLM's vocabulary?

Thanks

liminghao1630 commented 3 years ago

In fact, we did not add them to the vocabulary. Like any other text, they are split into tokens by the tokenizer and labeled in the way I mentioned in #25.
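To illustrate what "not in the vocabulary" might mean in practice, here is a minimal sketch of how a BERT-style uncased WordPiece tokenizer could break up such a placeholder string. It is not the authors' code: `TOY_VOCAB`, `split_on_punctuation`, `wordpiece`, and `tokenize` are hypothetical names, and the vocabulary is a toy one, so the pieces produced by LayoutLM's real vocabulary may differ. The labeling scheme from #25 is not reproduced here.

```python
import string

# Toy vocabulary: an assumption for illustration, NOT LayoutLM's real vocabulary.
TOY_VOCAB = {"[UNK]", "#", "lt", "##line", "##figure"}


def split_on_punctuation(text):
    """Lowercase the text; split on whitespace and isolate each punctuation char."""
    pieces, current = [], ""
    for ch in text.lower():
        if ch in string.punctuation or ch.isspace():
            if current:
                pieces.append(current)
                current = ""
            if not ch.isspace():
                pieces.append(ch)
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


def wordpiece(word, vocab):
    """Greedy longest-match-first split of a single word into sub-tokens."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        tokens.append(match)
        start = end
    return tokens


def tokenize(text, vocab=TOY_VOCAB):
    """Pre-split on punctuation, then apply WordPiece to each piece."""
    out = []
    for piece in split_on_punctuation(text):
        out.extend(wordpiece(piece, vocab))
    return out


print(tokenize("##LTLine##"))
print(tokenize("##LTFigure##"))
```

Under these assumptions the placeholder has no single integer identifier: it maps to a sequence of ordinary token ids, one per piece, which would be consistent with the answer above that the strings were never added to the vocabulary.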

NandreyN commented 3 years ago

Thanks