Reading Order of Textlines

LibraryOfCongress / newspaper-navigator

The Unlicense


@bcglee Thank you for your hard work.

Let's say that I have detected text lines. The issue is that Detectron2 saves the boxes in arbitrary order, without any layout structure. Setting heuristic rules (for example, for documents with a single column, three columns with images, etc.) is time-consuming, and there isn't a one-size-fits-all rule, especially with very complex layouts.

The Question:

Is there a machine learning algorithm that can be trained to sort or rank the text lines, organizing them into the correct reading order? Or perhaps an object-ranking approach, or something similar?

Before: [image]

After: [image]

Thanks so much for your kind words, @deepseek, and great question! If you just have the bounding box coordinates, I might recommend using a clustering algorithm (or the like) to detect the larger structural units such as paragraphs, then sorting from the top to the bottom of the page, then left to right (for papers in English). If you have access to the underlying text via OCR, I'd recommend training an ML model on the text in each line, along with the corresponding bounding box coordinates - leveraging the text in each line should improve the ability to identify coherent paragraphs and articles, which in turn will help with page structure. If you're interested in article segmentation, I'd recommend taking a look at the work being done by the IMPRESSO project, the NEWSEYE project, KB Lab Research, and the Google Newspaper Project. I suspect their work may be of interest to you!
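As a rough illustration of the first suggestion (bounding boxes only), here is a minimal sketch, not code from the Newspaper Navigator project. It assumes boxes are `(x0, y0, x1, y1)` tuples with the origin at the top-left, and it swaps in a very simple 1-D gap rule on the left edge as the "clustering" step to find columns. The function names and the `gap` threshold are hypothetical, and a real newspaper page would likely need a proper clustering algorithm and handling for headlines that span several columns.

```python
from typing import List, Tuple

# Assumed box format: (x0, y0, x1, y1), origin at the top-left of the page.
Box = Tuple[float, float, float, float]


def group_into_columns(boxes: List[Box], gap: float) -> List[List[Box]]:
    """Group boxes into columns using a simple gap rule on the left edge.

    Boxes are visited in order of increasing x0; whenever the left edge
    jumps by more than `gap`, a new column is started.
    """
    columns: List[List[Box]] = []
    prev_x0 = None
    for box in sorted(boxes, key=lambda b: b[0]):
        if prev_x0 is None or box[0] - prev_x0 > gap:
            columns.append([])
        columns[-1].append(box)
        prev_x0 = box[0]
    return columns


def reading_order(boxes: List[Box], gap: float = 50.0) -> List[Box]:
    """Return boxes column by column (left to right), top to bottom within each."""
    ordered: List[Box] = []
    for column in group_into_columns(boxes, gap):
        ordered.extend(sorted(column, key=lambda b: b[1]))
    return ordered


boxes = [
    (320, 40, 580, 60),  # right column, first line
    (20, 70, 280, 90),   # left column, second line
    (20, 40, 280, 60),   # left column, first line
    (322, 70, 580, 90),  # right column, second line
]
for box in reading_order(boxes):
    print(box)
```

On this toy input, the two left-column lines come out first (top to bottom), followed by the two right-column lines. The `gap` value would need tuning per scan resolution, which is one reason a learned model over text and coordinates may generalize better on complex layouts.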


Reading Order of Textlines #1