LibraryOfCongress / newspaper-navigator

The Unlicense

Reading Order of Textlines #1

Closed ghost closed 4 years ago

ghost commented 4 years ago

@bcglee Thank you for your hard work.

Let's say that I have detected text lines. The issue is that Detectron2 saves the boxes in an arbitrary order, without any layout structure. Setting heuristic rules (for example, for documents with a single column, or 3 columns with images, etc.) is time-consuming, and there isn't a one-size-fits-all rule, especially with very complex layouts.

The Question:

[Image: Before]

[Image: After]

bcglee commented 4 years ago

Thanks so much for your kind words, @deepseek, and great question! If you just have the bounding box coordinates, I might recommend using a clustering algorithm (or the like) to detect the larger structural units such as paragraphs, then sorting from the top to the bottom of the page, then left to right (for papers in English). If you have access to the underlying text via OCR, I'd recommend training an ML model on the text in each line, along with the corresponding bounding box coordinates - leveraging the text in each line should improve the ability to identify coherent paragraphs and articles, which in turn will help with page structure. If you're interested in article segmentation, I'd recommend taking a look at the work being done by the IMPRESSO project, the NEWSEYE project, KB Lab Research, and the Google Newspaper Project. I suspect their work may be of interest to you!
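
As a rough illustration of the bounding-box-only suggestion above, here is a minimal Python sketch. It is one possible interpretation, not the method used in this repository: it swaps a general clustering algorithm for a simple gap-based grouping of boxes into columns, then reads columns left to right and lines top to bottom within each column. The `(x1, y1, x2, y2)` box format with a top-left origin, the function names, and the `min_gap` threshold are all assumptions made for the example.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels, origin at the top-left of the page.
Box = Tuple[float, float, float, float]


def group_into_columns(boxes: List[Box], min_gap: float = 20.0) -> List[List[Box]]:
    """Group boxes into columns by merging x-ranges that overlap or nearly touch.

    min_gap is a hypothetical tuning parameter: the horizontal whitespace
    (in pixels) that must separate two boxes for them to count as
    belonging to different columns.
    """
    columns: List[List[Box]] = []
    right_edge = float("-inf")
    for box in sorted(boxes, key=lambda b: b[0]):
        # Start a new column when this box begins clearly to the right
        # of everything collected in the current column.
        if not columns or box[0] > right_edge + min_gap:
            columns.append([box])
            right_edge = box[2]
        else:
            columns[-1].append(box)
            right_edge = max(right_edge, box[2])
    return columns


def reading_order(boxes: List[Box], min_gap: float = 20.0) -> List[Box]:
    """Return boxes column by column (left to right), top to bottom within each."""
    ordered: List[Box] = []
    for column in group_into_columns(boxes, min_gap):
        ordered.extend(sorted(column, key=lambda b: (b[1], b[0])))
    return ordered


if __name__ == "__main__":
    # Two columns of two lines each, given in scrambled order.
    detected = [
        (300, 10, 500, 30),
        (10, 50, 200, 70),
        (10, 10, 200, 30),
        (300, 50, 500, 70),
    ]
    for box in reading_order(detected):
        print(box)
```

Note that this sketch breaks down on exactly the complex layouts mentioned in the question: a full-width headline or image spans every column and would merge them into one group, so such elements would need to be handled separately (for example, by splitting the page into horizontal bands first).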