How can I extract titles, headers, photos, and the respective article information from a newspaper?

Layout-Parser / layout-parser

A Unified Toolkit for Deep Learning Based Document Image Analysis

Apache License 2.0

You are asking for a complete document layout task! This is not an issue, it's a task. Combine object detection (bigger bboxes) with pdf_parser output (bboxes for every word or line). Filter the line/word output by the bigger boxes predicted by the vision models. You can leverage spatial correlation (sort by width, then height) to identify words in the same line, or a heading above a paragraph (a heading will be a one-liner, identified as a bbox with a bigger area than the others, plus height of heading < height of paragraph). Hope that helps 👯
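A rough sketch of the filtering and line-grouping idea, in plain Python with no layout-parser calls. The `Box` class, the function names, and the `tol` threshold are all made up for illustration, and it assumes both the detector and the PDF parser give you `(x1, y1, x2, y2)` boxes in the same coordinate system. Sorting by vertical position and then by horizontal position is just one reading of the "sort" step above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box; `text` is empty for detector regions (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float
    text: str = ""

    @property
    def height(self) -> float:
        return self.y2 - self.y1

    @property
    def center(self) -> tuple:
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)


def words_in_region(region: Box, words: list) -> list:
    """Keep the words whose center point falls inside the bigger region box."""
    kept = []
    for word in words:
        cx, cy = word.center
        if region.x1 <= cx <= region.x2 and region.y1 <= cy <= region.y2:
            kept.append(word)
    return kept


def group_into_lines(words: list, tol: float = 0.5) -> list:
    """Group words into lines: sort top to bottom, then start a new line
    whenever the vertical gap exceeds `tol` times the word height."""
    lines = []
    for word in sorted(words, key=lambda b: (b.center[1], b.x1)):
        previous = lines[-1][-1] if lines else None
        if previous and abs(word.center[1] - previous.center[1]) <= tol * word.height:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Order the words inside each line from left to right.
    return [sorted(line, key=lambda b: b.x1) for line in lines]


def region_text(region: Box, words: list) -> str:
    """Text of one detected region, one output line per grouped line."""
    lines = group_into_lines(words_in_region(region, words))
    return "\n".join(" ".join(w.text for w in line) for line in lines)


if __name__ == "__main__":
    # Toy data standing in for detector output and PDF-parser output.
    title_region = Box(0, 0, 200, 30)
    words = [
        Box(90, 6, 140, 26, "hits"),
        Box(10, 5, 80, 25, "Storm"),
        Box(300, 5, 330, 25, "Ad"),  # outside the region, gets filtered out
    ]
    print(region_text(title_region, words))
```

From there, the heading check could be something like "the region has exactly one line", combined with the height comparison against the paragraph below it. The center-point test is a simplification; an overlap ratio may work better when words straddle region borders.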

How can I extract titles, headers, photos, and the respective article information from a newspaper? #172