Identify article titles

OlivierBinette commented 3 years ago

Determine a baseline approach for identifying article titles using the Tesseract output (#1) and the extracted features (#2).

BrandonBae commented 3 years ago

I have recently started building on Neel's hOCR extraction script (all my work so far is in the hocrTitleExtraction branch). So far I have written a simple Python script that adds a column to the output CSV stating whether each word is part of a title. For now, I simply observed in Neel's test CSV file that "title" words had a line height above 100 and used that as my threshold. I have also started a script that goes through the output CSVs and creates a .txt file for each article. My plan is to iterate through the CSV file, create a .txt file whenever the iterator reaches a "title" word, and write all of the following "non-title" words into that file until it hits another "title" word. This method is extremely basic, though, and I am open to any ideas, suggestions, or criticisms.
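A minimal sketch of the threshold rule described above, for illustration only. The column names `word` and `line_height` and the helper `flag_titles` are assumptions, not the names used in the actual script or CSV; only the "line height above 100" rule comes from the comment.

```python
import csv
import io

# Threshold observed in the test CSV; it may not generalize to other scans.
TITLE_LINE_HEIGHT = 100


def flag_titles(rows, threshold=TITLE_LINE_HEIGHT):
    """Return copies of the rows with an added boolean 'title' field.

    Assumes each row has a 'line_height' value (hypothetical column name).
    """
    flagged = []
    for row in rows:
        new_row = dict(row)
        new_row["title"] = float(row["line_height"]) > threshold
        flagged.append(new_row)
    return flagged


# Tiny in-memory stand-in for the extracted CSV (made-up values).
sample_csv = """word,line_height
SENATE,120
VOTES,118
The,40
senate,42
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
flagged = flag_titles(rows)
for r in flagged:
    print(r["word"], r["title"])
```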

OlivierBinette commented 3 years ago

I like this idea. Instead of writing to separate text files, however, you could try the following:

Given the csv file with the "title" column, create a new column named "partID" as follows:

Initialize the vector partID and set id = 0
Iterate through rows i:
- Set partID[i] = id
- If row i+1 corresponds to a title, then id += 1

This way each row is assigned an identifier of the article part to which it corresponds.
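The steps above could be sketched in Python roughly as follows. The function name `assign_part_ids` and the `group_title_words` flag are hypothetical. With the default setting the code follows the rule literally, so every title word starts a new part; the optional flag is an assumption on my part that only advances the id at the first word of a run of title words, which may be closer to what is wanted for multi-word titles.

```python
def assign_part_ids(is_title, group_title_words=False):
    """Assign an article-part id to each row.

    is_title: list of booleans, one per CSV row, True for "title" words.
    group_title_words: hypothetical option; if True, consecutive title
    words share one id instead of each starting a new part.
    """
    part_ids = []
    current_id = 0
    for i in range(len(is_title)):
        part_ids.append(current_id)
        next_is_title = i + 1 < len(is_title) and is_title[i + 1]
        if next_is_title:
            # Literal rule: bump the id whenever the next row is a title.
            # Grouped variant: bump only when leaving non-title text.
            if not group_title_words or not is_title[i]:
                current_id += 1
    return part_ids


flags = [True, True, False, False, True, False]
literal_ids = assign_part_ids(flags)
grouped_ids = assign_part_ids(flags, group_title_words=True)
print(literal_ids)
print(grouped_ids)
```

The resulting list can then be written back to the CSV as the "partID" column.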

neel216 commented 3 years ago

I created a visualizer for the title identifier Brandon built in https://github.com/Duke-Chronicle-Project/article-extraction/issues/5#issuecomment-787617841 and https://github.com/Duke-Chronicle-Project/article-extraction/issues/6#issuecomment-796279774, and ran it on our test scan. I then made a quick change to Brandon's identifier to filter out empty words (images and misinterpreted whitespace) and ran it again. The change gets rid of the obvious false positives, but we are clearly still missing several titles.

I took a quick look at the hocr-js visualization of the hOCR scan we are testing on. The titles that my visualizer does not highlight are highlighted by hocr-js, under what I think is a paragraph section, and the words themselves are scanned correctly. This leads me to believe that we should judge titles by a different metric, such as paragraph height, because line height alone is clearly not sufficient.

I'll look into how to calculate font size so that we have a relative measurement that generalizes more easily across different scans, and see if I can apply that to the title identifier.
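As a rough illustration of the two ideas in this comment, here is a sketch that drops empty words and then flags titles relative to the page's typical line height instead of a fixed value. This is not the implemented identifier: the column names `word` and `line_height`, the helper names, and the choice of "more than twice the median line height" are all assumptions made for the example, and it does not address the paragraph-level grouping seen in hocr-js.

```python
from statistics import median


def drop_empty_words(rows):
    """Remove rows whose word is empty or whitespace only."""
    return [r for r in rows if r["word"].strip()]


def flag_titles_relative(rows, ratio=2.0):
    """Flag words whose line height exceeds ratio * median line height.

    The ratio is a made-up starting point and would need tuning on real scans.
    """
    if not rows:
        return []
    typical = median(float(r["line_height"]) for r in rows)
    return [
        dict(r, title=float(r["line_height"]) > ratio * typical)
        for r in rows
    ]


# Made-up rows standing in for the extracted CSV.
rows = [
    {"word": "SENATE", "line_height": "120"},
    {"word": "", "line_height": "300"},      # e.g. an image region
    {"word": "  ", "line_height": "150"},    # misread whitespace
    {"word": "The", "line_height": "40"},
    {"word": "senate", "line_height": "42"},
    {"word": "met", "line_height": "41"},
]

cleaned = drop_empty_words(rows)
result = flag_titles_relative(cleaned)
for r in result:
    print(r["word"], r["title"])
```

Filtering before computing the median keeps the empty regions from skewing the page's typical line height.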
