LAHTeR / htr-quality-classifier

Detect quality of (digitized) text.
GNU General Public License v3.0

Improve detection on (almost) empty documents #3

Closed by carschno 1 year ago

carschno commented 1 year ago

What I struggle with a bit when using the output of the Lahter tool is (almost) blank pages. If you look at the output I made yesterday (which we briefly looked at during our discussion; the sheet where I compare the quality of the Transkribus Transformers model with that of Loghi for a set of 502 random scans), you can see that (almost) blank pages get a ‘1’, ‘2’ or ‘3’ seemingly at random. This is tricky.

Many of these pages are indeed empty, and an empty PageXML should ideally get a ‘1’. So I would suggest that a PageXML with no baselines should get either a ‘1’ or no score. Other (almost) empty pages have some spots that are sometimes recognised as characters (loose characters, short baselines). These should actually get a ‘3’. Still others have some text (e.g. an address or salutation of a letter) that is rotated by 90 degrees. In the PageXML this shows up as single characters on short baselines, and these pages should also get a ‘3’. I don’t know if you can do anything with this (maybe include average baseline length as a feature for the classifier?), but it seemed good to mention anyway.

I am sending along the Lahter output I generated yesterday. An interesting example is NL-HaNA_1.04.02_1277_0602. Transkribus (correctly) recognises that page as empty, so its output should get a ‘1’ instead of a ‘3’. Loghi sees things there, so its output should get a ‘3’ instead of the current ‘2’. The same applies, for example, to NL-HaNA_1.04.02_8904_0004 and NL-HaNA_1.04.02_10007_0229 (here Loghi gets a ‘1’ because it hallucinates a character that is in the dictionaries). NL-HaNA_1.04.02_1396_1322 is an example of a page turned 90 degrees (and anything but empty), which could perhaps be classified as ‘3’ based on its short baselines and/or high number of small regions?
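To make the "average baseline length" idea concrete, here is a minimal sketch of how such a feature could be computed from a PageXML document. It is not Lahter's actual code: the function names `baseline_lengths` and `mean_baseline_length` are hypothetical, and it assumes that each text line carries a `Baseline` element whose `points` attribute holds space-separated `x,y` pairs.

```python
import math
import xml.etree.ElementTree as ET
from typing import List, Optional


def baseline_lengths(pagexml: str) -> List[float]:
    """Return the polyline length (in pixels) of every Baseline element."""
    root = ET.fromstring(pagexml)
    lengths = []
    for element in root.iter():
        # Compare the local tag name only, so any PAGE namespace version matches.
        if element.tag.rsplit("}", 1)[-1] != "Baseline":
            continue
        points = [
            tuple(float(value) for value in pair.split(","))
            for pair in element.get("points", "").split()
        ]
        # Sum the distances between consecutive points of the polyline.
        lengths.append(sum(math.dist(a, b) for a, b in zip(points, points[1:])))
    return lengths


def mean_baseline_length(pagexml: str) -> Optional[float]:
    """Mean baseline length, or None if the page has no baselines at all."""
    lengths = baseline_lengths(pagexml)
    return sum(lengths) / len(lengths) if lengths else None


# Made-up sample page: one long baseline and one very short one.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="1000" imageHeight="1000">
    <TextRegion id="r1">
      <TextLine id="l1"><Baseline points="0,100 300,100 300,500"/></TextLine>
      <TextLine id="l2"><Baseline points="10,10 13,14"/></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

print(baseline_lengths(SAMPLE))
print(mean_baseline_length(SAMPLE))
```

A low mean (or a high share of very short baselines) could then be one signal for pages with stray marks or rotated text. Any threshold would have to be tuned on real scans.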

Excel Spreadsheet 230724_transformers_vs_loghi.xlsx

kintopp commented 1 year ago

Perhaps a blank (no score) would be best for pages with no detected baselines. One could also imagine labelling these B for blank, for example, but presumably there will be some edge cases where the page isn't blank but no baselines were detected. And a blank (in the sense of 'don't know' / 'can't tell') would also help keep Lahter from gradually becoming a layout classification tool (as opposed to building a separate tool for that purpose).
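As a rough illustration of the "no score" option, the check could sit in front of the classifier, as sketched below. The names `count_baselines` and `score_or_blank` are hypothetical, and `classify` is a stand-in for whatever function produces the 1/2/3 score, not Lahter's actual API.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional


def count_baselines(pagexml: str) -> int:
    """Count Baseline elements, ignoring the XML namespace."""
    root = ET.fromstring(pagexml)
    return sum(
        1 for element in root.iter()
        if element.tag.rsplit("}", 1)[-1] == "Baseline"
    )


def score_or_blank(pagexml: str, classify: Callable[[str], int]) -> Optional[int]:
    """Return None ('can't tell') when no baselines were detected.

    `classify` is a placeholder for the real quality classifier; it is
    only called when there is at least one baseline to score.
    """
    if count_baselines(pagexml) == 0:
        return None
    return classify(pagexml)
```

Returning `None` rather than a dedicated label keeps the output a statement about text quality only, and leaves it to the caller to decide how to render a missing score (for example as an empty cell in a spreadsheet).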