refining etext score - Githubissues

buda-base / git-to-dbs

Git to CouchDB / Fuseki / CouchDB for BDRC Lib App

Apache License 2.0


refining etext score #50

Open eroux opened 4 weeks ago

eroux commented 4 weeks ago

I think etext access is a very good score, but we should balance it with some kind of etext quality score. (I guess we should do the same for scans, but currently we can't really get metrics for that.)

There are several types of etexts:

manual input with pages matching the images (the best type)
manual input with pages not matching the images (less convenient but etext is good quality)
OCR that has been post-processed and is generally good quality
OCR with no post-processing, for which we have a quality note between 0 and 1

@roopeux how should I represent that?

roopeux commented 4 weeks ago

@eroux, the two ways to do this are either one integer field or many boolean fields.

Either a single integer field, scan_quality: 1-4, or one boolean field per type:

is_best_etext: true
is_good_etext: false
is_good_ocr_etext: false
is_unprocessed_ocr_etext: false

A single integer field is probably the best choice for now, to avoid clutter. In our current script it makes no difference.

Looking ahead to Learning to Rank with LambdaMART: it should handle integer fields, although booleans may offer more control. With an integer, the model may simply pick a split point and decide that anything above, say, 1.78 is good and the rest is bad.

So my suggestion is one integer field, which we might have to reindex as many boolean fields as a part of fine-tuning LambdaMART in the far future.
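The reindexing step mentioned above could look roughly like the sketch below. It assumes that the integer levels 1-4 map onto the four boolean field names from the earlier comment in ascending order of quality; that ordering, and the helper name expand_quality, are illustrative assumptions, not something the thread specifies.

```python
from typing import Dict

# Assumed mapping from the integer level to the boolean field names listed
# earlier in the thread; the ordering (4 = best) is a guess for illustration.
LEVEL_TO_FLAG = {
    4: "is_best_etext",
    3: "is_good_etext",
    2: "is_good_ocr_etext",
    1: "is_unprocessed_ocr_etext",
}


def expand_quality(level: int) -> Dict[str, bool]:
    """Turn one integer quality level into one boolean per etext type."""
    if level not in LEVEL_TO_FLAG:
        raise ValueError(f"unknown quality level: {level!r}")
    # Exactly one flag is True: the one that corresponds to the given level.
    return {flag: (lvl == level) for lvl, flag in LEVEL_TO_FLAG.items()}
```

Since the booleans can be derived from the integer like this, starting with the single field does not close off the boolean option later.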

BTW, the same applies to access fields too.

eroux commented 4 weeks ago

OK, that's good! One difficulty is that for unprocessed OCR etexts I have a float value representing the quality. Should I keep it as a float (between 0 and 1), or should I make some intervals like:

1 = ocr quality in [0.0 - 0.9]
2 = ocr quality in [0.9 - 0.95]
3 = ocr quality in [0.95 - 1]
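For comparison, the interval option could be sketched as follows. The intervals as written share their boundary values (0.9 and 0.95), so this sketch assumes half-open buckets where a boundary value goes into the higher bucket; the function name ocr_bucket is hypothetical.

```python
from bisect import bisect_right

# Lower bounds of buckets 2 and 3, taken from the intervals proposed above.
BOUNDARIES = [0.9, 0.95]


def ocr_bucket(quality: float) -> int:
    """Map an OCR quality in [0, 1] to bucket 1, 2 or 3."""
    if not 0.0 <= quality <= 1.0:
        raise ValueError(f"quality must be in [0, 1], got {quality!r}")
    # bisect_right counts how many boundaries are <= quality, so a value
    # exactly on a boundary lands in the higher bucket (assumption).
    return bisect_right(BOUNDARIES, quality) + 1
```

Bucketing discards the ordering within each interval (0.10 and 0.89 both become 1), which is one argument for keeping the raw float.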

eroux commented 3 weeks ago

implementing a new field etext_quality, a half_float, with the following possible values:

4.0 = manual input, aligned with pages
3.0 = manual input, not aligned with pages
2.0 = reviewed OCR, aligned with pages
between 0.0 and 1.0 = unreviewed OCR; the value is the median confidence index returned by the OCR system
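A minimal sketch of how such a value might be computed at indexing time. The type labels ("manual_aligned" and so on) and both function names are made up for illustration. The round-trip helper assumes that half_float is stored as an IEEE 754 16-bit float, as in Elasticsearch-style mappings.

```python
import struct
from typing import Optional

# Fixed scores from the list above; the string labels are hypothetical.
FIXED_SCORES = {
    "manual_aligned": 4.0,
    "manual_unaligned": 3.0,
    "reviewed_ocr": 2.0,
}


def etext_quality(kind: str, ocr_confidence: Optional[float] = None) -> float:
    """Return the etext_quality value for one etext."""
    if kind in FIXED_SCORES:
        return FIXED_SCORES[kind]
    if kind == "raw_ocr":
        if ocr_confidence is None or not 0.0 <= ocr_confidence <= 1.0:
            raise ValueError("raw OCR needs a confidence in [0, 1]")
        # Unreviewed OCR keeps the OCR system's median confidence as its score.
        return ocr_confidence
    raise ValueError(f"unknown etext kind: {kind!r}")


def as_half_float(value: float) -> float:
    """Round-trip a value through 16-bit float to see what storage retains."""
    return struct.unpack("<e", struct.pack("<e", value))[0]
```

With this scheme every unreviewed OCR etext scores below every reviewed or manual one, and sorting by the single field still orders OCR etexts by confidence. The 16-bit rounding error for values between 0.5 and 1.0 is below 0.00025, which should be negligible for ranking.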