buda-base / git-to-dbs

Git to CouchDB / Fuseki / CouchDB for BDRC Lib App
Apache License 2.0

refining etext score #50

Open eroux opened 4 weeks ago

eroux commented 4 weeks ago

I think etext access is a very good score, but we should balance it with some kind of etext quality score. (I guess we should do that for scans too, but currently we can't really get any metrics for that.)

There are several types of etexts:

@roopeux how should I represent that?

roopeux commented 4 weeks ago

@eroux, the two ways to do this are either one integer field or many boolean fields.

Either scan_quality: 1-4

Or

A single integer field is probably the best choice for now, to avoid clutter. In our current script there is no difference.

Looking ahead to Learning to Rank with LambdaMART: LambdaMART should handle integer fields, although boolean fields may offer more control. With integers, the model may just decide that anything above 1.78 is good and the rest is bad.

So my suggestion is one integer field, which we might have to reindex as many boolean fields as part of fine-tuning LambdaMART in the far future.

BTW, the same applies to access fields too.
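
To make the integer-versus-boolean trade-off concrete, here is a minimal sketch of how one integer field could later be expanded into several boolean fields at reindex time. It assumes the 1-4 range mentioned above; the function name `expand_quality` and the output field names (`scan_quality_1` and so on) are hypothetical and not taken from the actual index mapping.

```python
def expand_quality(level, max_level=4, prefix="scan_quality"):
    """Expand one integer quality level into one boolean field per level.

    The field names produced here are hypothetical placeholders,
    not the names used in the real index.
    """
    if not 1 <= level <= max_level:
        raise ValueError(f"level must be in 1..{max_level}, got {level}")
    # Exactly one of the generated fields is True: the one matching `level`.
    return {f"{prefix}_{i}": level == i for i in range(1, max_level + 1)}


doc = {"scan_quality": 3}
doc.update(expand_quality(doc["scan_quality"]))
```

With this shape, the integer field can stay in the document alongside the boolean ones, so the current script keeps working while a ranking model gets one feature per level.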

eroux commented 4 weeks ago

OK, that's good! One difficulty is that for unprocessed OCR etexts I have a float value representing the quality. Should I keep it as a float value (between 0 and 1), or should I make some intervals like

1 = ocr quality in [0.0 - 0.9]
2 = ocr quality in [0.9 - 0.95]
3 = ocr quality in [0.95 - 1]

?
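
For reference, the interval option could be sketched roughly as follows. This is only an illustration: the function name `ocr_quality_bucket` is hypothetical, and because the intervals above share their endpoints, the sketch assumes that a boundary value falls into the higher bucket (0.9 maps to 2, 0.95 maps to 3).

```python
from bisect import bisect_right

# Lower bounds of buckets 2 and 3, taken from the intervals above.
BOUNDARIES = [0.9, 0.95]


def ocr_quality_bucket(quality):
    """Map an OCR quality float in [0, 1] to an integer bucket 1..3.

    Assumption: a value exactly on a boundary goes to the higher bucket.
    """
    if not 0.0 <= quality <= 1.0:
        raise ValueError(f"quality must be in [0, 1], got {quality}")
    # bisect_right counts how many boundaries are <= quality.
    return bisect_right(BOUNDARIES, quality) + 1
```

Keeping the raw float instead avoids committing to these thresholds up front; the bucketing can always be derived later from the stored value.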

eroux commented 3 weeks ago

implementing a new field etext_quality, a half_float, with the following possible values: