Structured full-text cleaning protocol under R entities (in Detail)

davidzbiral commented 9 months ago

Similarly to storing data collection protocols under T, we need to store structured information on full-texts under the Rs which represent those full-texts. (BTW, technically the OCR-ed PDF is a different R than our cleaned full-text; no worries there, it is ontologically correct.)

Technically, similarly to data collection protocols under T (but with a different set of fields), it will be a section under Detail of that R, partly with suggesters, partly with dropdowns or switches, partly with text fields (i.e., semi-structured). It will appear only under those Rs which have a full-text attached.

The set of variables will roughly follow our Corpus progress gsheet (David to develop and finalise), and when importing full-texts, we will also fill them in from that sheet (David to add instruction rows to the Corpus progress gsheet).

It should reflect three different situations:

Cleaning of an OCR-ed edition.
Cleaning of HTR output (devise the set of vars so that it is, if possible, the same for both OCR and HTR).
Digital-born edition in progress.

For DZ: When developing, don't forget to mark whether the full-text is in full relation to the T or not - think about the ontology based on what already exists. I.e. whether the edition is partial. Now done under 2nd order prop of the "edition" prop in a T -> R relation in DDB1.

adammertel commented 7 months ago

@davidzbiral does this involve any warnings and validations on the Statement (or any other entity class) level, similar to #1924?

davidzbiral commented 7 months ago

@adammertel No, it is completely separated from that, and is not associated with any warnings (they would not even be possible, because we don't expect the app to be able to measure full-text quality). It is only evidence concerning the quality of the full-text: e.g. whether notes were removed, whether full text was reasonably checked etc.

Tomáš expressed a very sound idea some months ago that these things should be kept in the database itself rather than an external google sheet. I suggest the following text fields:

Non-main text removal
Editorial brackets removal
Dehyphenation
Text cleaning note
Character Error Rate estimate
Character Error Rate measurement
Character Error Rate note
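
As a rough sketch of how the seven suggested text fields could be held together, assuming a plain record of strings: the interface name, the property names, and the `emptyProtocol` helper below are hypothetical and are not taken from the InkVisitor code base. The thread does not say how the Character Error Rate would be measured; the helper uses the common definition (character-level edit distance divided by the length of the reference text), and such a measurement would be made outside the app.

```typescript
// Hypothetical shape for the suggested fields; names are illustrative only.
interface FullTextCleaningProtocol {
  nonMainTextRemoval: string;
  editorialBracketsRemoval: string;
  dehyphenation: string;
  textCleaningNote: string;
  characterErrorRateEstimate: string;
  characterErrorRateMeasurement: string;
  characterErrorRateNote: string;
}

// All fields are free text, so an unfilled protocol is just empty strings.
function emptyProtocol(): FullTextCleaningProtocol {
  return {
    nonMainTextRemoval: "",
    editorialBracketsRemoval: "",
    dehyphenation: "",
    textCleaningNote: "",
    characterErrorRateEstimate: "",
    characterErrorRateMeasurement: "",
    characterErrorRateNote: "",
  };
}

// Assumed definition: CER = Levenshtein distance / reference length.
// `reference` is the trusted transcription, `hypothesis` the OCR/HTR output.
function characterErrorRate(reference: string, hypothesis: string): number {
  const ref = Array.from(reference); // split by code point, not UTF-16 unit
  const hyp = Array.from(hypothesis);
  if (ref.length === 0) {
    throw new RangeError("reference text must not be empty");
  }
  // Row-by-row dynamic programming over the edit-distance table.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

A measured value could then be written into `characterErrorRateMeasurement` as text, for example `characterErrorRate(checkedSample, ocrSample).toFixed(3)`, where both sample variables are placeholders.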

If possible, display only under those Rs which have a full-text linked. Upon full-text unlinking, warn that this info will be lost. (If full-text unlinking should be possible at all - I wonder how you think about it.)
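
The display rule and the unlinking warning could be sketched roughly as follows. This is only an illustration under assumptions: `ResourceLike`, `fullTextDocumentId`, `cleaningProtocol`, and both function names are hypothetical and do not describe the actual InkVisitor data model.

```typescript
// Hypothetical minimal view of an R entity; not the real InkVisitor type.
interface ResourceLike {
  id: string;
  fullTextDocumentId?: string; // set only when a full-text is linked
  cleaningProtocol?: Record<string, string>; // the text fields, keyed by name
}

// Show the section only under Rs which have a full-text linked.
function showCleaningProtocol(resource: ResourceLike): boolean {
  return resource.fullTextDocumentId !== undefined;
}

// Return a warning if unlinking would discard filled-in values, else null.
function unlinkWarning(resource: ResourceLike): string | null {
  const values = Object.values(resource.cleaningProtocol ?? {});
  const hasContent = values.some((value) => value.trim() !== "");
  return hasContent
    ? "Unlinking the full-text will delete the text cleaning information."
    : null;
}
```

Under this sketch a caller would ask for confirmation whenever `unlinkWarning` returns a string, and unlink silently otherwise.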

adammertel commented 7 months ago

Thanks, created #1981 to separate the work needed inside the code.

adammertel commented 7 months ago

@davidzbiral just to make sure, are the values associated more with the Resource or with the full-text document itself? To put it more specifically: when the full-text document is unassigned from the Resource entity, do these values "move" with the document or "stay" with the Resource?
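
The two options in this question can be made concrete with a small sketch. All types and names here (`LinkedState`, `UnlinkedState`, `Owner`, `unlinkFullText`) are hypothetical and only illustrate the difference; they say nothing about how InkVisitor stores Resources or documents.

```typescript
type Protocol = Record<string, string>;

// Hypothetical state while a full-text document is assigned to a Resource.
interface LinkedState {
  resourceId: string;
  documentId: string;
  protocol: Protocol;
}

// Hypothetical state after unassigning: each side may or may not carry values.
interface UnlinkedState {
  resource: { id: string; protocol?: Protocol };
  document: { id: string; protocol?: Protocol };
}

// "resource": values stay with the Resource; "document": values move with the document.
type Owner = "resource" | "document";

function unlinkFullText(state: LinkedState, owner: Owner): UnlinkedState {
  return owner === "resource"
    ? {
        resource: { id: state.resourceId, protocol: state.protocol },
        document: { id: state.documentId },
      }
    : {
        resource: { id: state.resourceId },
        document: { id: state.documentId, protocol: state.protocol },
      };
}
```

Either way the values survive the unassignment; the options differ only in which side carries them afterwards.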

davidzbiral commented 7 months ago

@adammertel It's true that they belong more to the full-text itself, so they should move with the document - I just don't see how to do it in the current data model. It is also important to say that under normal circumstances, nothing such as unlinking a full-text will ever happen, because the R is the full-text. E.g., a printed book, a PDF scan of it, and an OCR of it are three different Rs. So unlinking the full-text will only happen in cases of misclicking when choosing a file, or an erroneous import of a bad version. That is, I am not sure whether unlinking should be an easy operation offered all the time when I am on an R. What do you suggest?

Honestly, I begin to see some limitations here, because this will not conveniently give us all the vars we need to record on text cleaning, and I don't think it will lead where I hoped. There are too many vars that we want to edit, and InkV is not as flexible as a table to fit the variable teams' needs for recording text cleaning progress. The set of fields would be too DISSINET-specific, and even for DISSINET, too incomplete.

Let's close this issue - we will not implement, we will continue doing the evidence in a gsheet.

DISSINET / InkVisitor

Structured full-text cleaning protocol under R entities (in Detail) #1932