DDMAL / Rodan

:dragon_face: A web-based workflow engine.
https://rodan2.simssa.ca/

[Text Alignment] New data from MS73 is low quality #983

Open MrMondrian opened 1 year ago

MrMondrian commented 1 year ago

Recently we received an OCR dataset for MS73 from a past lab member. Unfortunately, this data is of slightly lower quality than the data from Salzinnes and St Gallen. First, the decorative capital letters are labelled with their letter value rather than with ~. Second, many of the images are blurred, off-centre, or oddly scaled. Here are some examples.

00_046_text_layer bin Alleluia

05_257_text_layer bin confidentem Perpetua Gloria

02_045_text_layer bin rem et

To integrate this data into new models, the capital letters need to be re-annotated, and the messy images may need to be cleaned.

MrMondrian commented 1 year ago

Another issue with this data is that MS73 doesn't group syllables into words but instead places each syllable below its corresponding neume. This format differs from Salzinnes and St Gallen, which could confuse the model.

Salzinnes:

01000a bin re et in insulus quae procul sunt dicite

St Gallen:

01000a bin āO thoma didime per xpicitum quem meruisti tangere te precibus roga

MS73:

03_045_text_layer bin martirium

timothydereuse commented 1 year ago

Salzinnes / Einsie also separate syllables when necessary. This example is from folio 002v of Salzinnes:

image Transcript: dei potentiam venientem et nebulam totam terram tegentem ite obviam

In practice it's not been a huge problem in earlier models. Maybe it happens more in MS73 than in others, but the character-level segmentation of Calamari works pretty well regardless.

Possible way to proceed: if we know which pages of MS73 we've got in the training data, we can re-run them through the current text strip segmentation process and re-associate them with each transcript, which shouldn't be that hard if it identifies the lines correctly. We can also automatically replace any capital letter in the transcripts with a ~.
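
For the automatic capital-letter replacement, a minimal sketch in Python might look like the following. The function name `mask_capitals` and the `placeholder` parameter are hypothetical and not part of Rodan or Calamari. It assumes the transcripts are plain strings and that every uppercase character should become ~, which is the simple rule proposed here; it does not try to tell decorative capitals apart from ordinary ones.

```python
def mask_capitals(transcript: str, placeholder: str = "~") -> str:
    """Replace every uppercase character in a transcript with a placeholder.

    Hypothetical helper: assumes all capitals in the MS73 transcripts should
    be relabelled as ~ to match the Salzinnes / St Gallen convention.
    """
    # str.isupper() is true for uppercase letters, including accented ones;
    # lowercase letters, spaces, and existing ~ characters pass through unchanged.
    return "".join(placeholder if ch.isupper() else ch for ch in transcript)


if __name__ == "__main__":
    # Transcripts taken from the MS73 examples earlier in this thread.
    for text in ["Alleluia", "confidentem Perpetua Gloria", "rem et"]:
        print(mask_capitals(text))
```

Applied to the MS73 examples above, "Alleluia" would become "~lleluia" and "rem et" would be left unchanged. If some capitals in MS73 turn out not to be decorative, this blanket rule would over-mask them, so a manual spot check is probably still worthwhile.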