Open MrMondrian opened 1 year ago
Another issue with these data is that MS73 doesn't group syllables into words but instead places each syllable below its corresponding neume. This format differs from Salzinnes and St Gallen, which could confuse the model.
Salzinnes:
re et in insulus quae procul sunt dicite
St Gallen: āO thoma didime per xpicitum quem meruisti tangere te precibus roga
MS73: martirium
Salzinnes / Einsie also display the same behavior of separating syllables when necessary. This is from folio 002v of Salzinnes:
Transcript: dei potentiam venientem et nebulam totam terram tegentem ite obviam
In practice it's not been a huge problem in earlier models. Maybe it happens more in MS73 than in others, but the character-level segmentation of Calamari works pretty well regardless.
Possible way to proceed: if we know which pages of MS73 we've got in the training data, we can re-run them through the current text strip segmentation process and re-associate them with each transcript, which shouldn't be that hard if it identifies the lines correctly. We can also automatically replace any capital letter in the transcripts with a ~.
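A minimal sketch of the capital-letter replacement suggested above, assuming every uppercase character in a transcript marks a decorative capital (the comment says "any capital letter", but this may over-match if some capitals are ordinary text). The function name `mask_capitals` is hypothetical and not part of any existing pipeline.

```python
def mask_capitals(transcript: str, placeholder: str = "~") -> str:
    """Replace every uppercase character in a transcript with the placeholder.

    Assumption: all uppercase characters correspond to decorative capitals
    that the other datasets label with "~".
    """
    return "".join(placeholder if ch.isupper() else ch for ch in transcript)


if __name__ == "__main__":
    # Sample transcripts taken from the MS73 examples in this issue.
    for line in ["Alleluia", "confidentem Perpetua Gloria", "rem et"]:
        print(mask_capitals(line))
```

Lowercase and accented lowercase characters are left untouched, so only the transcripts that actually contain capitals change.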
Recently we received an OCR dataset for MS73 from a past lab member. Unfortunately, this data is of slightly lower quality than the data from Salzinnes and St Gallen. Firstly, the decorative capital letters are labelled with their letter value as opposed to ~. Also, many of the images are blurred, not centred, or scaled weirdly. Here are some examples.
Alleluia
confidentem Perpetua Gloria
rem et
To integrate this data into new models, the capital letters need to be re-annotated, and the messy images may need to be cleaned up.