davanstrien opened 3 years ago
From #2355
I assume we should annotate that these are computationally produced.
`lwm_publication_issue_item_title` is missing much of the time (I am checking how often for HMD newspapers in ticket #10) and is often a garbled mess when it is present. If I recall correctly, this traces back to the source XML, where the title is often just the first sentence (or first number of characters, or first line) of the 'article' or text segment. I seem to recall @kasparvonbeelen has been here before and found some instances where the title had corrected OCR compared to the first sentence of the 'article' it came from.
Maybe we need to raise this with the FMP/BL people to see how consistently the titles are corrected. From the HMD samples I've looked at, I'm doubtful they have been human-corrected (and if they have, not much care has been taken with this...), but maybe this is something they have done with other titles, and/or only for selected numbers/issues, or only for titles in the main BNA 🤷‍♂️
Maybe we could get some of these corrections from FMP? https://www.britishnewspaperarchive.co.uk/help-faq/why-should-i-correct-the-text#1 via their end users, who are able to make corrections on the BNA platform. If we do get these, it would be useful to try to capture this in the metadata.
@davanstrien from what I remember, some titles were transcribed (and the original OCR repeated at the start of the article). However, this seemed limited to a subsection of FMP newspapers. It'd be good to know how many articles have been corrected. Probably FMP should have this information?
Yes, I'm hoping someone is able to ask them to confirm. It would also be interesting to know which titles get corrected most (when the OCR is really bad, or when the content of the article is really good 🤔)
If we don't get answers, we could test our theory computationally by comparing the title in the metadata against the first sentence of the article and seeing whether there are any patterns. That might help us understand more of the qualities of the datasets we're working on. It may also give us some training data for OCR correction, if needed.
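A minimal sketch of that comparison, assuming a simple list-of-dicts structure with hypothetical `title` and `text` fields (the real dataset schema may differ):

```python
# Sketch: flag articles whose metadata title is (or is not) just the
# opening characters of the OCR'd text. Field names "title" and "text"
# are hypothetical; adjust to the real dataset schema.
from difflib import SequenceMatcher


def title_similarity(title: str, text: str) -> float:
    """0-1 similarity between the title and the same-length prefix of the text."""
    prefix = text[: len(title)].lower()
    return SequenceMatcher(None, title.lower(), prefix).ratio()


def looks_corrected(title: str, text: str, threshold: float = 0.9) -> bool:
    """Heuristic: a title that diverges from the article opening may have
    been human-corrected (or drawn from another source)."""
    return title_similarity(title, text) < threshold
```

Running this over a sample and plotting the similarity distribution should show whether titles cluster at 1.0 (pure copies of the opening text) or spread lower (possible evidence of correction).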
I believe this is going to be asked as part of https://github.com/alan-turing-institute/Living-with-Machines/issues/2414