davanstrien opened 3 years ago
From #2355
I assume we should annotate that these are computationally produced.
`lwm_publication_issue_item_title` is missing much of the time (I am checking how often for HMD newspapers in ticket #10) and is often a garbled mess when it is present. If I recall correctly, this traces back to the source XML, where the title is often just the first sentence (or first number of characters, or first line) of the 'article' or text segment. I seem to recall @kasparvonbeelen has been here before and found some instances where the title had corrected OCR compared to the first sentence of the 'article' it came from.
Maybe we need to raise this with the FMP/BL people to see how consistently the titles are corrected. From the HMD samples I've looked at, I'm doubtful they have been human-corrected (and if they have, not much care has been taken with this...), but maybe this is something they have done with other titles, and/or only for selected numbers/issues, or only for titles in the main BNA 🤷‍♂️
Maybe we could get some of these corrections from FMP? https://www.britishnewspaperarchive.co.uk/help-faq/why-should-i-correct-the-text#1 via their end users, who are able to make corrections on the BNA platform. If we do get these, it would be useful to try to capture this in the metadata.
@davanstrien from what I remember, some titles were transcribed (and the original OCR repeated at the start of the article). However, this seemed limited to a subsection of FMP newspapers. It'd be good to know how many articles have been corrected. Probably FMP should have this information?
Yes, I'm hoping someone is able to ask them to confirm. It would also be interesting to know which titles get corrected most (when the OCR is really bad, or when the content of the article is really good 🤔)
If we don't get answers, we could test our theory computationally by comparing the title in the metadata against the first sentence of the article and seeing whether there are any patterns. That might help us understand more of the qualities of the datasets we're working on. It may also give us some training data for OCR correction, if needed.
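A minimal sketch of that comparison, assuming a simple list-of-dicts structure with hypothetical `title` and `text` fields (the real dataset schema may differ):

```python
# Sketch: flag articles whose metadata title is (or is not) just the
# opening characters of the OCR'd text. Field names "title" and "text"
# are hypothetical; adjust to the real dataset schema.
from difflib import SequenceMatcher


def title_similarity(title: str, text: str) -> float:
    """0-1 similarity between the title and the same-length prefix of the text."""
    prefix = text[: len(title)].lower()
    return SequenceMatcher(None, title.lower(), prefix).ratio()


def looks_corrected(title: str, text: str, threshold: float = 0.9) -> bool:
    """Heuristic: a title that diverges from the article opening may have
    been human-corrected (or drawn from another source)."""
    return title_similarity(title, text) < threshold
```

Running this over a sample and plotting the similarity distribution should show whether titles cluster at 1.0 (pure copies of the opening text) or spread lower (possible evidence of correction).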
I believe this is going to be asked as part of https://github.com/alan-turing-institute/Living-with-Machines/issues/2414