Open davanstrien opened 2 years ago
I suggest leaving this as a candidate dataset until we have worked out the best approach. Tagging others who have been discussing this: @bmschmidt @stefan-it
Currently, we have a few options for accessing the data:
Since this is a large corpus, and there may also be a little bit of concern about things being removed from the Europeana website, I think it makes sense to upload a version of the data that has been made more amenable for computational research. @bmschmidt has some code for doing this we could use as a starting point.
There are then a few options we need to decide on:
Could you explain the notion of a "loading script"? I don't think I understand how the huggingface model--which seems to be basically organized hierarchically--works with something like this.
Especially around what seems like the fundamental question which is file ordering. Like I think it makes sense to have files be individual newspapers (or chronological subsets of newspapers), but that means there's waste if you try to subset by date of publication; and vice versa.
Could you explain the notion of a "loading script"? I don't think I understand how the huggingface model--which seems to be basically organized hierarchically--works with something like this.
This depends a bit on how we decide to distribute the files. But one option is to have a dataset_script, which allows some control over what parts of the data are loaded. For example, if we have a bunch of files with a naming structure like TITLE_ID_YEAR.arrow, a script could be used to only load the requested parts. This means in practice when someone downloads those files, they only need to download the files they actually plan to use. It is also possible to do this filtering once files are downloaded, but obviously this only saves some processing time/space since the data/files still had to be downloaded first.
Especially around what seems like the fundamental question which is file ordering. Like I think it makes sense to have files be individual newspapers (or chronological subsets of newspapers), but that means there's waste if you try to subset by date of publication; and vice versa.
Perhaps a compromise between granularity and keeping the total number of files reasonable would be to organize each title into decade (or some other time span) buckets. Something like:
TITLE_A_1850_1859.arrow
TITLE_A_1860_1869.arrow
TITLE_A_1870_1879.arrow
TITLE_B_1850_1859.arrow
Then the dataset script can filter which files to load/download. I'll try and dig out some example scripts that have this kind of functionality and link them here. Happy to hear other suggestions for structuring things too.
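As a rough illustration of the file-selection step, here is a minimal sketch assuming the TITLE_START_END.arrow naming shown above. The function name select_files and the regular expression are hypothetical and not taken from any existing loading script.

```python
import re

# Hypothetical pattern for names like TITLE_A_1850_1859.arrow
BUCKET_RE = re.compile(r"^(?P<title>.+)_(?P<start>\d{4})_(?P<end>\d{4})\.arrow$")


def select_files(filenames, titles=None, min_year=None, max_year=None):
    """Return the files whose title and year bucket match the request."""
    selected = []
    for name in filenames:
        match = BUCKET_RE.match(name)
        if match is None:
            continue  # skip files that don't follow the naming scheme
        start, end = int(match["start"]), int(match["end"])
        if titles is not None and match["title"] not in titles:
            continue
        if min_year is not None and end < min_year:
            continue  # bucket ends before the requested range
        if max_year is not None and start > max_year:
            continue  # bucket starts after the requested range
        selected.append(name)
    return selected


files = [
    "TITLE_A_1850_1859.arrow",
    "TITLE_A_1860_1869.arrow",
    "TITLE_A_1870_1879.arrow",
    "TITLE_B_1850_1859.arrow",
]
print(select_files(files, titles={"TITLE_A"}, min_year=1865))
```

Only the files returned by a function like this would need to be downloaded, which is the point of filtering on the filename rather than on the loaded data.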
It seems like it would be possible to create the dataset according to reasonable chunkings and then afterwards write any post-hoc loading scripts that seemed like they'd be especially important? ("Austrian Papers," "German-language papers", "Communist papers", "Papers that published in the 1870s," etc.?)
FWIW, my solution for this was to break up newspapers into multiple files only when they got above a certain size. There are a lot of weekly or monthly publications of only a few pages which run for 20-30 years that it might be overkill to break up by decade. I think that I chunked by day but no more--it would certainly make sense to round to the nearest year to trim corpus items.
In terms of metadata--I think that the smallest unit of text should be the page, and that as much non-redundant metadata as possible should be supplied about each page. Columnar compression means that this won't be especially wasteful.
FWIW, my solution for this was to break up newspapers into multiple files only when they got above a certain size. There are a lot of weekly or monthly publications of only a few pages which run for 20-30 years that it might be overkill to break up by decade. I think that I chunked by day but no more--it would certainly make sense to round to the nearest year to trim corpus items.
I guess we could also use bigger chunks if we're going to end up with some very small slices. This is probably a little unavoidable for titles with very few issues but perhaps using 20 year chunks or larger would make sense for these titles?
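One way to picture size-based chunking is a greedy pass over the years of a single title. This is only a sketch under assumptions: the page counts are made up, chunk_years is a hypothetical helper, and the threshold would need tuning against real data.

```python
def chunk_years(pages_per_year, max_pages):
    """Greedily group years (in sorted order) so each chunk stays at or below
    max_pages where possible; a single year above max_pages gets its own chunk."""
    chunks = []
    current, current_pages = [], 0
    for year in sorted(pages_per_year):
        pages = pages_per_year[year]
        if current and current_pages + pages > max_pages:
            # Close the running chunk as (first year, last year, total pages)
            chunks.append((current[0], current[-1], current_pages))
            current, current_pages = [], 0
        current.append(year)
        current_pages += pages
    if current:
        chunks.append((current[0], current[-1], current_pages))
    return chunks


# Made-up page counts: three quiet years and one very busy one
counts = {1850: 200, 1851: 220, 1852: 5000, 1853: 210}
print(chunk_years(counts, max_pages=1000))
```

With this approach a small weekly that runs for decades ends up in a few large spans, while a big daily is still split into manageable pieces.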
In terms of metadata--I think that the smallest unit of text should be the page, and that as much non-redundant metadata as possible should be supplied about each page. Columnar compression means that this won't be especially wasteful.
Agree with this. I also feel uneasy using suggested article segmentation information since the quality of that can be so variable and will vary between titles/dates of publication. Do you start from the ALTO XML in your current script?
Hi guys!
I worked with the Europeana Dumps last week/weekend. Here are some observations:
The language information is stored in the dc:language attribute of edm:ProvidedCHO. Normally, you would expect either a string or an array of languages, but it is mixed: for some issues it is a string and for others it is an array. So in our final metadata representation we should always use an array. Also, the language code is two-letter, e.g. "de" instead of "deu".
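A minimal sketch of the normalisation this implies, assuming the metadata has already been parsed into Python values. The helper name normalize_languages is hypothetical.

```python
def normalize_languages(value):
    """Return dc:language as a list, whether the source gave a string, a list or nothing."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


print(normalize_languages("de"))
print(normalize_languages(["de", "fr"]))
```

Always returning a list keeps the column type the same across issues, regardless of how the source record encoded the value.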
Regarding the OCR confidence: it is not stored in the metadata dump. It is stored in ALTO at word level (!), so for page-level or issue-level confidence it needs to be calculated manually. And you definitely need to download the ALTO dump for that!
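To make the word-level to page-level step concrete, here is a sketch that averages the WC (word confidence) attribute of ALTO String elements. It assumes WC is present on the words and uses an unweighted mean. The function name and the tiny sample document are made up for illustration.

```python
import statistics
import xml.etree.ElementTree as ET


def page_ocr_confidence(alto_xml):
    """Return (mean, population std) of word confidences for one ALTO page,
    or (None, None) if no String element carries a WC attribute."""
    root = ET.fromstring(alto_xml)
    confidences = []
    for element in root.iter():
        # Compare local names so this works with or without a namespace
        if element.tag.rsplit("}", 1)[-1] == "String" and "WC" in element.attrib:
            confidences.append(float(element.attrib["WC"]))
    if not confidences:
        return None, None
    return statistics.mean(confidences), statistics.pstdev(confidences)


# Made-up two-word page, just to exercise the function
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Vorarlberger" WC="0.9"/>
    <SP/>
    <String CONTENT="Zeitung" WC="0.7"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""
mean_ocr, std_ocr = page_ocr_confidence(sample)
print(mean_ocr, std_ocr)
```

Issue-level values could be derived the same way by pooling the confidences of all pages in an issue. Weighting by word length is another option this sketch does not attempt.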
For German and French I did create some plots that show the number of issues per year, based on the language information in the metadata and using the dcterms:issued information.
For German it is:
For French:
So it seems that French data is very limited. I talked to @cneud and a license change from public domain to Gallica could explain that.
I've also extracted plain text data from ALTO files for German. The resulting plain text file has a size of 63GB. For pretraining the German Europeana BERT models I've used an older dump and the resulting plain text data had a size of 51GB, so this newer dump is larger.
Thanks so much for that @stefan-it. @bmschmidt @stefan-it, my suggested next step is to start with the smallest dataset from that dump to get to a format we're happy with. This will likely involve starting from the ALTO XML.
I think between us we probably all have some code for doing the ALTO XML parsing, as a starting point, I suggest we share that code (either linking here or adding a pull request to this repository), so we're not starting from scratch.
Does that sound okay to you both?
Hi, just to briefly chime in (I hope I can devote more time to this tomorrow) - I have a lot of background info, provenance and documentation about these datasets. While I am not passionate about the data formatting, I would appreciate a lot if this information can somehow be integrated with the dataset (e.g. as a simple README.txt), as I often get questions about this and I believe there is a lot more relevant information available than what is shared on Europeana. Any thoughts on how to best include this are very welcome! Otherwise I can offer to write sth down as Markdown or plain text when we have a shared repo.
That would be great — one option would be to include this in the datacard? We could also include it as part of the dataset too.
It would also be great to have any context for this data. If you think there is anyone in particular at Europeana who would be good to keep in the loop about this work, let me know.
one option would be to include this in the datacard
Good suggestion, but indeed I wonder if the datacard will always be distributed with the data? If not, a simple README.txt might be more suitable perhaps?
anyone in particular at Europeana
Well, that would mainly be me as I was coordinator of the project where the data was produced :) I have also been working with/been in contact with ~20-30 researchers/initiatives that used this dataset, created subsets and derivatives etc which may also be worthwhile sharing. And I can also name a colleague employed by Europeana whom we should loop in once any concrete steps are taken.
Perfect! If you have time I'm happy to set up a meeting to discuss an approach that also works from the Europeana side? Would also be good to hear about any similar efforts, definitely don't want to duplicate existing work.
Great! I don't want to overload this with things from the past, but I think this would present a great opportunity to capture and document some of the background and context that have been sitting in my head/inbox/fragmented over multiple project websites for a while. Should we try to find a suitable date/time via email?
That sounds good, I'll drop you an email.
code for doing the ALTO XML parsing
Perhaps some of this code could be useful/repurposed:
Some initial input for the dataset card/README:
Hi @cneud , many thanks for that list!
I have one question left : was there any re-ocr done in the past years?
was there any re-ocr done in the past years
Unfortunately no. We are currently finalizing a report where we compare the old OCR quality with the performance that can be achieved with state-of-the-art neural OCR/layout analysis methods (such as e.g. our eynollah) and I can already say that the quality improvements by re-OCRing would be considerable. Europeana currently has no capacity to re-OCR though, and the computational and organisational effort for doing this in a distributed setting would likely require another project with funding :(
Here is a quick mapping from Europeana Dataset IDs to content providers
europeana-ID | library |
---|---|
9200359 | National Library of the Netherlands |
9200356 | National Library of Estonia |
9200301 | National Library of Finland |
9200408 | National Library of France (unpublished due to license) |
9200333 | Tessmann Library South-Tyrol |
9200303 | National Library of Latvia |
9200357 | National Library of Poland |
9200300 | Austrian National Library |
9200338 | Hamburg State and University Library |
9200355 | Berlin State Library |
9200339 | Belgrade University Library |
9200396 | National Library of Luxembourg |
Notes for discussion:
Info to include for each page:
{'OCRProcessing': {'processingDateTime': '2014-09-08',
'softwareCreator': 'ABBYY',
'softwareName': 'ABBYY FineReader Engine',
'softwareVersion': '11'},
'language': 'FR',
'mean_ocr': 0.8,
'std_ocr': 0.1,
'text': 'Text for page'}
@id: large_string
nc:text: large_string
newspaper_id: large_string
page: int32
dc:identifier: large_string
dc:language: large_string
dc:source: large_string
dc:subject: large_string
dc:title: large_string
dc:type: large_string
dc:extent: large_string
dc:isPartOf: large_string
dc:spatial: large_string
dc:relation: large_string
dc:hasPart: large_string
newspaper: large_string
dc:issued: date32[day]
Info to store in the path:
@davanstrien I will investigate some of the issues that have multiple languages in the dc:language field (resulting in an array as data type) for both dump and API.
Are ALTO formats consistent across collections?
Within Europeana Newspapers, all OCR xml files are consistent, in that they are all using ALTO schema version 2.0.
Info to include for each page:
{'OCRProcessing': {'processingDateTime': '2014-09-08', 'softwareCreator': 'ABBYY', 'softwareName': 'ABBYY FineReader Engine', 'softwareVersion': '11'},
This part should be identical for most files and could also be documented on global dataset level. If there are any different entries though, this would allow identifying pages that were also processed with article separation by CCS software (docWORKS) merely from the ALTO (i.e. without EDM or METS).
* sample pack * text mining pack * XML co-ordinates
I personally like the different "packs" example from the National Library of Luxembourg (see https://data.bnl.lu/data/historical-newspapers/ and scroll down a bit) - they offer different sizes and different flavours of the data. I wonder how much of the creation of such "packs" could be done dynamically by the loading script?
The plain text should be straightforward to extract (beware of hyphenation and reading order), but I suppose @stefan-it has already done that.
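For the hyphenation point, here is a sketch of plain-text extraction that uses the SUBS_TYPE and SUBS_CONTENT attributes ALTO provides for words split across lines. It assumes both halves of a hyphenated word carry SUBS_CONTENT, and it simply follows document order, which is not guaranteed to match reading order. The function names and sample are made up.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def alto_page_text(alto_xml):
    """Join the words of an ALTO page in document order, merging hyphenated words."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) != "String":
            continue
        subs_type = element.get("SUBS_TYPE")
        if subs_type == "HypPart1":
            # First half: prefer the full word if the OCR engine recorded it
            words.append(element.get("SUBS_CONTENT") or element.get("CONTENT", ""))
        elif subs_type == "HypPart2" and element.get("SUBS_CONTENT"):
            continue  # second half, assumed to be covered by the first half
        else:
            words.append(element.get("CONTENT", ""))
    return " ".join(word for word in words if word)


# Made-up sample with one word hyphenated across two lines
sample_page = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="cette"/><SP/>
      <String CONTENT="gran" SUBS_TYPE="HypPart1" SUBS_CONTENT="grange"/><HYP CONTENT="-"/></TextLine>
    <TextLine><String CONTENT="ge" SUBS_TYPE="HypPart2" SUBS_CONTENT="grange"/><SP/>
      <String CONTENT="ils"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""
print(alto_page_text(sample_page))
```

Emitting SUBS_CONTENT for both halves would duplicate the word in the output, which is why the second half is skipped here.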
A simplistic way to calculate OCR confidence per page is here, but there are certainly better ways that consider string length, compute mean/avg etc.
For those interested in image content in newspapers, it may be sufficient to extract bounding box coordinates of illustrations (and possibly also check <GraphicalElement>) and keep that with the IIIF image URLs for each page, so that snippets of any image content detected on the pages can be automatically collected.
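A sketch of how a bounding box could be turned into a cropped image URL, following the IIIF Image API pattern {base}/{region}/{size}/{rotation}/{quality}.{format}. It assumes the coordinates are already in pixels (ALTO files can use other measurement units, which would need converting first). The helper name and the coordinates are made up; the base URL is one of the ONB image service URLs that appears in the metadata in this thread.

```python
def iiif_region_url(service_base, x, y, width, height, size="full"):
    """Build a IIIF Image API URL for a rectangular pixel region of a page image.

    size="full" follows Image API 2.x; version 3 services use "max" instead.
    """
    region = f"{int(x)},{int(y)},{int(width)},{int(height)}"
    return f"{service_base.rstrip('/')}/{region}/{size}/0/default.jpg"


base = "http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001"
# Made-up bounding box for an illustration on the page
print(iiif_region_url(base, 120, 340, 800, 600))
```

Storing only the base URL and the box per illustration keeps the records small, since the cropped URL can be rebuilt on demand.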
As mentioned in the call, I'm slapping my parsing code online. As mentioned in the blog post, this is all throwaway notebooks I wrote primarily just to get the Neue Freie Presse out for a grad student in my class, but I suspect it wouldn't be crazy hard to get it working on the other ALTO-XML dumps, if desirable.
@bmschmidt @cneud @stefan-it
Just to let you know, I am currently putting some processing code together for this. I'm essentially Frankensteining the code you all shared already. I'll hopefully have something to share tomorrow.
Hi @davanstrien , I prepared a GIST that shows how to parse metadata information.
You basically just need to download the zip archives, there's no need to unpack them (it is all done in-memory):
https://gist.github.com/stefan-it/2b9b04caad3fd1d3ec94e5f1456cbd63
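For readers who don't want to open the gist, the general in-memory pattern looks roughly like this. This is a sketch of the technique using the standard library, not the code from the gist. The function name and the tiny demo archive are made up.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def iter_xml_members(zip_source):
    """Yield (member name, parsed XML root) for each .xml file in a zip archive,
    reading straight from the archive without extracting to disk."""
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue
            with archive.open(name) as handle:
                yield name, ET.parse(handle).getroot()


# Build a tiny archive in memory to demonstrate
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("issue/1.edm.xml", "<record><title>Example issue</title></record>")
    archive.writestr("issue/readme.txt", "not xml")
buffer.seek(0)

for name, root in iter_xml_members(buffer):
    print(name, root.findtext("title"))
```

zip_source can be a path to a downloaded archive or any file-like object, so the same function works on the real dump files.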
There are two examples:
Here's the list of issues with more than one detected language:
issues_with_more_languages.txt
It is pretty interesting, because we need to discuss them, e.g. this kind of entry:
3000051869684: MetaData(title='Österreichische Buchhändler-Correspondenz - 1870-02-20', year='1870', pages=8, languages=['==', 'de'])
Where == is mistakenly used as language identifier?!
The ALTO parsing stuff is coming in another GIST, soon :)
Thanks for this, @stefan-it. I have the alto parsing done (adapting code from @cneud) but feel free to share if it's ready anyway :) For the metadata, I'm currently getting this via the API (adapting @bmschmidt's code).
I will check to see how different these two sets of metadata are for an item. I assume the API should hold fresher metadata in theory, but I don't know how much of a difference this makes. If it's possible to use the dumps and there isn't much of a difference in the metadata, then we'll probably prefer to use the dump files instead.
I'm also adding the IIIF manifest URLs for each item, plus links to the IIIF image of the page (and for those items where the ALTO XML predicts illustrations, I'm including a list of IIIF URLs with those regions cropped).
@bmschmidt @cneud I don't feel it makes sense to include the full IIIF manifest in the records (just the URL) but let me know if you disagree. I would suggest including some demo code of how to grab the full manifest in the dataset card.
Where == is mistakenly used as language identifier?!
I'm shocked you don't speak == 😜
Example of metadata from dump:
{'rdf:RDF': {'@xmlns:cc': 'http://creativecommons.org/ns#',
'@xmlns:dc': 'http://purl.org/dc/elements/1.1/',
'@xmlns:dcterms': 'http://purl.org/dc/terms/',
'@xmlns:doap': 'http://usefulinc.com/ns/doap#',
'@xmlns:edm': 'http://www.europeana.eu/schemas/edm/',
'@xmlns:foaf': 'http://xmlns.com/foaf/0.1/',
'@xmlns:ore': 'http://www.openarchives.org/ore/terms/',
'@xmlns:owl': 'http://www.w3.org/2002/07/owl#',
'@xmlns:rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
'@xmlns:rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
'@xmlns:skos': 'http://www.w3.org/2004/02/skos/core#',
'@xmlns:svcs': 'http://rdfs.org/sioc/services#',
'@xmlns:wgs84_pos': 'http://www.w3.org/2003/01/geo/wgs84_pos#',
'edm:Place': {'@rdf:about': 'http://d-nb.info/gnd/4016680-6',
'skos:prefLabel': 'Feldkirch'},
'edm:ProvidedCHO': {'@rdf:about': 'http://data.theeuropeanlibrary.org/BibliographicResource/3000073475663',
'dc:identifier': 'oai:fue.onb.at:EuropeanaNewspapers_Delivery_2:ONB_00268/1850/ONB_00268_18500115.zip',
'dc:language': 'de',
'dc:source': {'@rdf:resource': 'http://anno.onb.ac.at/cgi-content/anno?apm=0&aid=voz&datum=18500115'},
'dc:subject': {'@rdf:resource': 'http://d-nb.info/gnd/4067510-5'},
'dc:title': 'Vorarlberger Zeitung - 1850-01-15',
'dc:type': [{'@rdf:resource': 'http://schema.org/PublicationIssue'},
{'#text': 'Analytic serial', '@xml:lang': 'en'},
{'#text': 'Newspaper', '@xml:lang': 'en'},
{'#text': 'Newspaper Issue', '@xml:lang': 'en'}],
'dcterms:extent': {'#text': 'Pages: 4', '@xml:lang': 'en'},
'dcterms:isPartOf': [{'@rdf:resource': 'http://data.theeuropeanlibrary.org/BibliographicResource/3000073527530'},
{'@rdf:resource': 'http://data.theeuropeanlibrary.org/Collection/a0600'},
{'#text': 'Europeana Newspapers', '@xml:lang': 'en'}],
'dcterms:issued': '1850-01-15',
'dcterms:spatial': {'@rdf:resource': 'http://d-nb.info/gnd/4016680-6'},
'edm:isNextInSequence': {'@rdf:resource': 'http://data.theeuropeanlibrary.org/BibliographicResource/3000073479497'},
'edm:type': 'TEXT'},
'edm:WebResource': [{'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001/full/full/0/default.jpg',
'svcs:has_service': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001'}},
{'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000002/full/full/0/default.jpg',
'edm:isNextInSequence': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001/full/full/0/default.jpg'},
'svcs:has_service': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000002'}},
{'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000003/full/full/0/default.jpg',
'edm:isNextInSequence': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000002/full/full/0/default.jpg'},
'svcs:has_service': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000003'}},
{'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000004/full/full/0/default.jpg',
'edm:isNextInSequence': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000003/full/full/0/default.jpg'},
'svcs:has_service': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000004'}}],
'ore:Aggregation': {'@rdf:about': 'http://data.theeuropeanlibrary.org/BibliographicResource/3000073475663#aggregation',
'edm:aggregatedCHO': {'@rdf:resource': 'http://data.theeuropeanlibrary.org/BibliographicResource/3000073475663'},
'edm:dataProvider': 'Österreichische Nationalbibliothek - Austrian National Library',
'edm:hasView': [{'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000002/full/full/0/default.jpg'},
{'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000003/full/full/0/default.jpg'},
{'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000004/full/full/0/default.jpg'}],
'edm:isShownAt': {'@rdf:resource': 'http://anno.onb.ac.at/cgi-content/anno?apm=0&aid=voz&datum=18500115'},
'edm:isShownBy': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001/full/full/0/default.jpg'},
'edm:object': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001/full/full/0/default.jpg'},
'edm:provider': {'#text': 'The European Library', '@xml:lang': 'en'},
'edm:rights': {'@rdf:resource': 'http://creativecommons.org/publicdomain/mark/1.0/'}},
'svcs:Service': [{'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000003',
'dcterms:conformsTo': {'@rdf:resource': 'http://iiif.io/api/image'},
'doap:implements': {'@rdf:resource': 'http://iiif.io/api/image/2/level2.json'}},
{'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000002',
'dcterms:conformsTo': {'@rdf:resource': 'http://iiif.io/api/image'},
'doap:implements': {'@rdf:resource': 'http://iiif.io/api/image/2/level2.json'}},
{'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000004',
'dcterms:conformsTo': {'@rdf:resource': 'http://iiif.io/api/image'},
'doap:implements': {'@rdf:resource': 'http://iiif.io/api/image/2/level2.json'}},
{'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001',
'dcterms:conformsTo': {'@rdf:resource': 'http://iiif.io/api/image'},
'doap:implements': {'@rdf:resource': 'http://iiif.io/api/image/2/level2.json'}}]}}
@albertvillanova does the Hugging Face datasets API support any standards for rich descriptions like this in the arrow metadata, at either the file or recordbatch level? It seems like a shame to throw it away. I've had on the back burner for a while a scheme to get ML people using the column description format of the W3C's CSV on the web spec, which is a bit too much to bite off here; but as a stopgap I often try to put some of this stuff into arrow metadata where it won't get into anyone's way. But sometimes loading scripts won't copy the metadata parts of the arrow schema.
(Sorry if I'm just making this over-complicated--I'm asking b/c I think this is an interesting test case of some places where these fields don't speak each other's language.)
@davanstrien Thanks for tackling all this. One small note--all the metadata I could find was of the form 'dc:title': 'Vorarlberger Zeitung - 1850-01-15', but I think for typical use cases it's important to drop the date information (which is captured in dcterms:issued) from the back of the title to allow more regular filtering.
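Dropping the date could be as simple as the sketch below, assuming every issue title ends in ' - YYYY-MM-DD' as in the example. The helper name is hypothetical, and titles with a different suffix pass through unchanged.

```python
import re

# Matches a trailing " - YYYY-MM-DD" and nothing else
DATE_SUFFIX_RE = re.compile(r"\s+-\s+\d{4}-\d{2}-\d{2}$")


def strip_issue_date(title):
    """Remove a trailing ' - YYYY-MM-DD' from an issue title, if present."""
    return DATE_SUFFIX_RE.sub("", title)


print(strip_issue_date("Vorarlberger Zeitung - 1850-01-15"))
```

Because the pattern is anchored at the end and requires spaces around the dash, hyphens inside a title are left alone.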
@davanstrien Thanks for tackling all this. One small note--all the metadata I could find was of the form 'dc:title': 'Vorarlberger Zeitung - 1850-01-15', but I think for typical use cases it's important to drop the date information (which is captured in dcterms:issued) from the back of the title to allow more regular filtering.
My current plan was to parse some fields that we consider particularly useful with a bit of additional validation and then to shove some other metadata (the full extent is still up for discussion) either into flattened columns or into a bit of a generic metadata dump column. I'm hoping to have a proper draft of this ready on Monday. I'll ping you @cneud @stefan-it to discuss the output.
A quick update on this:
I have some semi-working code with some rough edges here (https://github.com/davanstrien/altoxml2dataset).
Currently, the code:
I'm currently parsing the metadata from the XML metadata dumps on the website. I went back and forth about this but think this might be the best option for now with the possibility to give guidance on accessing the metadata from the API in the documentation for the dataset (more on this below).
For the metadata I currently get:
==. If someone finds out that this is significant in some way, we can add it back in.
[ ] What (other) metadata do we want to include in a 'flattened' format, i.e. include a specific column for that field with a nice name? I plan to also use the dates and the languages (where there is only a single one) in the filenames for the parquet files. We can return to this once we've agreed on the other fields and decide what level of granularity makes sense.
[ ] I have tried dumping most of the metadata inside a dictionary to a 'complete/additional metadata' column, but at the moment, Arrow gets upset at this. I think there are two options here:
I'm leaning toward the third option but happy to hear arguments in favour of the others.
An example instance (the filepaths will be tidied to ensure we don't have the parent directories):
{'fname': 'test_data/9200396/BibliographicResource_3000118436002/75.xml',
'text': "15 * Décembre qu’ils avoient fous leurs yeux qu’il falloit s’attacher, puifque le falut de tous en rélul- toit ; auffi-tôt il les guide , prend une échelle, & monte avec eux fur le toit de cette, grange grange , où ils paffent la nuit à y étouffer les étincelles étincelles & à faire tomber les charbons que l’activité l’activité des flammes lançoit continuellement fur cette couverture de chaume. Ainû votant leur patrimoine fe confumer près d’eux, ils fe dévouèrent généreufement à un travail qui devoir conferver au moins deux cents mai- fons, & qui en effet arrêta les progrès de l’incendie. On écrit de St. Euftache qu’un ouragan plus terrible que celui de 1738 a caufé des dommages confidérables à la Guadeloupe & à Grandiere : tous les vaiffeaux ont été jettés fort avant dans les terres, & 011 défef- pere de pouvoir les remettre en mer. Indépendamment Indépendamment de quelques maifons de pierre, toutes celles qui étoient bâties en bois, ont été abattues par la violence du vent. Cet ouragan étoit accompagné d’une pluie conli- dérable, qui en peu de tems forma une efpece de déluge. On vient d’établir, fous la protection du gouvernement, une manufacture de Aparté- rie au fauxbourg St. Antoine; c’eft une fabrication fabrication de cordages avec la plante que les naturalises appellent gramen fpartcnm. On connoit dans la marine le fparion , cordage de genêt d’Efpagne, d’Afrique & de Murcie; Murcie; d’un bon ufage, l'oit à l’eau de mer, foie à l’eau douce. Hé fleur Bcithc, qui dt 651",
'mean_ocr': 0.5280378429774902,
'std_ocr': 0.18456327090617758,
'bounding_boxes': [],
'item_id': '9200396/BibliographicResource_3000118436002',
'metadata_xml_fname': 'test_data/metadata/http%3A%2F%2Fdata.theeuropeanlibrary.org%2FBibliographicResource%2F3000118436002.edm.xml',
'title': 'Journal historique et littéraire',
'date': '1776-12-15',
'languages': ['fr'],
'item_iiif_url': 'https://iiif.europeana.eu/image/2TS6TSUK5ULAT2TQMYN7UGBDCPKQBLHQPTPDD6GGYB4QOZXR72EQ/presentation_images/bc994340-0232-11e6-a696-fa163e2dd531/node-3/image/BNL/Journal_historique_et_littéraire/1776/12/15/00547/full/full/0/default.jpg',
'multi_language': False}
There are quite a few rough edges/things to finish with the code, but I wanted to get input on these things before proceeding too far down one path.
I have made minimal effort to make any of this code performant (the only considerations on this front are using slotted classes and multiprocessing). Since this code will only be run occasionally, I don't think much optimization is worthwhile here, but I'll take a quick look for any easy improvements later this week.
Thanks for tackling this.
I don't know what you've got in additional metadata, but the simplest route to shoving 'etc' into a column is to encode as JSON before stuffing it in there. If it's relatively short that might be worth it. One thing to avoid at all costs is a plan that works for arrow-encoding each individual file but ends up with different schemas for different files.
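A sketch of the JSON route, assuming each record is a plain dict. The column name extra_metadata and the helper name are hypothetical. The point is that every row ends up with the same fixed set of columns, whatever extra keys a particular issue happens to have.

```python
import json


def pack_extra_metadata(record, core_fields):
    """Split a record into fixed core columns plus one JSON string column for everything else."""
    row = {field: record.get(field) for field in core_fields}
    extras = {key: value for key, value in record.items() if key not in core_fields}
    # sort_keys gives a stable encoding; ensure_ascii=False keeps titles readable
    row["extra_metadata"] = json.dumps(extras, sort_keys=True, ensure_ascii=False)
    return row


record = {
    "title": "Vorarlberger Zeitung",
    "date": "1850-01-15",
    "dc:subject": {"@rdf:resource": "http://d-nb.info/gnd/4067510-5"},
}
print(pack_extra_metadata(record, ["title", "date", "languages"]))
```

Since the extras are a single string column, files written from different providers keep an identical schema, and users can json.loads the column when they need it.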
Shorter notes, which are extremely pedantic and I'm sorry for that but I feel like that's the name of the game here.
* item_id which is obscure in both what an item is and what the name space is, I'd prefer a key called issue_uri of the form https://www.europeana.eu/item/9200396/BibliographicResource_3000118436002. Reason being that it's a sin to replace a universal id with a local one, and that it's easy to misread 'item_id' as referring to 'this page' rather than 'the issue containing this page.'
* 'fname': 'test_data/9200396/BibliographicResource_3000118436002/75.xml' and metadata_xml_fname capture a path that's very dump-dependent.
* id or @id: I would suggest 9200396/BibliographicResource_3000118436002/75 or 9200396/BibliographicResource_3000118436002$75.
* language to languages because it's a dc term in the original items. I don't think 'multi_language' is a necessary column.
* 1945-12 is a valid date) it's probably worth checking if automatic conversion works.
* 'item_iiif_url': 'https://iiif.europeana.eu/image/2TS6TSUK5ULAT2TQMYN7UGBDCPKQBLHQPTPDD6GGYB4QOZXR72EQ/presentation_images/bc994340-0232-11e6-a696-fa163e2dd531/node-3/image/BNL/Journal_historique_et_littéraire/1776/12/15/00547/full/full/0/default.jpg' Perhaps someone at Europeana can confirm if these hash-laden urls are the best we can do? Good lord, they're terrible.
'item_iiif_url': 'https://iiif.europeana.eu/image/2TS6TSUK5ULAT2TQMYN7UGBDCPKQBLHQPTPDD6GGYB4QOZXR72EQ/presentation_images/bc994340-0232-11e6-a696-fa163e2dd531/node-3/image/BNL/Journal_historique_et_littéraire/1776/12/15/00547/full/full/0/default.jpg'
...calling Europeana's @hugomanguinhas - any ideas or suggestions?
Even though the images are being served via a Europeana domain (via our gateway), they are actually hosted by the technical partner (PSNC) in the Europeana Newspapers project, and the URLs are based on the eCloud infrastructure, which has a very complex naming and versioning system. This is something we don't control, and I do agree that the URLs seem unnecessarily long and also make our IIIF output bigger than it could be.
@bmschmidt @cneud @hugomanguinhas thanks all; I will try and find time to work on this a bit more next week. I hope to have an initial version of the full output to review by then.
Apologies for the radio silence on this, I got busy with other things. I have blocked out some time to work on this later this week.
I finally have the start of a suggested approach for this dataset. The dataset can be found here: https://huggingface.co/datasets/biglam/europeana_newspapers.
This repo includes a sample of parquet files with text and some metadata organised by language and decade, i.e. language-decade.parquet.
There is also a loading script which allows you to load a subset of this data either using an existing language configuration which you can list using:
from datasets import get_dataset_config_names
get_dataset_config_names("biglam/europeana_newspapers")
>>> ['sv', 'fi']
or you can pass in a list of languages you want to load:
from datasets import load_dataset
dataset = load_dataset("biglam/europeana_newspapers", languages=["sv","fi"])
There is also an option to filter by min/max decade:
from datasets import load_dataset
dataset = load_dataset("biglam/europeana_newspapers", languages=["sv","fi"], min_decade=1910)
All of these options will only download the data required: if you only ask for fr, you won't download any other languages. Although the data size is much reduced from the original XML files, I think it's still desirable to make it easy to filter before downloading as far as practical.
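To illustrate why nothing extra gets downloaded, here is a sketch of how a request could be resolved to file names before any download happens. This is not the actual loading script. It assumes files are named like sv-1910.parquet, and files_for_request is a hypothetical helper.

```python
def files_for_request(available, languages, min_decade=None, max_decade=None):
    """Pick the language-decade.parquet files that a request would need."""
    wanted = []
    for name in sorted(available):
        # "sv-1910.parquet" -> language "sv", decade "1910"
        language, _, decade = name.rsplit(".", 1)[0].partition("-")
        if language not in languages:
            continue
        decade = int(decade)
        if min_decade is not None and decade < min_decade:
            continue
        if max_decade is not None and decade > max_decade:
            continue
        wanted.append(name)
    return wanted


# Made-up listing of files in the repository
available = [
    "fi-1900.parquet",
    "fi-1910.parquet",
    "sv-1900.parquet",
    "sv-1920.parquet",
    "fr-1910.parquet",
]
print(files_for_request(available, ["sv", "fi"], min_decade=1910))
```

Only the names returned here would be passed on to the download step, so the other languages and decades are never fetched.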
The loading script still has some rough edges, but hopefully, this gives a sense of how it will work.
I felt that shoving all the extra metadata into a column was getting a bit clunky. Instead, I'm going to outline in the README/Datacard how to do this using the Europeana API. This will give people a sense of how they can grab the metadata they need. I will also create another version of this dataset that includes this extra metadata (also grabbed using the API). We can then see which ends up being downloaded more often, which may give us some sense of how much the potential users of this data value the additional metadata.
I will add all of the data for this shortly, but I thought this already gave a sense of how things would work.
cc @cneud @bmschmidt @stefan-it @cakiki
I would be happy to have feedback if you think anything is missing/could be improved
Thank you @davanstrien! It would be awesome if one could also load subsets e.g. based on their OCR confidence, but since this information is not included in the Europeana metadata but only in the ALTO files I don't think it can be done so easily.
Generally I agree that due to different interests and also size constraints, offloading the fetching of additional metadata to the use of the Europeana API is a reasonable way forward. I will also try to contribute to the README/Datacard.
Thank you @davanstrien! It would be awesome if one could also load subsets e.g. based on their OCR confidence, but since this information is not included in the Europeana metadata but only in the ALTO files I don't think it can be done so easily.
There would be a way to filter by ocr confidence as part of the loading script but this would still involve downloading all of the data before then so probably isn't worth it. It might also end up being more efficient to do it once the dataset is loaded since you can then also use multiprocessing i.e. something like
good_ocr_ds = ds.filter(lambda x: x['mean_ocr']>=0.9, num_proc=8)
Generally I agree that due to different interests and also size constraints, offloading the fetching of additional metadata to the use of the Europeana API is a reasonable way forward. I will also try to contribute to the README/Datacard.
That would be great, I'll make a start on that soon!
A URL for this dataset
https://pro.europeana.eu/page/iiif#download
Dataset description
This is a dataset of historic newspapers digitised by various national libraries and made available via the Europeana platform.
Dataset modality
Text
Dataset licence
Other license
Other licence
Public Domain Mark for full text and http://creativecommons.org/publicdomain/zero/1.0/ for the metadata
How can you access this data
As a download from a repository/website
Confirm the dataset has an open licence
Contact details for data custodian
No response