NatLibFi / bib-rdf-pipeline

Scripts and configuration for converting MARC bibliographic records into RDF
Creative Commons Zero v1.0 Universal

Structured page counts are not valid in schema.org #19

Open osma opened 7 years ago

osma commented 7 years ago

Our MARC records have structured page counts, e.g. "vii, 89, 31 s.". However, Schema.org defines only a single integer property, schema:numberOfPages, so the structured values are not really valid Schema.org.

Maybe we should convert those structured counts into a single number? It can't be done in SPARQL easily (Roman numerals!) but a relatively simple filter script (e.g. Python) could do it.
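A minimal sketch of what the core of such a script might look like. It assumes the extent is a comma-separated list whose parts each start with an Arabic or lowercase/uppercase Roman number, and it simply sums them; the function names `roman_to_int` and `total_pages` are made up for illustration, and real catalog data will certainly contain cases this does not handle.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

# Strict pattern so that abbreviations such as "ill." are not read as numerals.
STRICT_ROMAN = re.compile(
    r"m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})"
)


def roman_to_int(numeral):
    """Convert a well-formed Roman numeral (any case) to an integer."""
    total = 0
    prev = 0
    # Read right to left: a smaller value before a larger one is subtracted.
    for ch in reversed(numeral.lower()):
        value = ROMAN_VALUES[ch]
        if value < prev:
            total -= value
        else:
            total += value
            prev = value
    return total


def total_pages(extent):
    """Sum the leading number of each comma-separated part of an extent."""
    total = 0
    for part in extent.split(","):
        words = part.split()
        if not words:
            continue
        # Drop trailing dots and the square brackets used for unnumbered pages.
        token = words[0].strip(".[]").lower()
        if re.fullmatch(r"[0-9]+", token):
            total += int(token)
        elif token and STRICT_ROMAN.fullmatch(token):
            total += roman_to_int(token)
    return total


print(total_pages("vii, 89, 31 s."))  # 127
```

Parts that do not start with a recognizable number are silently skipped, so a string with no numbers yields 0 rather than an error; whether that is the right behaviour for the pipeline is an open question.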

osma commented 7 years ago

Here is some R code to normalize structured page counts: https://github.com/rOpenGov/bibliographica/blob/master/R/estimate_pages.R

antagomir commented 7 years ago

Normalizing structured page counts is not as straightforward as it first looks. There are many reasons: spelling variations, ambiguous cases, terms and stopwords from multiple languages, and the handling of various exceptions. Moreover, many documents have only cover page information, which gives a misleading page count estimate if converted directly. Anyway, the R code cited above is essentially ready, backed up by unit tests and extensive manual checking, and it cleans up page counts for the complete Fennica catalog.

osma commented 7 years ago

@antagomir Thanks, looks really useful! The whole point of this pipeline is to stitch together existing tools instead of reinventing the wheel. Probably I just need to implement some glue code, e.g. a filter that can take N-Triples with structured page counts from stdin and output normalized page counts on stdout, using your normalization function behind the scenes.
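Roughly, the glue could look like the sketch below. It assumes the structured counts arrive as plain literals on schema:numberOfPages triples, one triple per line; `placeholder_normalize` is a hypothetical stand-in (it only sums Arabic numbers) for the real normalization function, and `filter_line` and `main` are illustrative names, not existing pipeline code.

```python
import re
import sys

NUMBER_OF_PAGES = "<http://schema.org/numberOfPages>"
XSD_INTEGER = "<http://www.w3.org/2001/XMLSchema#integer>"

# Matches: subject, the numberOfPages predicate, a quoted literal
# (optionally followed by a language tag or datatype), and the final dot.
TRIPLE_RE = re.compile(
    r'^(?P<subject>\S+)\s+' + re.escape(NUMBER_OF_PAGES) +
    r'\s+"(?P<value>(?:[^"\\]|\\.)*)"\S*\s*\.\s*$'
)


def placeholder_normalize(extent):
    """Hypothetical stand-in: sums Arabic numbers only, ignores Roman numerals."""
    numbers = re.findall(r"[0-9]+", extent)
    return sum(int(n) for n in numbers) if numbers else None


def filter_line(line, normalize=placeholder_normalize):
    """Rewrite a numberOfPages triple; pass every other line through unchanged."""
    match = TRIPLE_RE.match(line)
    if match is None:
        return line
    pages = normalize(match.group("value"))
    if pages is None:
        # Nothing usable found, so keep the original triple.
        return line
    return '{} {} "{}"^^{} .\n'.format(
        match.group("subject"), NUMBER_OF_PAGES, pages, XSD_INTEGER
    )


def main(stdin=sys.stdin, stdout=sys.stdout, normalize=placeholder_normalize):
    """Stream N-Triples from stdin to stdout, normalizing page counts."""
    for line in stdin:
        stdout.write(filter_line(line, normalize))
```

Calling `main()` at the end of a script turns this into a stdin/stdout filter. The `normalize` parameter is where the actual normalization function would be plugged in.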

antagomir commented 7 years ago

Similar things could be considered for the other fields as well.

osma commented 6 years ago

The proposed materialExtent property in Schema.org could be used to represent the original, structured page count. Still, the normalized page count is probably much more useful for most analysis purposes. We could simply provide them both.
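As a rough illustration of providing both, the sketch below emits two N-Triples lines for one record. The function names are hypothetical, the property URI for materialExtent is an assumption based on the proposal, and the normalized count is taken as an input rather than computed here.

```python
def escape_literal(text):
    """Escape the characters that may not appear raw in an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def page_count_triples(subject, extent, pages):
    """Return one triple for the original extent and one for the normalized count."""
    return [
        # Assumed URI for the proposed property.
        '{} <http://schema.org/materialExtent> "{}" .'.format(
            subject, escape_literal(extent)
        ),
        '{} <http://schema.org/numberOfPages> "{}"'
        '^^<http://www.w3.org/2001/XMLSchema#integer> .'.format(subject, int(pages)),
    ]


for triple in page_count_triples("<http://example.org/b1>", "vii, 89, 31 s.", 127):
    print(triple)
```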

antagomir commented 6 years ago

Yes, I agree, and both may be needed. Hopefully we can soon get more active on this again.