NationalMuseumAustralia / Collection-API

The public web API of the National Museum of Australia

EMu dates in the wrong order are not handled #94

Closed Conal-Tuohy closed 5 years ago

Conal-Tuohy commented 6 years ago

Dates in the ProEarliestDate0 and ProLatestDate0 fields of EMu object records are sometimes entered in the wrong order.

These fields are converted into crm:P82a_begin_of_the_begin and crm:P82b_end_of_the_end properties in the RDF data, and later these two properties are brought together to create a solr.DateRangeField field named datestamp. Solr then rejects the deposit of records whose datestamp fields are out of order:

<str name="msg">ERROR: [doc=object/7704] Error adding field 'temporal_date'='[2006 TO 1971]' msg=Wrong order: 2006 TO 1971</str>
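A minimal sketch of why the range is rejected, assuming year-only values like those in the error above. The function name `solr_year_range` is hypothetical and is not part of the Collection-API code; it only mimics the ordering check that Solr applies to a date range literal.

```python
def solr_year_range(begin_year: int, end_year: int) -> str:
    """Build a '[begin TO end]' range literal from two years.

    Hypothetical helper: raises ValueError when the years are out of
    order, imitating the 'Wrong order' rejection reported by Solr.
    """
    if begin_year > end_year:
        raise ValueError(f"Wrong order: {begin_year} TO {end_year}")
    return f"[{begin_year} TO {end_year}]"


# A correctly ordered pair produces a range literal.
ordered = solr_year_range(1971, 2006)

# The pair from the error message fails the check.
try:
    solr_year_range(2006, 1971)
    rejected = False
except ValueError:
    rejected = True
```

Running a check like this before deposit would surface the problem in the pipeline rather than in the Solr response.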

Conal-Tuohy commented 6 years ago

I think we can easily enough re-order the dates when we convert the source XML to RDF, but it would be good to record when that happens and feed it back to registration so the source data can be fixed. Maybe this is something we can look at more generally in the next major phase of work? That is, we could record any apparent data quality issues that we can detect (perhaps as part of the RDF mapping process, or perhaps subsequently by using SHACL shapes to validate the RDF) and store them in the RDF of the internal API, where you could search for them and view the errors. Then the API could help to serve the registration people's QA efforts.
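As a rough illustration of recording issues instead of silently fixing them, the sketch below collects out-of-order date pairs into a list that could be reported back to registration. It is plain Python standing in for the RDF or SHACL approach mentioned above; the names `DateOrderIssue`, `check_date_order` and `collect_issues` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class DateOrderIssue:
    """One detected problem, kept so it can be reported rather than hidden."""
    record_id: str
    earliest: int
    latest: int
    message: str


def check_date_order(record_id: str, earliest: Optional[int],
                     latest: Optional[int]) -> Optional[DateOrderIssue]:
    """Return an issue when earliest is after latest, otherwise None."""
    if earliest is None or latest is None or earliest <= latest:
        return None
    return DateOrderIssue(
        record_id, earliest, latest,
        f"earliest date {earliest} is after latest date {latest}",
    )


def collect_issues(
    records: Iterable[Tuple[str, Optional[int], Optional[int]]]
) -> List[DateOrderIssue]:
    """Check every (id, earliest, latest) tuple and keep the failures."""
    issues = []
    for record_id, earliest, latest in records:
        issue = check_date_order(record_id, earliest, latest)
        if issue is not None:
            issues.append(issue)
    return issues


# Illustrative input only; the second and third ids are made up.
sample = [("object/7704", 2006, 1971), ("object/1", 1900, 1950), ("object/2", None, 1980)]
found = collect_issues(sample)
```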

Conal-Tuohy commented 6 years ago

Karen Peterson says:

The date range issue is more complicated than dates being switched around. It is primarily related to what was probably an import in which the original date was mm/yyyy and was split, with '20' added to the mm to form the earliest date and the yyyy moved to the latest date (e.g. 11/1934 became 2011 and 1934). Weird, but I believe this happened some time ago. Not all the examples are like this though. We've been working steadily through them, but this requires some verification work as well. There are about 300 to go that have an API marker in EMu. Agreed, it would still be a good idea to keep a record of data issues, as some dates might genuinely be the wrong way around, and it would also help identify data problems elsewhere.
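The import pattern described above can be sketched as a heuristic that flags candidate records for manual verification. This is only a guess at a detection rule based on the single example given (11/1934 becoming 2011 and 1934); the function name `guess_original_month_year` is hypothetical, and a match does not prove the record was affected, since not all examples follow this pattern.

```python
from typing import Optional


def guess_original_month_year(earliest: int, latest: int) -> Optional[str]:
    """Suggest a possible original 'mm/yyyy' value for an out-of-order pair.

    Assumes the earliest year was formed by prefixing '20' to a month
    number (01 to 12), so only earliest years 2001 to 2012 can match.
    Returns None when the pair does not fit that pattern.
    """
    # Correctly ordered pairs are not candidates.
    if earliest <= latest:
        return None
    month = earliest - 2000
    if 1 <= month <= 12:
        return f"{month:02d}/{latest}"
    return None


candidate = guess_original_month_year(2011, 1934)   # fits the pattern
no_match = guess_original_month_year(1990, 1980)    # out of order, but no month prefix
in_order = guess_original_month_year(1934, 2011)    # nothing to flag
```

Any suggestion produced this way would still need checking against the source record before the data is changed.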

... so clearly it's not a good idea for the ETL pipeline to switch the dates around.

f27wood commented 6 years ago

Instead of removing the whole record, exclude the date fields.
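A minimal sketch of that suggestion, assuming the document is held as a Python dict before it is sent to Solr and that the range is a year-only string like the one in the error message. The function `drop_misordered_range` is hypothetical; the field name `temporal_date` is taken from the error above.

```python
import re
from typing import Any, Dict

# Matches year-only ranges such as "[2006 TO 1971]".
RANGE_PATTERN = re.compile(r"^\[(\d{4}) TO (\d{4})\]$")


def drop_misordered_range(doc: Dict[str, Any],
                          field: str = "temporal_date") -> Dict[str, Any]:
    """Return a copy of doc without the date range field if it is out of order.

    All other fields are kept, so the record can still be deposited.
    Documents with a valid, missing, or unrecognised range are returned as-is.
    """
    value = doc.get(field)
    match = RANGE_PATTERN.match(value) if isinstance(value, str) else None
    if match and int(match.group(1)) > int(match.group(2)):
        return {key: val for key, val in doc.items() if key != field}
    return doc


bad = {"id": "object/7704", "temporal_date": "[2006 TO 1971]"}
good = {"id": "object/1", "temporal_date": "[1971 TO 2006]"}
cleaned_bad = drop_misordered_range(bad)
cleaned_good = drop_misordered_range(good)
```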

f27wood commented 5 years ago

Data issue for NMA to resolve.