Closed Conal-Tuohy closed 5 years ago
I think we can easily enough re-order the dates when we convert the source XML to RDF, but it would be good to record when that happens to feed back to registration so the source data can be fixed. Maybe this is something we can look at more generally in the next major phase of work? i.e. we could (perhaps as part of the RDF mapping process, or perhaps subsequently using SHACL shapes to validate the RDF) record any apparent data quality issues which we can detect, and store them in the RDF of the internal API, where you could search for them and view the errors. Then the API could help to serve the registration people's QA efforts.
Karen Peterson says:
The date range issue is more complicated than dates being switched around. It is primarily related to what was probably an import in which the original date was mm/yyyy and was split, with a '20' added to the mm for the earlier date and the yyyy moved to the latest date (e.g. 11/1934 became 2011 & 1934). Weird, but we believe this happened some time ago. Not all the examples are like this though. We've been working steadily through them, but this requires some verification work as well. There are about 300 to go that have an API marker in EMu. Agreed, it would still be a good idea to have the record of data issues, as some might genuinely be around the wrong way, as well as to identify data problems elsewhere.
... so clearly it's not a good idea for the ETL pipeline to switch the dates around.
Instead of removing the whole record, exclude the date fields.
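As a rough illustration of that approach, here is a minimal Python sketch (not the pipeline's actual code). It assumes each EMu record is available as a dict keyed by field name, that date values begin with a four-digit year, and that the record identifier lives in a field called `irn`; the `DateIssue` class and `check_dates` function are hypothetical names. It compares years only, drops just the two date fields when they are out of order, and notes when the values match the mm/yyyy import pattern Karen describes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

EARLIEST = "ProEarliestDate0"
LATEST = "ProLatestDate0"


@dataclass
class DateIssue:
    """Hypothetical record of a detected data quality problem."""
    record_id: str
    earliest: str
    latest: str
    note: str


def leading_year(value: str) -> Optional[int]:
    """Return the leading four-digit year of a date string, or None."""
    head = value.strip()[:4]
    return int(head) if len(head) == 4 and head.isdigit() else None


def check_dates(record: dict) -> Tuple[dict, Optional[DateIssue]]:
    """Return the record (minus bad date fields) and any issue found."""
    earliest = record.get(EARLIEST)
    latest = record.get(LATEST)
    if not earliest or not latest:
        return record, None

    start, end = leading_year(earliest), leading_year(latest)
    if start is None or end is None or start <= end:
        return record, None

    # Dates are out of order: keep the record, but exclude both date fields
    # rather than guessing which way round they should be.
    cleaned = {k: v for k, v in record.items() if k not in (EARLIEST, LATEST)}

    # Assumed heuristic: an "earliest" year of 2001..2012 may be '20' + month
    # left over from a split mm/yyyy value (e.g. 11/1934 -> 2011 & 1934).
    if 2001 <= start <= 2012:
        note = f"possible mm/yyyy import error (originally {start - 2000:02d}/{end}?)"
    else:
        note = "earliest date is after latest date"

    issue = DateIssue(str(record.get("irn", "")), earliest, latest, note)
    return cleaned, issue
```

The returned `DateIssue` objects could then be written out wherever the data quality issues end up being stored, so registration staff can verify each case before the source data is changed.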
Data issue for NMA to resolve.
Dates in the `ProEarliestDate0` and `ProLatestDate0` fields of EMu object records are sometimes entered in the wrong order. These fields are converted into `crm:P82a_begin_of_the_begin` and `crm:P82b_end_of_the_end` properties in the RDF data, and later these two properties are brought together to create a `solr.DateRangeField` field named `datestamp`. Solr then rejects the deposit of records whose `datestamp` fields are out of order: