NationalMuseumAustralia / Collection-API

The public web API of the National Museum of Australia

Daily incremental updates #39

Open staplegun opened 6 years ago

staplegun commented 6 years ago

ETL process handles:

  1. Full reloads from EMu & Piction exports
  2. Daily incremental updates of changed records only, from EMu & Piction

Incremental changes to entity records (party/place/collection/media) should ripple through to their details nested inside relevant object/narrative records.
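A minimal sketch of that ripple step, assuming the ETL can build a map from each object/narrative record to the entity ids nested inside it. The function name, the `references` map and the id format are hypothetical illustrations, not part of the actual ETL code.

```python
from typing import Dict, Set


def records_to_refresh(changed_entities: Set[str],
                       references: Dict[str, Set[str]]) -> Set[str]:
    """Return the ids of object/narrative records that nest any changed entity.

    `references` maps a record id to the set of entity ids (party, place,
    collection, media) embedded in it. This structure is an assumption
    made for illustration only.
    """
    return {
        record_id
        for record_id, entity_ids in references.items()
        if entity_ids & changed_entities  # non-empty intersection
    }


# Hypothetical example data
references = {
    "object/1": {"party/10", "place/3"},
    "object/2": {"party/11"},
    "narrative/5": {"media/7"},
}
stale = records_to_refresh({"party/10", "media/7"}, references)
```

Records returned by such a lookup would then be regenerated alongside the records that changed directly in EMu or Piction.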

f27wood commented 6 years ago

Noting that the following records were updated with an educational significance on Friday 17 Aug, and are yet to update in the prod API.

https://data.nma.gov.au/object/130126

https://data.nma.gov.au/object/235522

https://data.nma.gov.au/object/235906

Conal-Tuohy commented 6 years ago

Looking at https://data.nma.gov.au/object/235906 the <CreProvenenance> is present in the file, and the RDF does include statements that assert the object is the subject of a linguistic object whose identifier is <http://data.nma.gov.au/object/235906#educationalSignificance> and which in turn has an rdf:value property of "Defining Moment: Archaeological evidence of first peoples on the Australian continent (about 20,000 years ago).\nCurriculum: History\nSchool years: 4, 7".

The JSON-LD object in Solr doesn't have this educational significance data, but it does have a statement which says the Solr record is based on data last updated on the 20th: <http://data.nma.gov.au/object/235906> <http://purl.org/dc/terms/modified> "2018-08-20"^^<http://www.w3.org/2001/XMLSchema#date>, which matches the EMu record's metadata: <AdmDateModified>20/08/2018</AdmDateModified>.

The Solr ETL log shows it being updated, too: Message: 2018-08-23T16:31:20.679+10:00 copying http://data.nma.gov.au/object/235906# from public dataset ...

There are still some objects being presented to Solr with multiple modified dates; a few seconds later in the log there appears:

Message: 2018-08-23T16:31:23.411+10:00 copying http://data.nma.gov.au/object/255852# from public dataset ...
Message: Error depositing resource <http://data.nma.gov.au/object/255852#> in Solr

The detailed error logged for this record says: ERROR: [doc=object/255852] Error adding field 'modified'='[2018-07-09, 2018-08-22]' msg=Multiple values encountered for non multiValued copy field modified_sort: 2018-08-22
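One way to find every affected record before it reaches Solr would be to scan the triples for subjects carrying more than one dc:modified value. This is only a sketch: it assumes the triples are available as plain (subject, predicate, object) string tuples, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

DC_MODIFIED = "http://purl.org/dc/terms/modified"


def subjects_with_multiple_modified(
    triples: Iterable[Tuple[str, str, str]]
) -> Dict[str, List[str]]:
    """Map each subject with more than one distinct dc:modified value
    to the sorted list of those values."""
    values = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == DC_MODIFIED:
            values[subject].add(obj)
    return {s: sorted(v) for s, v in values.items() if len(v) > 1}


# Values taken from the log excerpts above
triples = [
    ("http://data.nma.gov.au/object/235906", DC_MODIFIED, "2018-08-20"),
    ("http://data.nma.gov.au/object/255852", DC_MODIFIED, "2018-07-09"),
    ("http://data.nma.gov.au/object/255852", DC_MODIFIED, "2018-08-22"),
]
suspect = subjects_with_multiple_modified(triples)
```

Any subject reported here would be rejected by Solr with the same "Multiple values encountered" error, because the copy field modified_sort is not multiValued.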

Conal-Tuohy commented 6 years ago

So we still have apparent data corruption, but disk space does not seem to be a factor (the ETL is still running, and there is still more than 5 GB of free disk space).

f27wood commented 6 years ago

Hmm... is this only a problem on prod? If so, what is different? And is it only a problem with the incremental update? I wonder whether the same data corruption will occur when the full ETL runs tonight.

Conal-Tuohy commented 6 years ago

I have replicated the problem on nma-dev.conaltuohy.com, which is good. It's not a problem with incremental update (only the Solr index is updated incrementally, and this is a problem with loading data into the SPARQL store, which is always rebuilt from scratch). It can affect the incremental update of Solr because sometimes the corrupted data affects the dc:modified properties which are used to determine if a Solr record needs updating, but this is a side issue really.
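To illustrate why corrupted dc:modified values disturb the incremental Solr update, here is a rough sketch of a date-based check. The exact comparison the ETL performs is not described in this thread, so the "newer than" rule and the function name are assumptions.

```python
from datetime import date
from typing import Optional


def needs_solr_update(source_modified: date,
                      indexed_modified: Optional[date]) -> bool:
    """Decide whether a record should be re-sent to Solr.

    Assumed rule: update when the record is not yet indexed, or when
    the source dc:modified date is later than the indexed one.
    """
    return indexed_modified is None or source_modified > indexed_modified


# Hypothetical checks using dates that appear in this thread
fresh = needs_solr_update(date(2018, 8, 20), date(2018, 8, 20))
stale = needs_solr_update(date(2018, 8, 22), date(2018, 7, 9))
unindexed = needs_solr_update(date(2018, 8, 20), None)
```

A check like this presumes exactly one dc:modified value per record, which is why a record that ends up with two dates in the SPARQL store can be compared wrongly or rejected outright.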

Conal-Tuohy commented 6 years ago

So my current theory is that there is a bug in the Apache Jena utility tdb2.tdbloader, which is the script we use to populate the graph store. I'm trying out some alternatives on Amazon.

f27wood commented 6 years ago

Any luck with this?

Conal-Tuohy commented 6 years ago

I believe this is sorted. The data corruption problem #83 was causing some records not to update, but this actually had nothing to do with incremental updating; those records would not update in Solr either incrementally or as part of a full load, because the record itself was rejected by Solr for having more than one datestamp value.

f27wood commented 6 years ago

Tested with a new record and it seems to be working OK.