NationalMuseumAustralia / Collection-API

The public web API of the National Museum of Australia

Handle file names for EMu and Piction export data #53

Closed staplegun closed 5 years ago

staplegun commented 6 years ago

Piction exports are in a single file, EMu exports are in multiple files (one for each module type). The ETL ingest step takes the entity type as an input parameter.

Currently the file names do not always match the contained entity type. So the ETL should either map from the file name to the appropriate entity type, or automatically determine the entity type (e.g. by scanning a few records). It can then call the ingest step with the appropriate entity type parameter.
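
As a rough illustration of the "scan a few records" option, here is a minimal Python sketch. It is not the ETL's actual code: the element names in `MARKER_TO_ENTITY`, the entity type labels, and the function name `sniff_entity_type` are all hypothetical, since the structure of the export files is not described in this issue.

```python
import xml.etree.ElementTree as ET
from itertools import islice

# Hypothetical mapping from a marker element to an entity type.
# The real EMu and Piction exports may use entirely different names.
MARKER_TO_ENTITY = {
    "ObjectTitle": "object",
    "NarrativeTitle": "narrative",
    "PartyName": "party",
}


def sniff_entity_type(path, max_elements=200):
    """Return the entity type suggested by the first few elements, or None."""
    with open(path, "rb") as handle:
        # iterparse streams the file, so only the start of a large export is read.
        events = ET.iterparse(handle, events=("start",))
        for _event, element in islice(events, max_elements):
            entity = MARKER_TO_ENTITY.get(element.tag)
            if entity is not None:
                return entity
    # Nothing recognisable near the top of the file.
    return None
```

Returning `None` rather than guessing lets the caller decide whether to skip the file or fail the run before calling the ingest step.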

Conal-Tuohy commented 6 years ago

Addressed in https://github.com/Conal-Tuohy/NMA-API-ETL/commit/3922a3fda7428b43dfb25c6104f40df9610aeccf and https://github.com/Conal-Tuohy/NMA-API-ETL/commit/b0de4ff70947b5192e4633624fa688b0c60c29c1

The data files are currently still loaded by searching for files with a particular name, but the file type is determined by actually looking at the content of the file.

Conal-Tuohy commented 6 years ago

Still to do: load the Piction ('solr') file from a distinct location (see info from Rick), and load all the EMu files from the folder rather than searching for files of a particular name, since we no longer need to know the file names in order to map them to RDF.
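
Loading everything in the folder, regardless of file name, could look something like the Python sketch below. The function name `ingest_folder` and its two callables are hypothetical stand-ins: `detect_entity_type` represents whatever content check the ETL uses, and `ingest` represents the existing ingest step that takes an entity type parameter.

```python
from pathlib import Path


def ingest_folder(folder, detect_entity_type, ingest):
    """Ingest every XML file in `folder`, ignoring what the files are called."""
    skipped = []
    for path in sorted(Path(folder).glob("*.xml")):
        entity_type = detect_entity_type(path)
        if entity_type is None:
            # Unrecognised content: report it rather than guess a type.
            skipped.append(path)
            continue
        ingest(path, entity_type)
    return skipped
```

The list of skipped paths is returned so that a run can log any file whose content was not recognised.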

staplegun commented 6 years ago

Agreed process:

  1. EMu and Piction extraction runs daily (Mon-Fri) at 7:00-8:00pm, output to
    • /mnt/emu_data/full/<coded-filename>.xml
    • /mnt/dams_data/solr_prod1.xml
  2. API ETL will run daily (Mon-Fri) at 9:00pm
  3. Afterwards, an archive of all files involved is placed in a job directory /mnt/emu_data/etl/<job-name>/
    • EMu files are moved
    • Piction files are copied
    • log files are copied
  4. Archived data files over 2 weeks old are deleted, archived log files over 6 months old are deleted.
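
Steps 3 and 4 of the process above could be sketched in Python roughly as follows. This is only an illustration under several assumptions that the thread does not state: archived data files end in `.xml`, log files end in `.log`, "6 months" is approximated as 183 days, and a file's age is taken from its modification time. The function names are hypothetical.

```python
import shutil
import time
from pathlib import Path

DAY = 24 * 60 * 60  # seconds


def archive_job(job_dir, emu_files, piction_file, log_files):
    """Move EMu files, and copy the Piction and log files, into job_dir."""
    job_dir = Path(job_dir)
    job_dir.mkdir(parents=True, exist_ok=True)
    for path in emu_files:
        # EMu files are moved, so the source folder is empty for the next run.
        shutil.move(str(path), str(job_dir / Path(path).name))
    for path in [piction_file, *log_files]:
        # Piction and log files are copied, leaving the originals in place.
        shutil.copy2(path, job_dir / Path(path).name)


def prune_archives(etl_root, now=None, data_days=14, log_days=183):
    """Delete archived .xml files older than data_days and .log files older than log_days."""
    now = time.time() if now is None else now
    limits = {".xml": data_days * DAY, ".log": log_days * DAY}
    removed = []
    for path in Path(etl_root).glob("*/*"):  # files directly inside each job directory
        limit = limits.get(path.suffix)
        if limit is not None and path.is_file() and now - path.stat().st_mtime > limit:
            path.unlink()
            removed.append(path)
    return removed
```

Because `shutil.copy2` preserves modification times, the age used for pruning here reflects when a file was produced rather than when it was archived.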

SimmoK commented 6 years ago

Sounds good to me, though you have the Piction filename wrong; it's just solr_prod1.xml.

SimmoK commented 6 years ago

Wait a min, I'm confirming that those paths are correct; I see some conflicting doco on whether it's damsdata or dams_data. I'll confirm with Rick when he gets in soon.

SimmoK commented 6 years ago

I just confirmed with Rick that they are there with underscores, so the paths are /mnt/emu_data/ and /mnt/dams_data/

staplegun commented 6 years ago

OK, thanks, I'll change the ETL scripts. And you are right about the Piction file name, apologies. I've corrected the above comment.

f27wood commented 6 years ago

I cannot test this so moving it to done, based on your testing and knowledge of the issue.