rnackman closed this issue 6 years ago
I think it should work like this:

- Nightly diffs
- Manual re-index

Let's ditch `nightlyValidate`, because what is it really getting us?
All sounds good to me.
If we want to keep some very basic validation to make sure things are written properly, we could just leave in place the part where nightlyValidate checks number of ES records against number of CSV rows. That's not an issue.
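The count check mentioned above could stay as a small standalone function. This is a minimal sketch, assuming a single CSV string with a header row; the real `nightlyValidate.js` reads the whole CSV set from disk and queries ES for its document count.

```javascript
// Count the data rows in a CSV (non-empty lines minus the header row).
function csvRowCount(csvText) {
  const lines = csvText.trim().split('\n').filter((line) => line.length > 0);
  return Math.max(lines.length - 1, 0);
}

// Basic validation: does the CSV row count match the ES document count?
function countsMatch(csvText, esDocCount) {
  return csvRowCount(csvText) === esDocCount;
}
```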
@micahwalter : Rundown on what I think is happening right now with our nightly cron jobs and ES updating.
First, we export all object data from TMS and store it as a set of CSVs. We store CSV sets for the past 15 days; our nightly script looks for `csv_` directories older than 15 days and deletes them before beginning the TMS export process. (So any object data that will be updated less frequently must be stored in directories that do not start with `csv_`.)

We look for the latest set of CSVs previously imported into Elasticsearch (i.e., yesterday's CSVs). If there are no previously imported CSVs, or if those CSVs are more than 15 days old, the ES index is rebuilt from scratch using today's CSVs.

Otherwise, we compare the two sets of CSVs and create a JSON diff file reflecting any changes. We use this JSON diff file to update the ES index. We also update the metadata for today's CSVs to indicate that they were successfully imported into ES.
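The diff step above can be sketched roughly as follows. This assumes each CSV row has already been parsed into an object with a unique `id` field (the real key name may differ); the output is the shape of JSON diff that would drive an incremental ES update.

```javascript
// Compare yesterday's rows to today's and build a diff of
// added, changed, and removed records, keyed by id.
function diffRows(oldRows, newRows) {
  const oldById = new Map(oldRows.map((row) => [row.id, row]));
  const newById = new Map(newRows.map((row) => [row.id, row]));
  const diff = { added: [], changed: [], removed: [] };
  for (const [id, row] of newById) {
    if (!oldById.has(id)) {
      diff.added.push(row);
    } else if (JSON.stringify(oldById.get(id)) !== JSON.stringify(row)) {
      // String comparison is fine here because both sets are parsed
      // from CSVs with the same column order.
      diff.changed.push(row);
    }
  }
  for (const [id, row] of oldById) {
    if (!newById.has(id)) diff.removed.push(row);
  }
  return diff;
}
```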
At this point, our ES index should contain a current copy of all TMS object data.
To confirm this, we run a nightly validation script (nightlyValidate.js) to check that today’s CSVs are properly reflected in ES. This script first compares the number of records in the CSV set to the number of records in the ES index. If that checks out, we do a deep object comparison between each row in the CSV and each document in the ES index. If we find any inequalities, our validation script fails.
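The deep comparison at the heart of that script is something along these lines (a sketch, not the actual `nightlyValidate.js` code). Key order shouldn't matter, so it recurses rather than comparing JSON strings. Note that the key-count check is exactly what makes documents with extra supplementary fields fail validation.

```javascript
// Recursively compare a CSV row object against an ES document.
// Any difference in keys or values counts as an inequality.
function deepEqual(a, b) {
  if (a === b) return true;
  if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) {
    return false;
  }
  const aKeys = Object.keys(a);
  const bKeys = Object.keys(b);
  if (aKeys.length !== bKeys.length) return false;
  return aKeys.every((key) => deepEqual(a[key], b[key]));
}
```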
A failed validation triggers a complete wipe of our ES index. We then rebuild the index using today’s set of CSVs, and we try to validate again.
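The wipe-and-rebuild fallback reduces to something like this sketch, where all three functions are injected stand-ins for the real scripts rather than our actual implementations:

```javascript
// Validate; on failure, wipe the index, rebuild from today's CSVs,
// and validate once more.
function validateWithRebuild(validate, wipeIndex, rebuildFromCsvs) {
  if (validate()) return 'ok';
  wipeIndex();
  rebuildFromCsvs();
  return validate() ? 'rebuilt' : 'failed';
}
```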
After the import and validation scripts are complete, we move on to supplementary data processing. Each script walks through today's CSVs, updating ES as it goes:
Once all nightly processes are complete, we have an ES index that contains current TMS object data as well as supplementary color and image data for each object.
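A supplementary pass has roughly this shape, sketched here with `computeColors` as a hypothetical stand-in for the real color-extraction step:

```javascript
// Walk today's rows and merge extra fields into each object's
// ES document. The merged documents now carry fields that the
// original CSVs do not.
function applySupplementary(rows, computeColors) {
  return rows.map((row) => ({ ...row, colors: computeColors(row) }));
}
```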
SO: this means that the next time our cron jobs run, the validation script will always fail, because the ES object documents now contain supplementary fields that the CSVs do not. We will end up wiping and rewriting our ES index every night, and processing and storing all the supplementary data again.
Ways forward: