rnackman closed this issue 6 years ago
I think it should work like this:

- Nightly diffs
- Manual re-index

Let's ditch `nightlyValidate`, because what is it really getting us?
All sounds good to me.
If we want to keep some very basic validation to make sure things are written properly, we could just leave in place the part where nightlyValidate checks number of ES records against number of CSV rows. That's not an issue.
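The count check mentioned above could stay as a small standalone function. This is a minimal sketch, assuming a single CSV string with a header row; the real `nightlyValidate.js` reads the whole CSV set from disk and queries ES for its document count.

```javascript
// Count the data rows in a CSV (non-empty lines minus the header row).
function csvRowCount(csvText) {
  const lines = csvText.trim().split('\n').filter((line) => line.length > 0);
  return Math.max(lines.length - 1, 0);
}

// Basic validation: does the CSV row count match the ES document count?
function countsMatch(csvText, esDocCount) {
  return csvRowCount(csvText) === esDocCount;
}
```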
@micahwalter : Rundown on what I think is happening right now with our nightly cron jobs and ES updating.
First, we export all object data from TMS and store it as a set of CSVs. We store CSV sets for the past 15 days; our nightly script looks for `csv_` directories older than 15 days and deletes them before beginning the TMS export process. (So any object data that will be updated less frequently must be stored in directories that do not start with `csv_`.)

We look for the latest set of CSVs previously imported into Elasticsearch (i.e., yesterday's CSVs). If there are no previously imported CSVs, or if those CSVs are more than 15 days old, the ES index is rebuilt from scratch using today's CSVs.

Otherwise, we compare the two sets of CSVs and create a JSON diff file reflecting any changes. We use this JSON diff file to update the ES index. We also update the metadata for today's CSVs to indicate that they were successfully imported into ES.
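The diff step above can be sketched roughly as follows. This assumes each CSV row has already been parsed into an object with a unique `id` field (the real key name may differ); the output is the shape of JSON diff that would drive an incremental ES update.

```javascript
// Compare yesterday's rows to today's and build a diff of
// added, changed, and removed records, keyed by id.
function diffRows(oldRows, newRows) {
  const oldById = new Map(oldRows.map((row) => [row.id, row]));
  const newById = new Map(newRows.map((row) => [row.id, row]));
  const diff = { added: [], changed: [], removed: [] };
  for (const [id, row] of newById) {
    if (!oldById.has(id)) {
      diff.added.push(row);
    } else if (JSON.stringify(oldById.get(id)) !== JSON.stringify(row)) {
      // String comparison is fine here because both sets are parsed
      // from CSVs with the same column order.
      diff.changed.push(row);
    }
  }
  for (const [id, row] of oldById) {
    if (!newById.has(id)) diff.removed.push(row);
  }
  return diff;
}
```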
At this point, our ES index should contain a current copy of all TMS object data.
To confirm this, we run a nightly validation script (nightlyValidate.js) to check that today’s CSVs are properly reflected in ES. This script first compares the number of records in the CSV set to the number of records in the ES index. If that checks out, we do a deep object comparison between each row in the CSV and each document in the ES index. If we find any inequalities, our validation script fails.
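The deep comparison at the heart of that script is something along these lines (a sketch, not the actual `nightlyValidate.js` code). Key order shouldn't matter, so it recurses rather than comparing JSON strings. Note that the key-count check is exactly what makes documents with extra supplementary fields fail validation.

```javascript
// Recursively compare a CSV row object against an ES document.
// Any difference in keys or values counts as an inequality.
function deepEqual(a, b) {
  if (a === b) return true;
  if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) {
    return false;
  }
  const aKeys = Object.keys(a);
  const bKeys = Object.keys(b);
  if (aKeys.length !== bKeys.length) return false;
  return aKeys.every((key) => deepEqual(a[key], b[key]));
}
```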
A failed validation triggers a complete wipe of our ES index. We then rebuild the index using today’s set of CSVs, and we try to validate again.
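The wipe-and-rebuild fallback reduces to something like this sketch, where all three functions are injected stand-ins for the real scripts rather than our actual implementations:

```javascript
// Validate; on failure, wipe the index, rebuild from today's CSVs,
// and validate once more.
function validateWithRebuild(validate, wipeIndex, rebuildFromCsvs) {
  if (validate()) return 'ok';
  wipeIndex();
  rebuildFromCsvs();
  return validate() ? 'rebuilt' : 'failed';
}
```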
After the import and validation scripts are complete, we move on to supplementary data processing. Each script walks through today's CSVs, updating ES as it goes:
Once all nightly processes are complete, we have an ES index that contains current TMS object data as well as supplementary color and image data for each object.
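A supplementary pass has roughly this shape, sketched here with `computeColors` as a hypothetical stand-in for the real color-extraction step:

```javascript
// Walk today's rows and merge extra fields into each object's
// ES document. The merged documents now carry fields that the
// original CSVs do not.
function applySupplementary(rows, computeColors) {
  return rows.map((row) => ({ ...row, colors: computeColors(row) }));
}
```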
SO: this means that the next time our cron jobs run, the validation script will always fail, because the ES object documents now contain supplementary fields that the CSVs do not. We will end up wiping and rewriting our ES index every night, and processing and storing all the supplementary data again.
Ways forward: