BarnesFoundation / barnes-tms-extract

Barnes Foundation Collection Website
GNU General Public License v3.0

Update nightly cron jobs & ES import scripts #22

Closed by rnackman 6 years ago

rnackman commented 7 years ago

@micahwalter: Here's a rundown of what I think is happening right now with our nightly cron jobs and ES updating.

At this point, our ES index should contain a current copy of all TMS object data.

Once all nightly processes are complete, we have an ES index that contains current TMS object data as well as supplementary color and image data for each object.

So: the next time our cron jobs run, the validation script will fail because of the additional fields in the ES object documents, and it will keep failing on every run after that. We will end up wiping and rewriting our ES index every night, and reprocessing and storing all the supplementary data again.
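To illustrate the failure mode, here is a minimal sketch. It assumes the validation does a strict document-to-row comparison; the actual nightlyValidate logic is not shown in this thread. The two helper functions, the `color` field name, and the sample record are all hypothetical.

```python
def strict_match(es_doc, csv_row):
    """Strict comparison: any extra field in the ES document counts as a mismatch."""
    return es_doc == csv_row


def tms_fields_match(es_doc, csv_row):
    """Compare only the fields that come from the CSV export, ignoring extras."""
    return all(es_doc.get(key) == value for key, value in csv_row.items())


# Made-up sample data, not taken from the real collection.
csv_row = {"id": "101", "title": "Example object"}
# After the nightly image and color processing, the ES document has an extra field.
es_doc = {"id": "101", "title": "Example object", "color": "#a0522d"}

print(strict_match(es_doc, csv_row))      # False
print(tms_fields_match(es_doc, csv_row))  # True
```

Under that assumption, every document that has been enriched with color or image data looks "wrong" to a strict check, even though none of its TMS data has changed.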

Ways forward:

micahwalter commented 7 years ago

I think it should work like this:

  1. Nightly diffs

    • Data is exported from TMS to CSV on a nightly basis. (This is already happening.)
    • Data in the latest CSV is compared to data in last night's CSV and a diff file is created. (Also already happening.)
    • TMS data in each record in the diff file gets updated in ES ( leaving all additional data intact )
    • Any records in the diff file should get images reprocessed along with color data, and that data should be updated in ES.
  2. Manual re-index

    • The scripts should be written so that they can be run manually to blow away ES and re-index everything from scratch.
  3. Let's ditch the nightlyValidate, because what is it really getting us?
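The nightly-diff flow in item 1 could be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's actual scripts: the function names, the `id` column, and the `collection` index name are hypothetical. It relies on the Elasticsearch bulk API's `update` action, where a partial `doc` is merged into the stored document, so fields that are not listed (color and image data) stay intact. Depending on the ES version, the action metadata may also need a `_type`. Deleted records are not handled here.

```python
import csv
import io
import json


def load_rows(csv_text, id_field="id"):
    """Index the rows of a CSV export by their id column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_field]: row for row in reader}


def diff_rows(previous, latest):
    """Return the rows that are new or changed since the previous export."""
    return {
        object_id: row
        for object_id, row in latest.items()
        if previous.get(object_id) != row
    }


def bulk_update_lines(changed, index="collection"):
    """Build a newline-delimited bulk body that partially updates each record.

    Only the TMS fields from the CSV row are sent; "doc_as_upsert" creates
    the document if it does not exist yet.
    """
    lines = []
    for object_id, row in changed.items():
        lines.append(json.dumps({"update": {"_index": index, "_id": object_id}}))
        lines.append(json.dumps({"doc": row, "doc_as_upsert": True}))
    return "".join(line + "\n" for line in lines)


# Made-up sample exports.
previous_csv = "id,title\n1,Old title\n2,Unchanged\n"
latest_csv = "id,title\n1,New title\n2,Unchanged\n3,Added\n"

changed = diff_rows(load_rows(previous_csv), load_rows(latest_csv))
print(sorted(changed))  # ['1', '3']
print(bulk_update_lines(changed))
```

The keys of `changed` double as the list of objects whose images and color data need reprocessing. A manual re-index (item 2) could reuse the same functions by passing an empty dictionary as `previous`, so that every row counts as changed.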

rnackman commented 7 years ago

All sounds good to me.

If we want to keep some very basic validation to make sure things are written properly, we could just leave in place the part where nightlyValidate checks the number of ES records against the number of CSV rows. That check isn't affected by the extra fields, so it's not an issue.
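A count-only check along those lines might look like the sketch below. The function names are hypothetical, and `es_count` is assumed to have been fetched separately, for example from the index's `_count` endpoint.

```python
import csv
import io


def count_csv_rows(csv_text):
    """Count the data rows in a CSV export, excluding the header row."""
    return sum(1 for _ in csv.DictReader(io.StringIO(csv_text)))


def counts_match(csv_text, es_count):
    """Return True when the ES document count equals the CSV row count."""
    return count_csv_rows(csv_text) == es_count


# Made-up sample export with two records.
sample_csv = "id,title\n1,First\n2,Second\n"
print(counts_match(sample_csv, 2))  # True
print(counts_match(sample_csv, 3))  # False
```

Because it only compares totals, this check ignores the supplementary fields entirely, but it also won't catch a record whose contents were written incorrectly.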