Rework publishing

ghukill commented 5 years ago

Testing in the 1, 3, 6, and 12 million Record ranges has shown some scaling issues at various locations:

- counting metrics for mapped fields in ES
  - for Job, fix was to calculate once and then save under job_details
  - but what about Published Jobs, where Records are added/removed from the published index?
- selecting subsets of Records based on a Job's or Record Group's publish_set_id
  - fix was to move publish_set_id to Record level
  - this removes some purpose and functionality of publishing at Record Group level
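As a rough illustration of the "calculate once, then save under job_details" fix, here is a minimal sketch. It is not Combine's actual code: the `Job` class, the `mapped_field_counts` key, and the in-memory `records` list are hypothetical stand-ins for the real model and for an ES query.

```python
import json
from collections import Counter


class Job:
    """Hypothetical stand-in for a Job with a JSON-serialized job_details field."""

    def __init__(self, records):
        self.records = records      # list of dicts of mapped fields (stand-in for an ES index)
        self.job_details = "{}"     # JSON string, as a text column might store it
        self.count_calls = 0        # tracks how often the expensive count actually runs

    def _count_mapped_fields(self):
        # Expensive step: in practice this would be an aggregation over the Job's ES index.
        self.count_calls += 1
        counts = Counter()
        for rec in self.records:
            counts.update(k for k, v in rec.items() if v not in (None, "", []))
        return dict(counts)

    def mapped_field_counts(self):
        # Calculate once, save under job_details, and reuse the saved value afterwards.
        details = json.loads(self.job_details)
        if "mapped_field_counts" not in details:
            details["mapped_field_counts"] = self._count_mapped_fields()
            self.job_details = json.dumps(details)
        return details["mapped_field_counts"]
```

This works for a Job because its Records are fixed once it finishes. It does not answer the open question above: a shared published index changes whenever Jobs are published or unpublished, so a cached count there would need invalidating.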

Furthermore, I have an instance of Combine with very different types of Jobs / Records. Where the "Published" Records section used to feel like the place to view all Records that were on the way out, and had some affinity, it now feels artificial to have an ES index where they are mixed. If Jobs 1-4 are for purpose Foo, and Jobs 5-8 are for purpose Bar, why have an ES index combining the mapped fields of these completely unrelated Jobs? It doesn't scale logistically, and it doesn't scale conceptually.

All this suggests reworking publishing and, in many ways, simplifying it. The goal will be to publish at the Job level, not the RecordGroup level. Unsure if the JobPublish model, which formally united RecordGroup and Job, is still needed. It might be sufficient to just set a flag on the Job.

When a Job is published, a background task is kicked off that sets:

- publish_set_id for each Record (does Job need one?)
- published flag for Job

When that's done, it is easy to see which Jobs are published for a Record Group by looking for Jobs with published == True.
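The publish step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite database in place of Combine's MySQL tables, and the table names, column names, and the functions `make_db`, `publish_job`, `unpublish_job`, and `published_jobs` are all hypothetical. In the real application this work would run inside a background task.

```python
import sqlite3


def make_db():
    # Hypothetical minimal schema standing in for the Job and Record tables.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE job (
            id INTEGER PRIMARY KEY,
            record_group_id INTEGER,
            published INTEGER DEFAULT 0
        );
        CREATE TABLE record (
            id INTEGER PRIMARY KEY,
            job_id INTEGER,
            publish_set_id TEXT
        );
        """
    )
    return conn


def publish_job(conn, job_id, publish_set_id):
    # Set publish_set_id on every Record of the Job, then flag the Job itself.
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE record SET publish_set_id = ? WHERE job_id = ?",
            (publish_set_id, job_id),
        )
        conn.execute("UPDATE job SET published = 1 WHERE id = ?", (job_id,))


def unpublish_job(conn, job_id):
    # Reverse of publish_job: clear the Record-level set id and the Job flag.
    with conn:
        conn.execute("UPDATE record SET publish_set_id = NULL WHERE job_id = ?", (job_id,))
        conn.execute("UPDATE job SET published = 0 WHERE id = ?", (job_id,))


def published_jobs(conn, record_group_id):
    # Published Jobs for a Record Group are simply those with the flag set.
    rows = conn.execute(
        "SELECT id FROM job WHERE record_group_id = ? AND published = 1 ORDER BY id",
        (record_group_id,),
    )
    return [row[0] for row in rows]
```

With the flag on the Job and the set id on the Record, no separate model joining RecordGroup and Job is needed to answer "what is published here?".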

Then, for all published Jobs, do the same.

Published Records page will show:

- a similar Records table, as it will be easy to filter all Records by those that are published (or come from a published Job)
- no Mapped Fields / ES table, for the reasons outlined above
- OAI and flat export, which will function as before
  - it might still be possible to export flat, mapped fields of all Records from their own ES indices? Seem to recall the ability to pass a list and/or regex of indices
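On the last point: Elasticsearch does accept a comma-separated list of index names, or a wildcard pattern (rather than a full regex), as the target of a search request. A minimal sketch of building such a target follows. The `j<job_id>` index naming and the helper names `index_expression` and `search_path` are assumptions for illustration, not confirmed details of Combine.

```python
def index_expression(job_ids, prefix="j"):
    # Join one index name per published Job into a single comma-separated target.
    # The "j<job_id>" naming scheme is an assumption for this sketch.
    return ",".join(f"{prefix}{job_id}" for job_id in job_ids)


def search_path(job_ids=None, pattern=None):
    # Build the request path for a search across several indices at once.
    # Either pass explicit Job ids, or a wildcard pattern such as "j*".
    target = pattern if pattern is not None else index_expression(job_ids or [])
    if not target:
        raise ValueError("need at least one job id or an index pattern")
    return f"/{target}/_search"
```

For example, `search_path(job_ids=[5, 6, 7])` targets only the indices of Jobs 5-7, so a flat export could cover all published Jobs without a combined published index.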

Planning on removing PublishJobs entirely then, as being published or not will become a "state" of other Jobs. Might consider turning the Job blue if published, but not much beyond that.

ghukill commented 5 years ago

Progress:

- removed PublishJobs entirely, from core.models and core.spark.jobs
- moved publishing/unpublishing of Jobs to methods and background tasks

ghukill commented 5 years ago

Need to address unique_in_published field in Record

ghukill commented 5 years ago

Fix "published" column in Organization page

ghukill commented 5 years ago

Logistically, mostly complete.

What remains unresolved are slow MySQL queries for setting Records as published and setting their publish_set_id. For most Jobs, it's very quick. But for large Jobs, including at least one 3.5 million Record Job, it was prohibitively slow.

Keeping issue open.

ghukill commented 5 years ago

Needs updates to documentation...

ghukill commented 5 years ago

Done.

MI-DPLA / combine

Rework publishing #255