MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform

Job on rerun with invalid records might remain valid as a Job #379

Closed · ghukill closed 5 years ago

ghukill commented 5 years ago

Example: an OAI job is run with an incorrect set name, resulting in zero records. But on updating the Job params and re-running, while validation failures are recorded for the Job, the Job as a whole is still considered valid.

antmoth commented 5 years ago

Going to the validation tab and selecting 'run new validations for this job' will update the validity status. We need to find out whether this bug happens only when the job has zero records, or whether it can also happen when there are already some records.

antmoth commented 5 years ago
  1. Run job in such a way that it ends up with 0 records: shows as valid
  2. Re-run job in such a way that it ends up with > 0 invalid records: shows as valid
  3. Re-run job so that it continues to have > 0 invalid records: shows as valid
antmoth commented 5 years ago

I think the root cause here is that re-running the job doesn't force a recount of the ~~records~~ validation results, which seems obviously incorrect.

I thought I might be able to fix it by wiping out ~~job_details_dict['detailed_record_count']~~ job_details_dict['validation_results'] in prepare_for_rerunning. This seemed like a brute-force solution to a structural problem, although I can't say that I grok the structural problem and therefore I probably shouldn't refactor it in haste.

This conclusion is supported by the fact that the change seems not to have worked. Any ideas?
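
For reference, what I tried was roughly the following. This is only a sketch: it assumes job_details is stored as a JSON string on the Job model, and it writes prepare_for_rerunning as a standalone function for illustration, which may not match the actual code.

```python
import json

def prepare_for_rerunning(job):
    """Sketch of the attempted fix: drop the cached validation results
    so that loading Job details is forced to recalculate them.
    Assumes job_details is a JSON string field on the Job model."""
    job_details_dict = json.loads(job.job_details or '{}')
    job_details_dict.pop('validation_results', None)  # no-op if key absent
    job.job_details = json.dumps(job_details_dict)
    job.save()
```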

antmoth commented 5 years ago

It looks like we are recording the job validation failure counts in two ways: on the JobValidation and in the job_details_dict. That might be part of the issue?

ghukill commented 5 years ago

After some chatting and looking, it looks as though validation_results is not getting removed from a Job's job_details JSON when re-running. This prevents a re-calculation of validation results when the Job details load, which in turn prevents any indication that the Job is now invalid.

The solution is probably to dump validation_results from job_details on Job re-run, much like it is dropped when running new validations (e.g. https://github.com/MI-DPLA/combine/blob/a6de141440bebcc00b91500b30b653e9f475c6d6/core/tasks.py#L961)

antmoth commented 5 years ago

We probably need to dump the failure counts in the same way, since they're cached similarly.
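
Sketched out, the re-run cleanup might look something like this (only validation_results is confirmed above; the failure-count key name is a guess):

```python
import json

# Keys in job_details that go stale on re-run. validation_results is
# named above; the failure-count key name is a guess at the second cache.
STALE_ON_RERUN = ['validation_results', 'validation_failure_count']

def drop_stale_caches(job):
    job_details_dict = json.loads(job.job_details or '{}')
    for key in STALE_ON_RERUN:
        job_details_dict.pop(key, None)
    job.job_details = json.dumps(job_details_dict)
    job.save()
```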

ghukill commented 5 years ago

Believe that. I wonder if this calls for a method on Job (or CombineJob), something along the lines of clear_job_details_caches, that would clear these?

Unfortunately, we don't want to drop job_details entirely, obviously. But we could probably add to the list of keys in that JSON object that should be cleared.

The only problem is that dropping all caches isn't always necessary. If running new validations, there's no need to drop calculated field metrics, and vice versa. But maybe the performance hit of regenerating those would be worth the simplicity of dropping all cached results?
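
A rough sketch of that method (only validation_results and detailed_record_count appear earlier in this thread; field_metrics is a placeholder for similar caches):

```python
import json

# Hypothetical registry of cached keys in job_details. Only
# validation_results and detailed_record_count are named in this thread;
# field_metrics stands in for other expensive calculations.
JOB_DETAILS_CACHE_KEYS = [
    'validation_results',
    'detailed_record_count',
    'field_metrics',
]

def clear_job_details_caches(job):
    """Drop every cached, recalculable key from job_details while
    leaving the params needed for re-running intact."""
    details = json.loads(job.job_details or '{}')
    for key in JOB_DETAILS_CACHE_KEYS:
        details.pop(key, None)
    job.job_details = json.dumps(details)
    job.save()
```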

ghukill commented 5 years ago

Hate to even suggest it, since it would require some reworking, but we could stash all of these in a key called something like calculated_caches in job_details, which could then be dropped wholesale? Just thinking out loud here.

job_details grew organically as features were added, and it's gotten a bit cumbersome. It's terribly helpful to store these details, which allow for job re-running and cache expensive calculations, but it might need refactoring.
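
Roughly what I'm picturing, with illustrative key names:

```python
# Sketch of the reworked layout: everything recalculable lives under a
# single calculated_caches key, separated from the params needed for
# re-running. Key names inside are illustrative.
job_details = {
    # params needed to re-run the job: keep these across re-runs
    'oai_params': {'endpoint': 'https://example.org/oai', 'set': 'some_set'},
    # expensive, recalculable results: safe to drop as a unit
    'calculated_caches': {
        'validation_results': {},
        'detailed_record_count': 0,
        'field_metrics': {},
    },
}

# a re-run (or any invalidating update) can then drop them wholesale
job_details.pop('calculated_caches', None)
```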

antmoth commented 5 years ago

I think that clearing all the caches on job re-run is a small price to pay for (a) getting proper update behavior and (b) not needing to re-do expensive calculations on job display, which seems to me to be what we really want to avoid.

ghukill commented 5 years ago

Agreed that clearing all caches in favor of accurate update behavior comes first and foremost. And I think you're right: clearing all caches on a full Job re-run, but clearing more selectively for individual updates (like adding/removing validations, or re-indexing), would let those more surgical tasks recalculate only where it makes sense.
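
To sketch that out (group and key names other than validation_results are hypothetical):

```python
import json

# Sketch of selective clearing: a full re-run drops everything, while
# surgical tasks name just the cache groups they invalidate.
CACHE_GROUPS = {
    'validation': ['validation_results'],
    'records': ['detailed_record_count'],
    'field_metrics': ['field_metrics'],
}

def clear_job_details_caches(job, groups=None):
    """Clear all cached results (groups=None) or just the named groups."""
    groups = CACHE_GROUPS if groups is None else groups
    details = json.loads(job.job_details or '{}')
    for group in groups:
        for key in CACHE_GROUPS.get(group, []):
            details.pop(key, None)
    job.job_details = json.dumps(details)
    job.save()

# full re-run: clear_job_details_caches(job)
# new validations only: clear_job_details_caches(job, groups=['validation'])
```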