Going to the validation tab and selecting 'run new validations for this job' will update the validity status. Need to find out whether this bug happens only when the job has zero records, or whether it can also happen when there are already some records.
I think the root cause here is that re-running the job doesn't force a recount of the ~~records~~ validation results, which seems obviously incorrect.
I thought I might be able to fix it by wiping out ~~`job_details_dict['detailed_record_count']`~~ `job_details_dict['validation_results']` in `prepare_for_rerunning`. This seemed like a brute-force solution to a structural problem, although I can't say that I grok the structural problem and therefore I probably shouldn't refactor it in haste.
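Roughly, the attempted change looked like this (a sketch only, not the verbatim diff; the shape of `prepare_for_rerunning` and the `job_details_dict` accessor are assumed):

```python
def prepare_for_rerunning(self):
    # attempted fix: drop the cached validation results from job_details
    # so they would be recalculated on the next Job details load
    job_details_dict = self.job_details_dict          # assumed parsed-JSON accessor
    job_details_dict.pop('validation_results', None)  # no-op if the key is absent
    # ...rest of the existing re-run preparation...
```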
This conclusion is supported by the fact that the change seems not to have worked. Any ideas?
It looks like we are recording the job validation failure counts in two ways: on the `JobValidation` and in the `job_details_dict`. That might be part of the issue?
After some chatting and looking, it looks as though `validation_results` is not getting removed from a Job's `job_details` JSON when re-running. This prevents a re-calculation of validation results when the Job details load, which in turn prevents any indication that the Job is now invalid.
Solution is probably to dump `validation_results` from `job_details` on Job re-run, much like they are dropped when running new validations (e.g. https://github.com/MI-DPLA/combine/blob/a6de141440bebcc00b91500b30b653e9f475c6d6/core/tasks.py#L961).
We probably need to dump the failure counts in the same way, since they're cached similarly.
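A minimal sketch of that, assuming the re-run task has the Job in hand and that `job_details` is stored as JSON text (the failure-count key name below is a placeholder, not the actual key):

```python
import json

def drop_validation_caches(job):
    """
    Sketch: on Job re-run, drop the cached validation keys from job_details,
    mirroring what the new-validation task already does, so they are
    recalculated the next time Job details are loaded.
    """
    job_details = json.loads(job.job_details or '{}')
    job_details.pop('validation_results', None)
    job_details.pop('failure_count', None)   # placeholder name for the cached failure counts
    job.job_details = json.dumps(job_details)
    job.save()
```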
Believe that. I wonder if this supports a method for Job (or CombineJob), something along the lines of `clear_job_details_caches`, that would clear these?
Unfortunately, we don't want to drop `job_details` entirely, obviously. But we could probably add to the list of keys in that JSON object that should be cleared.
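Something along these lines, as a sketch only (the accessor and persistence helpers, and the key names besides `validation_results`, are assumptions):

```python
# keys in job_details that hold derived values and are safe to clear
CLEARABLE_JOB_DETAIL_KEYS = [
    'validation_results',
    'failure_count',   # placeholder name for cached failure counts
    'field_counts',    # placeholder name for cached field metrics
]

class CombineJob:
    # ...existing attributes and methods...

    def clear_job_details_caches(self, keys=CLEARABLE_JOB_DETAIL_KEYS):
        """Drop cached, derived keys from job_details so they get recalculated."""
        job_details = self.job.job_details_dict      # assumed parsed-JSON accessor
        for key in keys:
            job_details.pop(key, None)
        self.job.update_job_details(job_details)     # assumed persistence helper
```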
The only problem is that dropping all caches isn't always necessary. If running new validations, we don't need to drop calculated field metrics, and vice versa. But maybe the performance hit of regenerating those would be worth the simplicity of dropping all cached results?
Hate to even suggest it, would require some reworking, but could stash all these in a key called something like `calculated_caches` in `job_details` that could then be dropped wholesale? Just thinking out loud here.
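For illustration, the kind of layout that would make that possible (key names and values here are hypothetical):

```python
# hypothetical job_details layout: everything derived lives under one key
job_details = {
    # inputs needed to re-run the Job (kept across re-runs)
    'job_params': {'oai_endpoint': 'https://example.org/oai', 'set': 'some_set'},
    # derived/expensive values, safe to drop wholesale
    'calculated_caches': {
        'validation_results': {'verified': False, 'failure_count': 12},
        'detailed_record_count': 1250,
        'field_counts': {'mods:titleInfo': 1250},
    },
}

# on re-run (or any update that invalidates the caches):
job_details.pop('calculated_caches', None)
```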
`job_details` grew organically as features were added, and it's gotten a bit cumbersome. It's terribly helpful to store these details, which allow for job re-running and cache expensive calculations, but it might need a refactoring.
I think that clearing all the caches on job re-run is a small price to pay for (a) getting proper update behavior and (b) not needing to re-do expensive calculations on job display, which seems to me to be what we really want to avoid.
Agreed that clearing all caches in favor of accurate update behavior is first and foremost. And I think you're right: clearing all caches for a Job re-run only, but perhaps clearing more selectively for individual updates (like adding/removing validations, or re-indexing), would allow those more surgical tasks to recalculate only where it makes sense.
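One way that selectivity could be sketched (task names and cache-key names are illustrative, not from the codebase):

```python
# map each kind of update to the cached keys it actually invalidates
CACHE_KEYS_BY_TASK = {
    'job_rerun':         ['validation_results', 'failure_count', 'field_counts'],  # drop everything
    'new_validations':   ['validation_results', 'failure_count'],
    'remove_validation': ['validation_results', 'failure_count'],
    'reindex':           ['field_counts'],
}

def clear_caches_for_task(job_details, task):
    """Drop only the cached keys invalidated by the given task."""
    for key in CACHE_KEYS_BY_TASK.get(task, []):
        job_details.pop(key, None)
    return job_details
```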
Example: an OAI job is run with an incorrect set name, resulting in zero records. But on updating the Job params and re-running, while validation failures are accrued for the Job, the Job as a whole is still considered valid.