Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License

User Data Validation Improvement #580

Closed hepcat72 closed 1 year ago

hepcat72 commented 2 years ago

FEATURE REQUEST

Inspiration

From learning about cross-database relation exceptions and the difficulty of using multiple databases (i.e. the validation database) to handle load scripts with side effects, we discovered we have a choice to make about how to validate data without committing it to the database.

Manually including .using(db) clauses introduces undesirable overhead, and turns out to have challenges with instance methods that make new queries (e.g. any method that makes a call like Model.objects...). Those fresh queries go to the default database even though the object whose method is running was retrieved from the validation database.

In learning how to handle this, it was discovered that you can override Django's default "database router" by creating your own derived class that implements its 4 methods (db_for_read, db_for_write, allow_relation, and allow_migrate). This would make all the .using(db) clauses unnecessary; all you would have to do is tell the router to direct queries to the validation database.
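For reference, the shape of such a router could be sketched like this. This is a minimal sketch with hypothetical names (the module-level flag and class name are assumptions), though the four method names are the ones Django's router API actually calls. Note that a module-level flag like this is exactly the "not process-specific" concern raised later in the thread:

```python
# Hypothetical flag that the validation view would toggle.  As discussed
# below, this is NOT process-specific, which is the core problem.
VALIDATION_MODE = False


class ValidationRouter:
    """Route every query to the validation database while the flag is set."""

    def _target(self):
        return "validation" if VALIDATION_MODE else "default"

    def db_for_read(self, model, **hints):
        return self._target()

    def db_for_write(self, model, **hints):
        return self._target()

    def allow_relation(self, obj1, obj2, **hints):
        # Returning None defers to Django's default behavior, which only
        # allows relations within a single database (avoiding the
        # cross-database relation exceptions that motivated this issue).
        return None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        return None
```

In settings, the router would be registered via `DATABASE_ROUTERS`; the sketch exists mainly to show why a global switch can leak across concurrent requests.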

However, the reasons we're using a second database in the first place are two-fold:

There's also the issue of the load scripts stopping mid-load, forcing the user to run the validation iteratively to learn about each issue one by one, so either way, the load scripts need to be refactored.

Description

The choices for implementation are:

  1. Implement a database router (and remove all the .using(db) clauses and any customized code to work around other issues, like calls to the method in MultiDBMixin [in a coming quick-fix PR])
  2. Refactor the load scripts to not have database side-effects, then refactor the validation view to wrap all the load scripts in an atomic transaction
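For option 2, the wrap-everything-and-roll-back pattern can be sketched with a plain sqlite3 connection standing in for Django's `transaction.atomic` (the function and exception names here are hypothetical):

```python
import sqlite3


class DryRun(Exception):
    """Raised after all loads succeed, purely to force a rollback."""


def validate_loads(conn, load_functions):
    # Run every loader inside one transaction, then roll everything back
    # so validation never commits anything to the database.
    try:
        with conn:  # commits on success, rolls back on exception
            for load in load_functions:
                load(conn)
            raise DryRun()
    except DryRun:
        return True
```

In the Django version this would be `with transaction.atomic():` wrapped around all the load-script calls in the validation view, with the deliberate exception triggering the rollback, which only works once the load scripts have no side effects outside the transaction.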

Alternatives

See the description. A choice needs to be made before this issue can proceed.

Dependencies

none

Comment

A PR with a quick fix is coming. There is a Stack Overflow post that goes into more detail about the cross-database relations and possible solutions.

The database router solution should make it unnecessary to guard calls to full_clean with a conditional so that they only run when the database is the default database.


ISSUE OWNER SECTION

Assumptions

none

Requirements

This will be broken down into phases per load script to produce PRs that are more readily consumable. Each phase should be accompanied by new and all-working tests.

Limitations

none

Affected Components

DESIGN

Interface Change description

No changes to the user experience other than receiving more errors in one run on the validation page. The --database and --validate command-line options will go away, --debug will change to --dryrun, and an option to disable auto-update will be added.

Code Change Description

The load scripts will be changed to buffer all errors in a self.errors list attribute. Each model's load will be encapsulated in a method with required inputs (some of which may depend on a successful prior model load). If a required input comes from a previously failed model load, that load is skipped so as not to emit useless errors. There will be a new requirement to report unrecognized (optional) headers as an error.
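The buffer-and-skip logic described above could be sketched like this (hypothetical class and method names; the real loaders are considerably more involved):

```python
class BufferedLoader:
    """Sketch of a load script that buffers errors instead of dying on
    the first one, and skips loads whose prerequisites failed."""

    def __init__(self):
        self.errors = []   # every exception raised by any model load
        self.failed = set()  # names of model loads that raised

    def load_model(self, name, load_func, requires=()):
        if self.failed.intersection(requires):
            # A prerequisite load failed; skip this load rather than
            # emitting errors that are just side effects of it.
            return
        try:
            load_func()
        except Exception as e:
            self.errors.append(e)
            self.failed.add(name)
```

The point of the pattern is that one run reports every independent error, while dependent loads (e.g. animals needing protocols) stay quiet when their prerequisite already failed.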

Tests

hepcat72 commented 2 years ago

I've been reading about Django's database routers this morning on Stack Overflow and in the Django docs:

It looks like the database router class may not be well suited to having a process temporarily use a different database; it's designed for having different models use different databases. I considered using either a class variable or an instance variable to switch databases, but a class variable would not be restricted to just the validation view's load-script runs, and an instance variable would have to be set on every model object involved (which is back to the same annoying need for .using(db) everywhere).

So I'm leaning toward a refactor to accomplish a few things:

I have a post in the Django forum asking whether routers can handle our use case, because this refactor would be a much larger undertaking in comparison; if the database router could fix the issues with multiple databases, it would be a relatively small effort. Someone replied while I was composing this comment. He said you can accomplish the goal of a separate database, but by launching a separate task (e.g. using Celery) that is configured to use a different database, and he confirmed my doubts that a database router could do this. Knowing how much overhead is associated with tasks/Celery, I think we should go the refactor/atomic-transaction route.

hepcat72 commented 2 years ago

The separate-database strategy could still work, but it would require that the load scripts not call model methods that make fresh queries... And I also don't know how to deal with full_clean. Either way, a refactor is necessary to at least better consolidate errors for users.

jcmatese commented 2 years ago

Definitely leaning toward a refactor that removes the extra validation database, if the router is not process-specific. It fills me with unease that I don't know which database I might be addressing (or that it could switch depending on the context).

hepcat72 commented 2 years ago

Yeah, I have similar concerns. That's why I edited my latest PR to comment out the link to the validation page. Though the guy on the Django forum did say that you can make it process-specific by launching a new process with a separate settings file. That would certainly be quicker than a refactor, but I feel like there are other advantages to a refactor as well, so I started trying out a small refactor to at least buffer "some" errors in the sample loader.

hepcat72 commented 2 years ago

Oh, it looks like I said that already. But he did say you can just make it a simple child process. Celery is for process communication, which this wouldn't need.
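The child-process idea from the forum could be sketched like this. The settings module name below is an assumption for illustration, not the project's actual settings path; the mechanism is just that the child inherits an environment pointing Django at different settings (and thus a different default database):

```python
import os
import subprocess
import sys


def run_in_validation_process(cmd_args):
    # Launch a child Python process whose DJANGO_SETTINGS_MODULE points
    # at a settings file that makes the validation DB the default.
    # "TraceBase.settings_validation" is a hypothetical module name.
    env = dict(os.environ, DJANGO_SETTINGS_MODULE="TraceBase.settings_validation")
    return subprocess.run(
        [sys.executable, *cmd_args],
        env=env,
        capture_output=True,
        text=True,
    )
```

A real invocation would pass something like `["manage.py", "load_study", ...]`; since only the environment differs, no Celery-style inter-process communication is needed.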

hepcat72 commented 1 year ago

Incidentally, regarding the "proper tracebacks" item in the TODO list above, I gained insight into tracebacks this morning that I didn't have before. A traceback isn't created when the exception object is constructed, nor at the moment it is raised, nor when it is caught. It is built from the bottom up as the exception travels up the stack, so it stops being built when the exception is caught.

I guess I never really thought about it before. To me, it was a black box. I think I implicitly assumed that when the exception object was constructed, it captured the whole traceback, and that catching an exception sliced out the portion between the raise point and the catch. I never scrutinized that assumption; I don't think I even knew I was making it. I think I was applying my knowledge of Perl's caller() function, which I understood (or assumed) to keep a constant record of the call stack.

I gained this insight thanks to the person who answered my Stack Overflow question about why I wasn't getting the full traceback when I buffered an exception. He was spot-on in identifying the implicit assumption I was making.

hepcat72 commented 1 year ago

After running the new cold exposure data through the dev validate interface in my sandbox, I have a few things I want to address. Perhaps I'll do this in phase 4.5...

A few things that were not reported that I noticed:

CommandError: 1 errors loading protocol records from /var/folders/hb/237l358561nbh3zh5fpjwccm0000gn/T/tmpsfyxahv_.upload.xlsx - NO RECORDS SAVED: ValidationError in the default database on data row 1, creating animal_treatment record for protocol 'no treatment' with description 'No treatment was applied to the animal. Animal was housed at room temperature with a normal light cycle.': ["Protocol with name = 'no treatment' but a different description already exists: Existing description = 'no manipulation besides what is already described in other fields' New description = 'No treatment was applied to the animal. Animal was housed at room temperature with a normal light cycle.'"] ... DoesNotExist: Could not find 'animal_treatment' protocol with name 'cold 4C acute' DoesNotExist: Could not find 'animal_treatment' protocol with name 'cold 4C acute' DoesNotExist: Could not find 'animal_treatment' protocol with name 'thermoneutrality chronic' DoesNotExist: Could not find 'animal_treatment' protocol with name 'thermoneutrality chronic' DoesNotExist: Could not find 'animal_treatment' protocol with name 'cold 6C chronic' DoesNotExist: Could not find 'animal_treatment' protocol with name 'cold 6C chronic' DoesNotExist: Could not find 'animal_treatment' protocol with name 'thermoneutrality chronic'

  • [x] I should suggest submitting as an isocorr file here: FAILED: 13C-glycerol metabolomics_cor.xlsx CorrectedCompoundHeaderMissing: Compound header [Compound] not found in the accucor corrected data. Did you forget to provide --isocorr-format?

FAILED: col005c_neg_lowmz_corrected.xlsx MissingSamplesError: 3 samples are missing in the database: [col005c_HilicBlank2_scan1, col005c_HilicBlank3_scan1, col005c_LipidBlank_scan1]. Samples must be loaded prior to loading mass spec data.

FAILED: col005c_neg_lowmz_others_corrected.xlsx AssertionError: 21 compounds are missing.

FAILED: col005c_plasma_hilic_corrected.xlsx DupeCompoundIsotopeCombos: The following duplicate compound/isotope combinations were found in the corrected data: [lactate & 0 on rows: 91,95; lactate & 1 on rows: 92,96; lactate & 2 on rows: 93,97; lactate & 3 on rows: 94,98] ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']}

FAILED: col005d_tissue WAT_corrected.xlsx DupeCompoundIsotopeCombos: The following duplicate compound/isotope combinations were found in the corrected data: [lactate & 0 on rows: 148,152; lactate & 1 on rows: 149,153; lactate & 2 on rows: 150,154; lactate & 3 on rows: 151,155] TypeError: argument of type 'NoneType' is not iterable

FAILED: col005e2_glucose tissues_f16bp_corrected.xlsx ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ...

STATUS: FAILURE col005c_neg_lowmz_corrected.xlsx AggregatedErrors Summary (1 errors / 0 warnings): EXCEPTION1(ERROR): TypeError: argument of type 'NoneType' is not iterable

The above caught exception had a partial traceback: Traceback (most recent call last): File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 927, in load_data peak_data_label.full_clean() File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 1238, in full_clean raise ValidationError(errors) django.core.exceptions.ValidationError: {'count': ['Ensure this value is less than or equal to 20.']}

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 1217, in load_accucor_data self.load_data() File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 932, in load_data if self.is_a_downstream_dupe_error( File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 1038, in is_a_downstream_dupe_error if row in self.dupe_isotope_rows[sheet] and ( TypeError: argument of type 'NoneType' is not iterable TypeError: argument of type 'NoneType' is not iterable

STATUS: FAILURE col005c_neg_lowmz_corrected.xlsx

AggregatedErrors Summary (1 errors / 0 warnings): EXCEPTION1(ERROR): TypeError: argument of type 'NoneType' is not iterable Buffered Error: File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/manage.py", line 22, in main() File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/manage.py", line 18, in main execute_from_command_line(sys.argv) File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/management/commands/load_study.py", line 233, in handle call_command( File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/management/commands/load_accucor_msruns.py", line 142, in handle loader.load_accucor_data() File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 1233, in load_accucor_data self.aggregated_errors_object.buffer_error(e) File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/exceptions.py", line 341, in buffer_error self.buffer_exception(exception, is_error=True, is_fatal=is_fatal) File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/exceptions.py", line 304, in buffer_exception buffered_tb_str = self.get_buffered_traceback_string() File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/exceptions.py", line 288, in get_buffered_traceback_string for step in traceback.format_stack()

The above caught exception had a partial traceback: Traceback (most recent call last): File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 927, in load_data peak_data_label.full_clean() File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 1238, in full_clean raise ValidationError(errors) django.core.exceptions.ValidationError: {'count': ['Ensure this value is less than or equal to 20.']}

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 1217, in load_accucor_data self.load_data() File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 932, in load_data if self.is_a_downstream_dupe_error( File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 1038, in is_a_downstream_dupe_error if row in self.dupe_isotope_rows[sheet] and ( TypeError: argument of type 'NoneType' is not iterable TypeError: argument of type 'NoneType' is not iterable

RequiredValuesError: Missing required values have been detected in the following columns: Researcher Name Collection Time Study Name Animal ID Animal Body Weight Age Sex Animal Genotype Feeding Status Diet Infusate Infusion Rate Tracer Concentrations Sample Name Note, entirely empty rows are allowed, but having a single value on a row in one sheet can cause a duplication of empty rows, so be sure you don't have stray single values in a sheet.

hepcat72 commented 1 year ago

So it turns out that MAX_LABELED_ATOMS has to be at least the number of atoms of the labeled element in the compound with the most such atoms. In the cold exposure data, that was fatty acid C34:1, which has the formula C34H66O2 (34 carbons, exceeding the current limit of 20). So I should:
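A quick sketch of the kind of check involved (a hypothetical helper, not the project's code): counting the labeled element's atoms in a formula shows why C34H66O2 blows past a limit of 20.

```python
import re


def atom_count(formula, element):
    """Count atoms of `element` in a simple molecular formula string.

    Matches the element symbol followed by an optional count, rejecting
    partial matches of two-letter symbols (e.g. "C" must not match "Cl").
    """
    m = re.search(rf"{element}(\d*)(?![a-z])", formula)
    if not m:
        return 0
    return int(m.group(1) or 1)  # no digits means a count of 1
```

For a carbon tracer, `atom_count("C34H66O2", "C")` gives 34, so a validation bound like MAX_LABELED_ATOMS would need to be at least the largest such count across the study's compounds.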

hepcat72 commented 1 year ago

All of the todo items in the comment above are now complete. There remain a couple of housekeeping items I want to keep in mind as I move on to phase 5:

Validation page improvements 5:

Next level validation improvements

hepcat72 commented 1 year ago

New changes in #678 were previously overlooked. Re-opening.