Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License

User Data Validation Improvement #580

Closed hepcat72 closed 1 year ago

hepcat72 commented 2 years ago

FEATURE REQUEST

Inspiration

From learning about cross-database relation exceptions and the difficulty of using multiple databases (i.e. the validation database) to handle load scripts with side effects, we discovered we have a choice to make about how to validate data without committing it to the database.

Manually including .using(db) clauses introduces undesirable overhead, and turns out to have challenges with instance methods that make new queries (e.g. any method that makes a call like Model.objects...). Those fresh queries go to the default database even though the object whose method is running was retrieved from the validation database.

In learning how to handle this, it was discovered that you can override Django's default "database router" by creating your own derived class that implements its 4 methods (db_for_read, db_for_write, allow_relation, and allow_migrate). This would make all the .using(db) clauses unnecessary; all you would have to do is tell the router to direct queries to the validation database.
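For reference, the shape of such a router could be sketched like this. This is a minimal sketch with hypothetical names (the module-level flag and class name are assumptions), though the four method names are the ones Django's router API actually calls. Note that a module-level flag like this is exactly the "not process-specific" concern raised later in the thread:

```python
# Hypothetical flag that the validation view would toggle.  As discussed
# below, this is NOT process-specific, which is the core problem.
VALIDATION_MODE = False


class ValidationRouter:
    """Route every query to the validation database while the flag is set."""

    def _target(self):
        return "validation" if VALIDATION_MODE else "default"

    def db_for_read(self, model, **hints):
        return self._target()

    def db_for_write(self, model, **hints):
        return self._target()

    def allow_relation(self, obj1, obj2, **hints):
        # Returning None defers to Django's default behavior, which only
        # allows relations within a single database (avoiding the
        # cross-database relation exceptions that motivated this issue).
        return None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        return None
```

In settings, the router would be registered via `DATABASE_ROUTERS`; the sketch exists mainly to show why a global switch can leak across concurrent requests.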

However, the reasons we're using a second database in the first place are two-fold:

There's also the issue of the load scripts stopping mid-load, forcing the user to run the validation iteratively to learn about each issue one by one, so either way, the load scripts need to be refactored.

Description

The choices for implementation are:

  1. Implement a database router (and remove all the .using(db) clauses and any customized code to work around other issues, like calls to the method in MultiDBMixin [in a coming quick-fix PR])
  2. Refactor the load scripts to not have database side-effects, then refactor the validation view to wrap all the load scripts in an atomic transaction
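For option 2, the wrap-everything-and-roll-back pattern can be sketched with a plain sqlite3 connection standing in for Django's `transaction.atomic` (the function and exception names here are hypothetical):

```python
import sqlite3


class DryRun(Exception):
    """Raised after all loads succeed, purely to force a rollback."""


def validate_loads(conn, load_functions):
    # Run every loader inside one transaction, then roll everything back
    # so validation never commits anything to the database.
    try:
        with conn:  # commits on success, rolls back on exception
            for load in load_functions:
                load(conn)
            raise DryRun()
    except DryRun:
        return True
```

In the Django version this would be `with transaction.atomic():` wrapped around all the load-script calls in the validation view, with the deliberate exception triggering the rollback, which only works once the load scripts have no side effects outside the transaction.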

Alternatives

See the description. A choice needs to be made before this issue can proceed.

Dependencies

none

Comment

A PR with a quick fix is coming. There is a Stack Overflow post that goes into more detail about the cross-database relations and possible solutions.

The database router solution should make it unnecessary to guard calls to full_clean with a conditional so that they only run when the database is the default database.


ISSUE OWNER SECTION

Assumptions

none

Requirements

This will be broken down into phases per load script to produce PRs that are more readily consumable. Each phase should be accompanied by new and all-working tests.

Limitations

none

Affected Components

DESIGN

Interface Change description

No changes to the user experience other than receiving more errors in one run on the validation page. The --database and --validate command-line options will go away, --debug will change to --dryrun, and an option to disable auto-update will be added.

Code Change Description

The load scripts will be changed to buffer all errors in a self.errors list attribute. Each model's load will be encapsulated in a method with required inputs (some of which may depend on a successful prior model load). If a required input comes from a previously failed model load, that load is skipped so as not to emit useless errors. There will be a new requirement to report unrecognized (optional) headers as an error.
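The buffer-and-skip logic described above could be sketched like this (hypothetical class and method names; the real loaders are considerably more involved):

```python
class BufferedLoader:
    """Sketch of a load script that buffers errors instead of dying on
    the first one, and skips loads whose prerequisites failed."""

    def __init__(self):
        self.errors = []   # every exception raised by any model load
        self.failed = set()  # names of model loads that raised

    def load_model(self, name, load_func, requires=()):
        if self.failed.intersection(requires):
            # A prerequisite load failed; skip this load rather than
            # emitting errors that are just side effects of it.
            return
        try:
            load_func()
        except Exception as e:
            self.errors.append(e)
            self.failed.add(name)
```

The point of the pattern is that one run reports every independent error, while dependent loads (e.g. animals needing protocols) stay quiet when their prerequisite already failed.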

Tests

hepcat72 commented 2 years ago

I've been reading about Django's database routers this morning on Stack Overflow and in the Django docs:

It looks like the database router class may not be well suited to having a process temporarily use a different database; it's designed for having different models use different databases. I considered using either a class variable or an instance variable to switch databases, but a class variable would not be restricted to just the validation view's load-script runs, and an instance variable would have to be set on every model object involved (which is back to the same annoying need for .using(db) everywhere).

So I'm leaning toward a refactor to accomplish a few things:

I have a post in the Django forum asking whether routers can handle our use case, because this refactor would be a much larger undertaking in comparison; if the database router could fix the issues with multiple databases, it would be a relatively small effort. Someone replied while I was composing this comment. He said you can accomplish the goal of a separate database, but by launching a separate task (e.g. using Celery) that is configured to use a different database, and he confirmed my doubts that a database router could do this. Knowing how much overhead is associated with tasks/Celery, I think we should go the refactor/atomic-transaction route.

hepcat72 commented 2 years ago

The separate-database strategy could still work, but it would require that the load scripts not call model methods that make fresh queries... And I also don't know how to deal with full_clean. Either way, a refactor is necessary to at least better consolidate errors for users.

jcmatese commented 2 years ago

Definitely leaning toward a refactor that removes the extra validation database, if the router is not process-specific. It fills me with unease that I don't know which database I might be addressing (or that it could switch depending on the context).

hepcat72 commented 2 years ago

Yeah, I have similar concerns. That's why I edited my latest PR to comment out the link to the validation page. Though the guy on the Django forum did say that you can make it process-specific by launching a new process with a separate settings file. That would certainly be quicker than a refactor, but I feel like there are other advantages to a refactor as well, so I started trying out a small refactor to at least buffer "some" errors in the sample loader.

hepcat72 commented 2 years ago

Oh, it looks like I said that already. But he did say you can just make it a simple child process. Celery is for process communication, which this wouldn't need.
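The child-process idea from the forum could be sketched like this. The settings module name below is an assumption for illustration, not the project's actual settings path; the mechanism is just that the child inherits an environment pointing Django at different settings (and thus a different default database):

```python
import os
import subprocess
import sys


def run_in_validation_process(cmd_args):
    # Launch a child Python process whose DJANGO_SETTINGS_MODULE points
    # at a settings file that makes the validation DB the default.
    # "TraceBase.settings_validation" is a hypothetical module name.
    env = dict(os.environ, DJANGO_SETTINGS_MODULE="TraceBase.settings_validation")
    return subprocess.run(
        [sys.executable, *cmd_args],
        env=env,
        capture_output=True,
        text=True,
    )
```

A real invocation would pass something like `["manage.py", "load_study", ...]`; since only the environment differs, no Celery-style inter-process communication is needed.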

hepcat72 commented 1 year ago

Incidentally, regarding the "proper tracebacks" item in the TODO list above, I gained insight into tracebacks this morning that I didn't have before. A traceback isn't created when the exception object is constructed, nor at the moment it is raised, nor when it is caught. It is built from the bottom up as the exception travels up the stack, so it stops being built when the exception is caught.

I guess I never really thought about it before. To me, it was a black box. I think I implicitly assumed that when the exception object was constructed, it captured the whole traceback, and that catching an exception sliced out the portion between the raise point and the catch. I never scrutinized that assumption; I don't think I even knew I was making it. I think I was applying my knowledge of Perl's caller() function, which I understood (or assumed) to keep a constant record of the call stack.

I gained this insight thanks to the person who answered my Stack Overflow question about why I wasn't getting the full traceback when I buffered an exception. He was spot-on in identifying the implicit assumption I was making.

hepcat72 commented 1 year ago

After running the new cold exposure data through the dev validate interface in my sandbox, I have a few things I want to address. Perhaps I'll do this in phase 4.5...

A few things that were not reported that I noticed:

CommandError: 1 errors loading protocol records from /var/folders/hb/237l358561nbh3zh5fpjwccm0000gn/T/tmpsfyxahv_.upload.xlsx - NO RECORDS SAVED: ValidationError in the default database on data row 1, creating animal_treatment record for protocol 'no treatment' with description 'No treatment was applied to the animal. Animal was housed at room temperature with a normal light cycle.': ["Protocol with name = 'no treatment' but a different description already exists: Existing description = 'no manipulation besides what is already described in other fields' New description = 'No treatment was applied to the animal. Animal was housed at room temperature with a normal light cycle.'"] ... DoesNotExist: Could not find 'animal_treatment' protocol with name 'cold 4C acute' DoesNotExist: Could not find 'animal_treatment' protocol with name 'cold 4C acute' DoesNotExist: Could not find 'animal_treatment' protocol with name 'thermoneutrality chronic' DoesNotExist: Could not find 'animal_treatment' protocol with name 'thermoneutrality chronic' DoesNotExist: Could not find 'animal_treatment' protocol with name 'cold 6C chronic' DoesNotExist: Could not find 'animal_treatment' protocol with name 'cold 6C chronic' DoesNotExist: Could not find 'animal_treatment' protocol with name 'thermoneutrality chronic'

  • [x] I should suggest submitting as an isocorr file here: FAILED: 13C-glycerol metabolomics_cor.xlsx CorrectedCompoundHeaderMissing: Compound header [Compound] not found in the accucor corrected data. Did you forget to provide --isocorr-format?

FAILED: col005c_neg_lowmz_corrected.xlsx MissingSamplesError: 3 samples are missing in the database: [col005c_HilicBlank2_scan1, col005c_HilicBlank3_scan1, col005c_LipidBlank_scan1]. Samples must be loaded prior to loading mass spec data.

FAILED: col005c_neg_lowmz_others_corrected.xlsx AssertionError: 21 compounds are missing.

FAILED: col005c_plasma_hilic_corrected.xlsx DupeCompoundIsotopeCombos: The following duplicate compound/isotope combinations were found in the corrected data: [lactate & 0 on rows: 91,95; lactate & 1 on rows: 92,96; lactate & 2 on rows: 93,97; lactate & 3 on rows: 94,98] ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']}

FAILED: col005d_tissue WAT_corrected.xlsx DupeCompoundIsotopeCombos: The following duplicate compound/isotope combinations were found in the corrected data: [lactate & 0 on rows: 148,152; lactate & 1 on rows: 149,153; lactate & 2 on rows: 150,154; lactate & 3 on rows: 151,155] TypeError: argument of type 'NoneType' is not iterable

FAILED: col005e2_glucose tissues_f16bp_corrected.xlsx ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} ...

STATUS: FAILURE col005c_neg_lowmz_corrected.xlsx AggregatedErrors Summary (1 errors / 0 warnings): EXCEPTION1(ERROR): TypeError: argument of type 'NoneType' is not iterable

The above caught exception had a partial traceback: Traceback (most recent call last): File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 927, in load_data peak_data_label.full_clean() File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 1238, in full_clean raise ValidationError(errors) django.core.exceptions.ValidationError: {'count': ['Ensure this value is less than or equal to 20.']}

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 1217, in load_accucor_data self.load_data() File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 932, in load_data if self.is_a_downstream_dupe_error( File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 1038, in is_a_downstream_dupe_error if row in self.dupe_isotope_rows[sheet] and ( TypeError: argument of type 'NoneType' is not iterable TypeError: argument of type 'NoneType' is not iterable

STATUS: FAILURE col005c_neg_lowmz_corrected.xlsx

AggregatedErrors Summary (1 errors / 0 warnings): EXCEPTION1(ERROR): TypeError: argument of type 'NoneType' is not iterable Buffered Error: File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/manage.py", line 22, in main() File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/manage.py", line 18, in main execute_from_command_line(sys.argv) File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/management/commands/load_study.py", line 233, in handle call_command( File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/management/commands/load_accucor_msruns.py", line 142, in handle loader.load_accucor_data() File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 1233, in load_accucor_data self.aggregated_errors_object.buffer_error(e) File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/exceptions.py", line 341, in buffer_error self.buffer_exception(exception, is_error=True, is_fatal=is_fatal) File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/exceptions.py", line 304, in buffer_exception buffered_tb_str = self.get_buffered_traceback_string() File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/exceptions.py", line 288, in get_buffered_traceback_string for step in traceback.format_stack()

The above caught exception had a partial traceback: Traceback (most recent call last): File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 927, in load_data peak_data_label.full_clean() File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 1238, in full_clean raise ValidationError(errors) django.core.exceptions.ValidationError: {'count': ['Ensure this value is less than or equal to 20.']}

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 1217, in load_accucor_data self.load_data() File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 932, in load_data if self.is_a_downstream_dupe_error( File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 1038, in is_a_downstream_dupe_error if row in self.dupe_isotope_rows[sheet] and ( TypeError: argument of type 'NoneType' is not iterable TypeError: argument of type 'NoneType' is not iterable

RequiredValuesError: Missing required values have been detected in the following columns: Researcher Name Collection Time Study Name Animal ID Animal Body Weight Age Sex Animal Genotype Feeding Status Diet Infusate Infusion Rate Tracer Concentrations Sample Name Note, entirely empty rows are allowed, but having a single value on a row in one sheet can cause a duplication of empty rows, so be sure you don't have stray single values in a sheet.

hepcat72 commented 1 year ago

So it turns out that MAX_LABELED_ATOMS has to be at least the number of atoms of the labeled element in the compound with the most such atoms. In the cold exposure data, that was fatty acid C34:1, which has the formula C34H66O2 (34 carbons, exceeding the current limit of 20). So I should:
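A quick sketch of the kind of check involved (a hypothetical helper, not the project's code): counting the labeled element's atoms in a formula shows why C34H66O2 blows past a limit of 20.

```python
import re


def atom_count(formula, element):
    """Count atoms of `element` in a simple molecular formula string.

    Matches the element symbol followed by an optional count, rejecting
    partial matches of two-letter symbols (e.g. "C" must not match "Cl").
    """
    m = re.search(rf"{element}(\d*)(?![a-z])", formula)
    if not m:
        return 0
    return int(m.group(1) or 1)  # no digits means a count of 1
```

For a carbon tracer, `atom_count("C34H66O2", "C")` gives 34, so a validation bound like MAX_LABELED_ATOMS would need to be at least the largest such count across the study's compounds.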

hepcat72 commented 1 year ago

All of the todo items in the comment above are now complete. There remain a couple of housekeeping items I want to keep in mind as I move on to phase 5:

Validation page improvements 5:

Next level validation improvements

hepcat72 commented 1 year ago

New changes in #678 were previously overlooked. Re-opening.