Closed hepcat72 closed 1 year ago
I've been reading about Django's database routers this morning on Stack Overflow and in the Django docs:
It looks like the database router class may not be well suited for having a process temporarily use a different database; it seems intended for having different models use different databases. I wondered about possibly creating either a class variable or an instance variable to use to switch databases, but I don't think a class variable would be restricted to just the validation view's load-script runs, and an instance variable would have to be set for every model object involved (which is back to the same annoying need for `.using(db)` everywhere...)
So I'm leaning toward a refactor to accomplish a few things:
I have a post in the Django forum asking about routers and if they can handle our use-case, because this refactor would be a much larger undertaking in comparison. If the database router could fix the issues with multiple databases, I think it would be a relatively small effort. And someone replied while I was composing this comment. He said you can accomplish the goal of a separate database, but by launching a separate task (e.g. using celery) that is configured to use a different database. He confirmed my doubts that a database router could do this. And knowing how much overhead there is associated with tasks/celery, I think that we should go the refactor/atomic-transaction route.
The separate database strategy could still work, but it would require that the load scripts don't call model methods that make fresh queries... And I also don't know how to deal with full_clean. Either way, a refactor would be necessary to at least better consolidate errors for users.
Definitely leaning towards a refactor that removes the extra validation database, if the router is not process-specific. It fills me with unease that I don't know what database I might be addressing (or that it could switch, depending on the context).
Yeah, I have similar concerns. That's why I edited my latest PR to comment out the link to the validation page. Though the guy on the Django forum did say that you can make it process-specific by launching a new process with a separate settings file. That would certainly be quicker than a refactor, but I feel like there are other advantages to a refactor as well, so I started trying out a small refactor to at least buffer "some" errors in the sample loader.
Oh, it looks like I said that already. But he did say you can just make it a simple child process. Celery is for process communication, which this wouldn't need.
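The child-process approach he described can be sketched roughly like this; the settings module name and the manage.py invocation below are assumptions for illustration, not actual TraceBase names:

```python
import os
import subprocess
import sys

def run_with_settings(argv, settings_module):
    """Launch a child process with DJANGO_SETTINGS_MODULE overridden, so
    only the child uses the alternate (validation-database) settings. The
    parent process's database configuration is untouched."""
    env = os.environ.copy()
    env["DJANGO_SETTINGS_MODULE"] = settings_module
    return subprocess.run(argv, env=env, capture_output=True, text=True)

# Hypothetical usage (command and settings-module names are assumed):
# result = run_with_settings(
#     [sys.executable, "manage.py", "load_study", "study_dir/"],
#     "config.settings_validation",
# )
```

No celery or inter-process communication is needed for this; the child just runs to completion and its exit code/output are inspected by the parent.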
`exception.tb_str`). Might need to always do a raise/except everywhere.
`TraceBaseLoader` superclass or other

Incidentally, regarding the "proper tracebacks" item in the TODO list above, I gained insight into tracebacks this morning that I didn't understand before. A traceback isn't created when the exception object is constructed. It's not created at the moment it's raised. And it's not created when it's caught. It's built from the bottom up as the exception travels up the stack, so it stops growing at the point where it's caught.
I guess I never really thought about it before. To me, it was a black box. I think I implicitly assumed that when the exception object was constructed, it constructed the whole traceback, and that when you caught an exception, it sliced the difference between the exception point and the catch. And I never scrutinized my assumption. I don't think I ever even knew I was assuming it. And I think I was applying my knowledge of the caller()
function in perl, which I understood(/assumed?) to keep a constant record of the call stack.
I gained this insight thanks to the person who answered my stack question about why I wasn't getting the full traceback when I buffered an exception. He was spot on in understanding the (implicit) assumption I was making.
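The behavior is easy to demonstrate: the same exception caught at different depths carries tracebacks of different lengths, because entries are appended as the exception propagates upward:

```python
import traceback

def inner():
    raise ValueError("boom")

def outer():
    inner()

# Caught one call away from the raise: a short traceback.
try:
    inner()
except ValueError as e:
    short_frames = len(traceback.extract_tb(e.__traceback__))

# Caught two calls away from the raise: the traceback has exactly one more
# entry, because the extra frame (outer) was appended during propagation.
try:
    outer()
except ValueError as e:
    long_frames = len(traceback.extract_tb(e.__traceback__))

print(short_frames, long_frames)  # long_frames == short_frames + 1
```

This is also why a buffered exception only carries a partial traceback: once it's caught (to be buffered), the traceback stops growing, and the frames above the catch point are never recorded on it.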
After running the new cold exposure data through the dev validate interface in my sandbox, I have a few things I want to address. Perhaps I'll do this in phase 4.5...
A few things that were not reported that I noticed:
CommandError: 1 errors loading protocol records from /var/folders/hb/237l358561nbh3zh5fpjwccm0000gn/T/tmpsfyxahv_.upload.xlsx - NO RECORDS SAVED:
ValidationError in the default database on data row 1, creating animal_treatment record for protocol 'no treatment' with description 'No treatment was applied to the animal. Animal was housed at room temperature with a normal light cycle.': ["Protocol with name = 'no treatment' but a different description already exists: Existing description = 'no manipulation besides what is already described in other fields' New description = 'No treatment was applied to the animal. Animal was housed at room temperature with a normal light cycle.'"]
...
DoesNotExist: Could not find 'animal_treatment' protocol with name 'cold 4C acute'
DoesNotExist: Could not find 'animal_treatment' protocol with name 'cold 4C acute'
DoesNotExist: Could not find 'animal_treatment' protocol with name 'thermoneutrality chronic'
DoesNotExist: Could not find 'animal_treatment' protocol with name 'thermoneutrality chronic'
DoesNotExist: Could not find 'animal_treatment' protocol with name 'cold 6C chronic'
DoesNotExist: Could not find 'animal_treatment' protocol with name 'cold 6C chronic'
DoesNotExist: Could not find 'animal_treatment' protocol with name 'thermoneutrality chronic'
- [x] I should suggest submitting as an isocorr file here: FAILED: 13C-glycerol metabolomics_cor.xlsx CorrectedCompoundHeaderMissing: Compound header [Compound] not found in the accucor corrected data. Did you forget to provide --isocorr-format?
FAILED: col005c_neg_lowmz_corrected.xlsx MissingSamplesError: 3 samples are missing in the database: [col005c_HilicBlank2_scan1, col005c_HilicBlank3_scan1, col005c_LipidBlank_scan1]. Samples must be loaded prior to loading mass spec data.
FAILED: col005c_neg_lowmz_others_corrected.xlsx AssertionError: 21 compounds are missing.
When there is a `DupeCompoundIsotopeCombos` error, the following `ValidationError`s should be ignored:

FAILED: col005c_plasma_hilic_corrected.xlsx
DupeCompoundIsotopeCombos: The following duplicate compound/isotope combinations were found in the corrected data: [lactate & 0 on rows: 91,95; lactate & 1 on rows: 92,96; lactate & 2 on rows: 93,97; lactate & 3 on rows: 94,98]
ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} [this line repeated 12 times]

FAILED: col005d_tissue WAT_corrected.xlsx
DupeCompoundIsotopeCombos: The following duplicate compound/isotope combinations were found in the corrected data: [lactate & 0 on rows: 148,152; lactate & 1 on rows: 149,153; lactate & 2 on rows: 150,154; lactate & 3 on rows: 151,155]
TypeError: argument of type 'NoneType' is not iterable

FAILED: col005e2_glucose tissues_f16bp_corrected.xlsx
ValidationError: {'all': ['Mass spectrometry run with this Researcher, Date, Protocol and Sample already exists.']} [this line repeated 3 times]
...
`ValidationError`s raised in `PeakDataLabel.clean()` and `TracerLabel.clean()`
STATUS: FAILURE col005c_neg_lowmz_corrected.xlsx
AggregatedErrors Summary (1 errors / 0 warnings):
EXCEPTION1(ERROR): TypeError: argument of type 'NoneType' is not iterable

The above caught exception had a partial traceback:
Traceback (most recent call last):
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 927, in load_data
    peak_data_label.full_clean()
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 1238, in full_clean
    raise ValidationError(errors)
django.core.exceptions.ValidationError: {'count': ['Ensure this value is less than or equal to 20.']}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 1217, in load_accucor_data
    self.load_data()
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 932, in load_data
    if self.is_a_downstream_dupe_error(
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 1038, in is_a_downstream_dupe_error
    if row in self.dupe_isotope_rows[sheet] and (
TypeError: argument of type 'NoneType' is not iterable
TypeError: argument of type 'NoneType' is not iterable

STATUS: FAILURE col005c_neg_lowmz_corrected.xlsx
AggregatedErrors Summary (1 errors / 0 warnings):
EXCEPTION1(ERROR): TypeError: argument of type 'NoneType' is not iterable
Buffered Error:
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/manage.py", line 22, in <module>
    main()
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/manage.py", line 18, in main
    execute_from_command_line(sys.argv)
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/management/commands/load_study.py", line 233, in handle
    call_command(
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/management/commands/load_accucor_msruns.py", line 142, in handle
    loader.load_accucor_data()
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 1233, in load_accucor_data
    self.aggregated_errors_object.buffer_error(e)
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/exceptions.py", line 341, in buffer_error
    self.buffer_exception(exception, is_error=True, is_fatal=is_fatal)
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/exceptions.py", line 304, in buffer_exception
    buffered_tb_str = self.get_buffered_traceback_string()
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/exceptions.py", line 288, in get_buffered_traceback_string
    for step in traceback.format_stack()

The above caught exception had a partial traceback:
Traceback (most recent call last):
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 927, in load_data
    peak_data_label.full_clean()
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 1238, in full_clean
    raise ValidationError(errors)
django.core.exceptions.ValidationError: {'count': ['Ensure this value is less than or equal to 20.']}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 1217, in load_accucor_data
    self.load_data()
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 932, in load_data
    if self.is_a_downstream_dupe_error(
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/utils/accucor_data_loader.py", line 1038, in is_a_downstream_dupe_error
    if row in self.dupe_isotope_rows[sheet] and (
TypeError: argument of type 'NoneType' is not iterable
TypeError: argument of type 'NoneType' is not iterable
RequiredValuesError: Missing required values have been detected in the following columns: Researcher Name Collection Time Study Name Animal ID Animal Body Weight Age Sex Animal Genotype Feeding Status Diet Infusate Infusion Rate Tracer Concentrations Sample Name Note, entirely empty rows are allowed, but having a single value on a row in one sheet can cause a duplication of empty rows, so be sure you don't have stray single values in a sheet.
So it turns out that `MAX_LABELED_ATOMS` has to be the max number of atoms in the compound with the most atoms that have a label applied. In the cold exposure data, that was fatty acid C34:1, which has a formula of C34H66O2. So I should:
- `MAX_LABELED_ATOMS`. It should catch this case:
  `ValidationError: {'count': ['Ensure this value is less than or equal to 30.']}`
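As a rough illustration of where that number comes from, here is a small sketch (the `atom_count` helper is hypothetical, not from TraceBase, and assumes each element symbol appears at most once per formula, as in standard Hill notation) that derives the required `MAX_LABELED_ATOMS` from a set of formulas:

```python
import re

def atom_count(formula, element):
    """Count atoms of one element in a molecular formula like 'C34H66O2'.
    The negative lookahead rejects two-letter symbols (e.g. 'Cl' for 'C')."""
    m = re.search(rf"{element}(\d*)(?![a-z])", formula)
    if m is None:
        return 0
    return int(m.group(1)) if m.group(1) else 1

# Fatty acid C34:1 (formula C34H66O2) has 34 carbons, so a carbon label can
# land on up to 34 atoms; MAX_LABELED_ATOMS must be at least that maximum.
formulas = ["C6H12O6", "C34H66O2", "C3H6O3"]
max_labeled_atoms = max(atom_count(f, "C") for f in formulas)
print(max_labeled_atoms)  # → 34
```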
All of the todo items in the comment above are now complete. There remain a couple of housekeeping items I want to keep in mind as I move on to phase 5:
# TODO: This DOES change the database. See comments on issue #345
Validation page improvements 5:
`orig_row_idx` doesn't necessarily indicate rows where we run into duplicates in the orig data. We run into duplicates in the corrected data that is processed in concert with the orig data, and in that context, we don't have the row index for the corrected data. It's being indexed by column name and filtering. I should probably try to skip that accucor data entirely, or track it via compound and label count instead of index. Also note that the error is WRT the corrected data even though it's inside the original-sheet conditional.

Next level validation improvements:
- `ExistingMSRun` to not list samples if all samples in common with another accucor/isocorr file account for all samples in the current accucor/isocorr file
- `tissues_loader`
New changes in #678 were previously overlooked. Re-opening.
FEATURE REQUEST
Inspiration
From learning about cross-database relation exceptions and the difficulty surrounding using multiple databases (i.e. the validation database) to handle load scripts with side-effects, we discover we have a choice to make on how to handle the validation of data without committing that data to the database.
Manually including `.using(db)` clauses introduces undesirable overhead, and turns out to have challenges relating to instance methods that make new queries (e.g. any method that makes a call like `Model.objects...`). Those fresh database queries will use the default database even though the object containing the method was created via a query to the validation database.

In learning about how to handle this, it was discovered that you can override Django's default "database router" by creating your own derived class that overrides its 4 methods. This would make all the usages of `.using(db)` clauses unnecessary. All you would have to do is tell the router to direct queries to the validation database.

However, the only reason we're using a second database in the first place is 2-fold:
There's also the issue of the load scripts stopping mid-load, causing the user to have to run the validation iteratively to learn of each issue 1 by 1, so either way, the load scripts need to be refactored.
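For reference, the router override described above could look something like the sketch below (not TraceBase code; the `validation` alias is an assumed name for a second `settings.DATABASES` entry). Note that a router is class-level configuration applied to every query in the process, which is exactly why it can't easily be made specific to a single request or load run:

```python
# Sketch of a custom Django database router overriding the four routing
# methods. Alias and module names are assumptions for illustration.
class ValidationRouter:
    """Route every read and write to the 'validation' database alias."""

    def db_for_read(self, model, **hints):
        return "validation"

    def db_for_write(self, model, **hints):
        return "validation"

    def allow_relation(self, obj1, obj2, **hints):
        # Both objects live in the validation database under this router.
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Only migrate schema into the validation database.
        return db == "validation"

# Activated (hypothetically) in settings.py with:
# DATABASE_ROUTERS = ["myapp.routers.ValidationRouter"]
```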
Description
The choices for implementation are:
Alternatives
See the description. A choice needs to be made before this issue can proceed.
Dependencies
none
Comment
A PR with a quick fix is coming. There is a stack post that goes into more detail about the cross-db relations and possible solutions.
The database router solution should make it unnecessary to put calls to full_clean behind a conditional so that it only runs when the database is the default database.
ISSUE OWNER SECTION
Assumptions
none
Requirements
This will be broken down into phases per load script to produce PRs that are more readily consumable. Each phase should be accompanied by new and all-working tests.
(when not run in `--validate` mode)
`.using(db)`, `.save(using=db)`, and `if self.db == settings.TRACEBASE_DB: rec.full_clean()`
Limitations
none
Affected Components
DataRepo/management/commands/load*
DataRepo/utils/*load*
Tracebase/settings.py
DESIGN
Interface Change description
No changes in user experience other than receiving more errors in one run on the validation page. The `--database` and `--validate` options will go away on the command line. `--debug` will change to `--dryrun`. An option to disable auto-update will be added.
Code Change Description
The load scripts will be changed to buffer all errors in a self.errors array attribute. Different model loads will be encapsulated in methods, with required inputs (some of which may require successful prior model loads). If a required input is from a previous failed model load, the model load is skipped so as to not output useless errors. There will be a new requirement to report unrecognized (optional) headers as an error.
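A minimal sketch of that buffer-and-skip pattern (class and method names here are illustrative, not TraceBase's actual API):

```python
# Buffer errors and skip dependent loads instead of dying on the first
# exception, so one run surfaces as many actionable errors as possible.
class BufferedLoader:
    def __init__(self):
        self.errors = []           # every error collected across the run
        self.failed_steps = set()  # steps whose outputs can't be trusted

    def run_step(self, name, func, requires=()):
        """Run one model-load step. If a prerequisite step already failed,
        skip this one so we don't emit useless follow-on errors."""
        if any(req in self.failed_steps for req in requires):
            self.failed_steps.add(name)
            return None
        try:
            return func()
        except Exception as e:
            self.errors.append(e)
            self.failed_steps.add(name)
            return None

def load_tissues():
    raise ValueError("bad tissue row")  # simulate a failed model load

def load_samples():
    return "samples loaded"  # would normally depend on tissue records

loader = BufferedLoader()
loader.run_step("tissues", load_tissues)
loader.run_step("samples", load_samples, requires=("tissues",))
print(len(loader.errors))  # → 1: only the tissue error; samples was skipped
```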
Tests
1. Load scripts generate as many actionable errors as possible in 1 run - test that `self.errors` can contain multiple errors
2. If a load action in the script requires input involved in a previous error, skip that load action - test that (e.g.) a tissue error will not be followed by a sample error in the same run
3. Make it possible to control the verbosity of the loads - unnecessary test
4. Raise error on unknown headers - input a bad header for an optional column and check that an error about it is raised
5. Make it possible to defer mass auto-updates outside of the load scripts (so it can be run by either load_study or load_study_set) - try deferring auto-updates using the new option and check that the buffer is populated after the run
6. Dry-run mode has no database side-effects (when not run in `--validate` mode) - test unnecessary (see 6.1)
    6.1. Wrap loading code in an atomic transaction - run a successful load and check that nothing in the DB has changed
7. Validation page works on the main database - existing tests should suffice
8. Remove the validation database - tests unnecessary
    8.1. Remove `.using(db)`, `.save(using=db)`, and `if self.db == settings.TRACEBASE_DB: rec.full_clean()` - tests unnecessary
    8.2. Revert database settings - tests unnecessary
    8.3. Update docs - tests unnecessary
9. Add the ability to run load_study inside the validation interface instead of just individual loaders
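Test 6.1's dry-run-via-rollback idea can be sketched generically. In Django the load would be wrapped in `with transaction.atomic():`; here a tiny stand-in `FakeDB` (an assumption, not a real API) makes the control flow of raising a sentinel exception to force a rollback visible:

```python
# Run the whole load inside a transaction, then raise a sentinel exception
# at the end so the transaction is rolled back: validation without commits.
class DryRunRollback(Exception):
    """Sentinel raised after a successful load to roll everything back."""

class FakeDB:
    """Minimal stand-in for a transactional database connection."""
    def __init__(self):
        self.rows = []       # committed state
        self._pending = []   # uncommitted writes in the open transaction
    def begin(self):
        self._pending = []
    def insert(self, table, value):
        self._pending.append((table, value))
    def commit(self):
        self.rows.extend(self._pending)
        self._pending = []
    def rollback(self):
        self._pending = []

def load_study(db, dry_run=False):
    db.begin()
    try:
        db.insert("protocol", "no treatment")  # the actual load work
        if dry_run:
            raise DryRunRollback()
        db.commit()
    except DryRunRollback:
        # The load validated cleanly, but nothing is persisted.
        db.rollback()
    # Any other exception propagates, as a real load error would.
```

The corresponding test is then exactly "run a successful load and check that nothing in the DB has changed" when `dry_run=True`.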