Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License

Error reporting in file loader *might* be mis-tracking lines/data? #863

Closed jcmatese closed 8 months ago

jcmatese commented 8 months ago

BUG DESCRIPTION

Loading compounds into an existing/pre-loaded database is throwing errors because of differing compound names, but the reported line numbers do not seem to match the reported data? (The last line encountered may be overwriting prior offending data?)

Problem

Executed `python manage.py load_compounds --infile tracebase-rabinowitz-data/compounds/compounds.tsv`. The error is below.

What you can see is that the "same" name(s) are throwing the "same" errors, but at different indexed lines. So `o-phosphohomoserine` != `O-phosphohomoserine` but claimed at rows 462-464, 492, 541-544. However, if you go to those lines, there are different data, and o-phosphohomoserine is just the last row.name encountered (example and full error, below).

![Screenshot 2024-02-19 at 1 42 12 PM](https://github.com/Princeton-LSI-ResearchComputing/tracebase/assets/6091114/9fb37528-ae08-40ab-88dd-f33b70b36bbf)

![Screenshot 2024-02-19 at 1 42 36 PM](https://github.com/Princeton-LSI-ResearchComputing/tracebase/assets/6091114/ab3ad1d3-fe8f-4f95-aee5-900813507902)

```
Compound records loaded: [0], skipped: [0], and errored: [8].
CompoundSynonym records loaded: [0], skipped: [0], and errored: [0].
AggregatedErrors Summary (1 errors / 0 warnings):
    EXCEPTION1(ERROR): ConflictingValueErrors: Conflicting values encountered during loading:
    During the processing of row [462] in file [tracebase-rabinowitz-data/compounds/compounds.tsv]...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'o-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
        Database record: {'id': 1234, 'name': 'O-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
            [name] values differ:
            - database: [O-phosphohomoserine]
            - file:     [o-phosphohomoserine]
    During the processing of row [463] in file [tracebase-rabinowitz-data/compounds/compounds.tsv]...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'o-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
        Database record: {'id': 1234, 'name': 'O-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
            [name] values differ:
            - database: [O-phosphohomoserine]
            - file:     [o-phosphohomoserine]
    During the processing of row [464] in file [tracebase-rabinowitz-data/compounds/compounds.tsv]...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'o-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
        Database record: {'id': 1234, 'name': 'O-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
            [name] values differ:
            - database: [O-phosphohomoserine]
            - file:     [o-phosphohomoserine]
    During the processing of row [492] in file [tracebase-rabinowitz-data/compounds/compounds.tsv]...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'o-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
        Database record: {'id': 1234, 'name': 'O-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
            [name] values differ:
            - database: [O-phosphohomoserine]
            - file:     [o-phosphohomoserine]
    During the processing of row [541] in file [tracebase-rabinowitz-data/compounds/compounds.tsv]...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'o-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
        Database record: {'id': 1234, 'name': 'O-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
            [name] values differ:
            - database: [O-phosphohomoserine]
            - file:     [o-phosphohomoserine]
    During the processing of row [542] in file [tracebase-rabinowitz-data/compounds/compounds.tsv]...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'o-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
        Database record: {'id': 1234, 'name': 'O-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
            [name] values differ:
            - database: [O-phosphohomoserine]
            - file:     [o-phosphohomoserine]
    During the processing of row [543] in file [tracebase-rabinowitz-data/compounds/compounds.tsv]...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'o-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
        Database record: {'id': 1234, 'name': 'O-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
            [name] values differ:
            - database: [O-phosphohomoserine]
            - file:     [o-phosphohomoserine]
    During the processing of row [544] in file [tracebase-rabinowitz-data/compounds/compounds.tsv]...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'o-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
        Database record: {'id': 1234, 'name': 'O-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
            [name] values differ:
            - database: [O-phosphohomoserine]
            - file:     [o-phosphohomoserine]
Traceback (most recent call last):
  File "/Users/jcmatese/dev/tracebase/manage.py", line 22, in <module>
    main()
  File "/Users/jcmatese/dev/tracebase/manage.py", line 18, in main
    execute_from_command_line(sys.argv)
  File "/Users/jcmatese/mambaforge/lib/python3.10/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/Users/jcmatese/mambaforge/lib/python3.10/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/Users/jcmatese/mambaforge/lib/python3.10/site-packages/django/core/management/base.py", line 412, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/Users/jcmatese/mambaforge/lib/python3.10/site-packages/django/core/management/base.py", line 458, in execute
    output = self.handle(*args, **options)
  File "/Users/jcmatese/dev/tracebase/DataRepo/management/commands/load_table.py", line 194, in handle_wrapper
    raise self.saved_aes
  File "/Users/jcmatese/dev/tracebase/DataRepo/management/commands/load_table.py", line 178, in handle_wrapper
    retval = fn(self, *args, **options)
  File "/Users/jcmatese/dev/tracebase/DataRepo/management/commands/load_compounds.py", line 55, in handle
    self.load_data(
  File "/Users/jcmatese/dev/tracebase/DataRepo/management/commands/load_table.py", line 392, in load_data
    return self.loader.load_data()
  File "/Users/jcmatese/dev/tracebase/DataRepo/utils/loader.py", line 1005, in load_wrapper
    raise self.aggregated_errors_object
DataRepo.utils.exceptions.AggregatedErrors: 1 exceptions occurred, including type(s): [ConflictingValueErrors].
AggregatedErrors Summary (1 errors / 0 warnings):
    EXCEPTION1(ERROR): ConflictingValueErrors: Conflicting values encountered during loading:
    During the processing of row [462] in file [tracebase-rabinowitz-data/compounds/compounds.tsv]...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'o-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
        Database record: {'id': 1234, 'name': 'O-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
            [name] values differ:
            - database: [O-phosphohomoserine]
            - file:     [o-phosphohomoserine]
    During the processing of row [463] in file [tracebase-rabinowitz-data/compounds/compounds.tsv]...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'o-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
        Database record: {'id': 1234, 'name': 'O-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
            [name] values differ:
            - database: [O-phosphohomoserine]
            - file:     [o-phosphohomoserine]
    During the processing of row [464] in file [tracebase-rabinowitz-data/compounds/compounds.tsv]...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'o-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
        Database record: {'id': 1234, 'name': 'O-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
            [name] values differ:
            - database: [O-phosphohomoserine]
            - file:     [o-phosphohomoserine]
    During the processing of row [492] in file [tracebase-rabinowitz-data/compounds/compounds.tsv]...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'o-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
        Database record: {'id': 1234, 'name': 'O-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
            [name] values differ:
            - database: [O-phosphohomoserine]
            - file:     [o-phosphohomoserine]
    During the processing of row [541] in file [tracebase-rabinowitz-data/compounds/compounds.tsv]...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'o-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
        Database record: {'id': 1234, 'name': 'O-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
            [name] values differ:
            - database: [O-phosphohomoserine]
            - file:     [o-phosphohomoserine]
    During the processing of row [542] in file [tracebase-rabinowitz-data/compounds/compounds.tsv]...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'o-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
        Database record: {'id': 1234, 'name': 'O-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
            [name] values differ:
            - database: [O-phosphohomoserine]
            - file:     [o-phosphohomoserine]
    During the processing of row [543] in file [tracebase-rabinowitz-data/compounds/compounds.tsv]...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'o-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
        Database record: {'id': 1234, 'name': 'O-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
            [name] values differ:
            - database: [O-phosphohomoserine]
            - file:     [o-phosphohomoserine]
    During the processing of row [544] in file [tracebase-rabinowitz-data/compounds/compounds.tsv]...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'o-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
        Database record: {'id': 1234, 'name': 'O-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
            [name] values differ:
            - database: [O-phosphohomoserine]
            - file:     [o-phosphohomoserine]
```

Steps to reproduce

  1. Numbered
  2. Steps

Current behavior

None provided

Expected behavior

None provided

Suggested Change

None provided

Comment

None


ISSUE OWNER SECTION

Assumptions

  1. List of assumptions made WRT the code
  2. E.g. We will assume input is correct (explaining why there is no validation)

Limitations

  1. A list of things this work will specifically not do
  2. E.g. This feature will only handle the most frequent use case X

Affected Components

  • change: File path or DB table ...
  • add: Environment variable or server setting
  • delete: External executable or cron job

Requirements

  • [ ] 1. List of numbered conditions to be met for the feature
  • [ ] 2. E.g. Every column/row must display a value, i.e. cannot be empty
  • [ ] 3. Numbers for reference & checkboxes for progress tracking

DESIGN

GUI Change description

None provided

Code Change Description

None provided

Tests

  • [ ] 1. A description of at least one test for each requirement above.
  • [ ] 2. E.g. Test for req 2 that there's an exception when display value is ''
  • [ ] 3. Numbers for reference & checkboxes for progress tracking

lparsons commented 8 months ago

Just a quick FYI, someone can go ahead and merge https://github.com/PrincetonUniversity/tracebase-rabinowitz-data/pull/98 if that will help work around this issue while we work on getting it fixed.

jcmatese commented 8 months ago

What I mean to say is all those lines probably have the same category of aggregated error, but the reporting/association of the data might be incorrect?

jcmatese commented 8 months ago

Yes, I can work around it by dropping my local database, but I just wanted to report it, as it will likely pop up in all the loaders (if they all inherit/use that aggregated error reporting).

hepcat72 commented 8 months ago

The row numbers not being accurate is a known and documented minor issue that will be fixed in the current refactor. The error itself, however, is correct. The refactor should also condense the error summary so that it doesn't spit out so many lines about the same issue.

The question is whether example data with different case has contaminated your database or whether the case difference is in the file you're loading.

hepcat72 commented 8 months ago

So does "O-phosphohomoserine" exist in your input file? If not, I suspect leftovers from a previous differing load.
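One quick way to answer that question is to scan the input file for names that match when case is ignored. The following is a minimal sketch, not part of the tracebase code: the helper name `find_case_variants`, the `Compound` column header, and the convention of counting the header as row 1 are all assumptions made for illustration.

```python
import csv
import io


def find_case_variants(tsv_text, target, column):
    """Return (row_number, value) pairs whose value equals target ignoring case.

    Row numbers are 1-based and count the header line as row 1. That numbering
    is an assumption of this sketch, not necessarily what the loader reports.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = []
    for row_number, row in enumerate(reader, start=2):
        value = row[column]
        if value.casefold() == target.casefold():
            hits.append((row_number, value))
    return hits


# Tiny in-memory stand-in for a compounds file (hypothetical header names).
sample = (
    "Compound\tFormula\tHMDB ID\n"
    "o-cresol\tC7H8O\tHMDB0002055\n"
    "o-phosphohomoserine\tC4H10NO6P\tHMDB0003484\n"
)
print(find_case_variants(sample, "O-phosphohomoserine", "Compound"))
```

If the only hits differ from the database value in case, the conflict most likely comes from records left over from an earlier load rather than from the file itself.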

jcmatese commented 8 months ago

o-phosphohomoserine is line 544 of that file. That is the correct error report for that line. The issue is that the other lines are also reporting o-phosphohomoserine, but they should probably be reporting other data like

o-acetylserine  C5H9NO4 HMDB0003011 
o-cresol    C7H8O   HMDB0002055 
o-phosphoethanolamine   C2H8NO4P    HMDB0000224 

Presumably also with case differences...

hepcat72 commented 8 months ago

OK. That could be correct. We just spoke. To summarize, I think you may be right and the reported lines do have errors, but I think the problem is just in the summarization code that takes all the buffered ConflictingValueError objects and puts them into a single ConflictingValueErrors object. Thanks for the clarification.

A minor bug. Should be a quick fix. And I should be able to fix that shortly. But as long as we're aware of what's happening, you should be able to proceed.

And I suspect that perhaps this came about because you loaded data, edited it for case, and loaded again. That's when you would run into these conflicting value errors.
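The kind of summarization mistake discussed above, where every reported row shows the data of the last conflict encountered, can be illustrated with a small sketch. This is a hypothetical reconstruction, not the actual tracebase code: the `Conflict` class, both functions, and the sample values are invented for illustration, and the real cause in the loader may differ.

```python
from dataclasses import dataclass


@dataclass
class Conflict:
    """Hypothetical stand-in for one buffered conflict (not the tracebase class)."""

    row: int
    file_name: str
    db_name: str


def summarize_buggy(conflicts):
    # Bug: only the row numbers are collected per conflict. The loop variable
    # `conflict` is then reused after the loop, so every summary line shows
    # the names from the last conflict that was buffered.
    rows = []
    for conflict in conflicts:
        rows.append(conflict.row)
    return [
        f"row [{row}]: file [{conflict.file_name}] vs database [{conflict.db_name}]"
        for row in rows
    ]


def summarize_fixed(conflicts):
    # Each summary line is built from its own conflict object.
    return [
        f"row [{c.row}]: file [{c.file_name}] vs database [{c.db_name}]"
        for c in conflicts
    ]


# Illustrative values only.
demo = [
    Conflict(543, "o-cresol", "O-cresol"),
    Conflict(544, "o-phosphohomoserine", "O-phosphohomoserine"),
]
print("\n".join(summarize_buggy(demo)))
print("\n".join(summarize_fixed(demo)))
```

In the buggy version both lines mention o-phosphohomoserine even though the row numbers differ, which matches the symptom reported in this issue. The fixed version keeps each row paired with its own data.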

hepcat72 commented 8 months ago

I have this reporting error fixed. Found a few additional related minor bugs having to do with stats reporting.

New output will look like this:

DataRepo.utils.exceptions.AggregatedErrors: 2 exceptions occurred, including type(s): [ConflictingValueErrors, DuplicateValueErrors].
AggregatedErrors Summary (2 errors / 0 warnings):
    EXCEPTION1(ERROR): ConflictingValueErrors: Conflicting values encountered during loading:
    During the processing of file [/Users/rleach/Temporary/compounds_2wksago.tsv]...
    Creation of the following Compound record(s) encountered conflicts:
        File record:     {'name': 'M-aminobenzoic acid', 'formula': 'C7H7NO2', 'hmdb_id': 'HMDB0001891'} (on rows: 464)
        Database record: {'id': 911, 'name': 'm-aminobenzoic acid', 'formula': 'C7H7NO2', 'hmdb_id': 'HMDB0001891'}
            [name] values differ:
            - database: [m-aminobenzoic acid]
            - file:     [M-aminobenzoic acid]
        File record:     {'name': 'M-coumaric acid', 'formula': 'C9H8O3', 'hmdb_id': 'HMDB0001713'} (on rows: 465)
        Database record: {'id': 912, 'name': 'm-coumaric acid', 'formula': 'C9H8O3', 'hmdb_id': 'HMDB0001713'}
            [name] values differ:
            - database: [m-coumaric acid]
            - file:     [M-coumaric acid]
        File record:     {'name': 'M-cresol', 'formula': 'C7H8O', 'hmdb_id': 'HMDB0002048'} (on rows: 466)
        Database record: {'id': 913, 'name': 'm-cresol', 'formula': 'C7H8O', 'hmdb_id': 'HMDB0002048'}
            [name] values differ:
            - database: [m-cresol]
            - file:     [M-cresol]
        File record:     {'name': 'monoacylglycerol NA(22:4)', 'formula': 'C25H41O4Na', 'hmdb_id': 'FakeHMDB050'} (on rows: 494)
        Database record: {'id': 941, 'name': 'monoacylglycerol Na(22:4)', 'formula': 'C25H41O4Na', 'hmdb_id': 'FakeHMDB050'}
            [name] values differ:
            - database: [monoacylglycerol Na(22:4)]
            - file:     [monoacylglycerol NA(22:4)]
        File record:     {'name': 'O-acetylserine', 'formula': 'C5H9NO4', 'hmdb_id': 'HMDB0003011'} (on rows: 543)
        Database record: {'id': 990, 'name': 'o-acetylserine', 'formula': 'C5H9NO4', 'hmdb_id': 'HMDB0003011'}
            [name] values differ:
            - database: [o-acetylserine]
            - file:     [O-acetylserine]
        File record:     {'name': 'O-cresol', 'formula': 'C7H8O', 'hmdb_id': 'HMDB0002055'} (on rows: 544)
        Database record: {'id': 991, 'name': 'o-cresol', 'formula': 'C7H8O', 'hmdb_id': 'HMDB0002055'}
            [name] values differ:
            - database: [o-cresol]
            - file:     [O-cresol]
        File record:     {'name': 'O-phosphoethanolamine', 'formula': 'C2H8NO4P', 'hmdb_id': 'HMDB0000224'} (on rows: 545)
        Database record: {'id': 992, 'name': 'o-phosphoethanolamine', 'formula': 'C2H8NO4P', 'hmdb_id': 'HMDB0000224'}
            [name] values differ:
            - database: [o-phosphoethanolamine]
            - file:     [O-phosphoethanolamine]
        File record:     {'name': 'O-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'} (on rows: 546)
        Database record: {'id': 993, 'name': 'o-phosphohomoserine', 'formula': 'C4H10NO6P', 'hmdb_id': 'HMDB0003484'}
            [name] values differ:
            - database: [o-phosphohomoserine]
            - file:     [O-phosphohomoserine]

    EXCEPTION2(ERROR): DuplicateValueErrors: The following unique column(s) (or column combination(s)) were found to have duplicate occurrences on the indicated rows:
    file [/Users/rleach/Temporary/compounds_2wksago.tsv]
        Column(s) ['HMDB ID']
            HMDB0000143 (rows*: 227, 330)
            HMDB0000283 (rows*: 233, 606)
        Column(s) ['Synonyms']
            NA (rows*: 154, 158)
            C (rows*: 211, 220)

Scroll up to see tracebacks for these exceptions printed as they were encountered.
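The DuplicateValueErrors part of the summary lists each repeated value together with the rows it occurs on. A minimal sketch of that kind of grouping follows, assuming the column has already been read into a list; the function name `find_duplicates` and the `first_row` default are hypothetical and not taken from the loader.

```python
from collections import defaultdict


def find_duplicates(values, first_row=2):
    """Map each value that occurs more than once to the rows it occurs on.

    `first_row` is the row number of the first data value. The default of 2
    assumes a single header line, which is an assumption of this sketch.
    """
    rows_by_value = defaultdict(list)
    for row_number, value in enumerate(values, start=first_row):
        rows_by_value[value].append(row_number)
    return {v: rows for v, rows in rows_by_value.items() if len(rows) > 1}


# Illustrative column contents; the row positions here are made up.
hmdb_ids = ["HMDB0000143", "HMDB0000283", "HMDB0000143"]
for value, rows in find_duplicates(hmdb_ids).items():
    print(f"{value} (rows: {', '.join(map(str, rows))})")
```

Grouping by value first means each duplicate is reported once with all of its rows, rather than once per offending row.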

And I would like to point out that the output above any line that starts with "AggregatedErrors Summary" consists of tracebacks:

  1. The trace for the AggregatedErrors exception itself (the trace immediately above that line).
  2. The traces for each error contained in the "AggregatedErrors Summary", i.e. the trace at the time each error was buffered. Those traces are only useful for debugging the code. There's no reason to look at them if you are debugging erroneous data. All the relevant information should be contained in the summary, unless there is a bug.
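The buffer-then-summarize pattern described in this list can be sketched roughly as follows. This is an illustration under stated assumptions, not the `DataRepo.utils.exceptions.AggregatedErrors` implementation: the class `AggregatedErrorsSketch`, its methods, and the `load` function are hypothetical names.

```python
import traceback


class AggregatedErrorsSketch(Exception):
    """Hypothetical stand-in; not the DataRepo.utils.exceptions class."""

    def __init__(self):
        super().__init__()
        self.buffered = []  # list of (exception, trace text) pairs

    def buffer_error(self, exc):
        # Capture the call stack at buffering time. This text is only useful
        # to developers debugging the code, not to users fixing their data.
        trace = "".join(traceback.format_stack())
        self.buffered.append((exc, trace))

    def summary(self):
        # The summary carries everything a user needs to fix the input data.
        lines = [f"AggregatedErrors Summary ({len(self.buffered)} errors):"]
        for number, (exc, _trace) in enumerate(self.buffered, start=1):
            lines.append(f"    EXCEPTION{number}(ERROR): {type(exc).__name__}: {exc}")
        return "\n".join(lines)

    def __str__(self):
        return self.summary()


def load(names):
    """Check every row, buffering problems instead of stopping at the first."""
    errors = AggregatedErrorsSketch()
    for row_number, name in enumerate(names, start=2):
        if not name:
            errors.buffer_error(ValueError(f"empty name on row {row_number}"))
    if errors.buffered:
        raise errors


try:
    load(["glucose", "", "lactate", ""])
except AggregatedErrorsSketch as aes:
    summary_text = aes.summary()
    print(summary_text)
```

Buffering lets a single run report every problem in the file, at the cost of having to keep each error paired with the data that caused it when the summary is built.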