Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License
4 stars 1 forks source link

Catch missing sequence issues before ArchiveFile records created #976

Closed hepcat72 closed 3 months ago

hepcat72 commented 4 months ago

FEATURE REQUEST

Inspiration

While fixing a bug in the MSRunsLoader, we ended up with archive files that were created and never linked to existing records. The theory is that these were created when there was a typo in the sequence information provided. The script failed when it tried to retrieve the sequence record and everything was rolled back. However, as we learned, deleting records (or rolling back from an atomic transaction), files created by FileField/FieldFile are not deleted.

Description

We should check that the sequence records exist before creating any ArchiveFile records and skip their creation if it doesn't exist.

Alternatives

Another option we discussed, which I personally think would be a good additional step to take (because there are other cases where a downstream exception unrelated to sequence records could still happen, cause a rollback, and leave vestigial files), would be to override the delete() method to clean up.

Dependencies

None

Comment

Some notes from various shell commends when handling this issue:

In [1]: from DataRepo.models import ArchiveFile

In [2]: from DataRepo.utils.file_utils import string_to_datetime

In [3]: dt = string_to_datetime("2024-05-27")

In [4]: dt
Out[4]: datetime.datetime(2024, 5, 27, 0, 0)

In [11]: qs = ArchiveFile.objects.filter(imported_timestamp__gt=dt).exclude(filename__contains="Dupe")

In [12]: qs.count()
Out[12]: 12

In [13]: for rec in qs.all():
    ...:     print(model_to_dict(rec), rec.imported_timestamp)

{'id': 533, 'filename': 'exp027f3_01.mzXML', 'checksum': '07954a74918e5a4ec9e8b08036e2f6a477922fa1', 'file_location': <FieldFile: archive_files/2024-05/ms_data/exp027f3_01.mzXML>, 'data_type': 2, 'data_format': 4}
{'id': 534, 'filename': 'exp027f3_02.mzXML', 'checksum': '9e432cbcd868458175e66be384e8ef5a84597b73', 'file_location': <FieldFile: archive_files/2024-05/ms_data/exp027f3_02.mzXML>, 'data_type': 2, 'data_format': 4}
{'id': 535, 'filename': 'exp027f3_03.mzXML', 'checksum': '698a357b3aa8dbc37f15d9bcab2eb8cfbcc503a7', 'file_location': <FieldFile: archive_files/2024-05/ms_data/exp027f3_03.mzXML>, 'data_type': 2, 'data_format': 4}
{'id': 536, 'filename': 'exp027f3_04.mzXML', 'checksum': 'fb62f301b8f0c3b4d50826d5712838bae47b3d16', 'file_location': <FieldFile: archive_files/2024-05/ms_data/exp027f3_04.mzXML>, 'data_type': 2, 'data_format': 4}
{'id': 537, 'filename': 'exp027f3_05.mzXML', 'checksum': 'bb037814a0a364f268dad7e3c80cc71b432b4faf', 'file_location': <FieldFile: archive_files/2024-05/ms_data/exp027f3_05.mzXML>, 'data_type': 2, 'data_format': 4}
{'id': 538, 'filename': 'exp027f3_06.mzXML', 'checksum': '929e58fe5c1f173be17f7e6d1d8768968fc2b2a9', 'file_location': <FieldFile: archive_files/2024-05/ms_data/exp027f3_06.mzXML>, 'data_type': 2, 'data_format': 4}
{'id': 545, 'filename': 'exp027f3_01.mzXML', 'checksum': '3a419269f03a9af4fdf768e2ac1e9bda0f8fc343', 'file_location': <FieldFile: archive_files/2024-05/ms_data/exp027f3_01_J3ITuAA.mzXML>, 'data_type': 2, 'data_format': 4}
{'id': 546, 'filename': 'exp027f3_02.mzXML', 'checksum': 'dda48dba38a7bd62da03ca95f80e53790b3e7601', 'file_location': <FieldFile: archive_files/2024-05/ms_data/exp027f3_02_xc5h6BI.mzXML>, 'data_type': 2, 'data_format': 4}
{'id': 547, 'filename': 'exp027f3_03.mzXML', 'checksum': '63f9555bb4eb8b219e695c3b04dc0157a9913e8c', 'file_location': <FieldFile: archive_files/2024-05/ms_data/exp027f3_03_5faBFCl.mzXML>, 'data_type': 2, 'data_format': 4}
{'id': 548, 'filename': 'exp027f3_04.mzXML', 'checksum': '92cefe34f5e6fe4a5c9cea3fabeeaccf1b35ef48', 'file_location': <FieldFile: archive_files/2024-05/ms_data/exp027f3_04_8e809Ru.mzXML>, 'data_type': 2, 'data_format': 4}
{'id': 549, 'filename': 'exp027f3_05.mzXML', 'checksum': 'a47d0175aa9a084eecf70ceef6d442e2d36cd7af', 'file_location': <FieldFile: archive_files/2024-05/ms_data/exp027f3_05_QHAGaAd.mzXML>, 'data_type': 2, 'data_format': 4}
{'id': 550, 'filename': 'exp027f3_06.mzXML', 'checksum': '5b1343efdacd61213a590d9918cb52aa9eca421a', 'file_location': <FieldFile: archive_files/2024-05/ms_data/exp027f3_06_ukEP2sG.mzXML>, 'data_type': 2, 'data_format': 4}

In [17]: mqs = MSRunSample.objects.filter(ms_data_file__id__in=qs.values_list("id", flat=True))

In [18]: mqs.count()
Out[18]: 12

In [19]: for rec in mqs.all():
    ...:     print(model_to_dict(rec), "Peak Groups:", rec.peak_groups.count())
    ...: 
{'id': 4198, 'msrun_sequence': 47, 'sample': 3802, 'polarity': 'negative', 'mz_min': 646.981750488281, 'mz_max': 656.978088378906, 'ms_raw_file': 179, 'ms_data_file': 521} Peak Groups: 0
{'id': 4204, 'msrun_sequence': 47, 'sample': 3802, 'polarity': 'negative', 'mz_min': 772.991760253906, 'mz_max': 782.974731445313, 'ms_raw_file': 179, 'ms_data_file': 527} Peak Groups: 0
{'id': 4199, 'msrun_sequence': 47, 'sample': 3807, 'polarity': 'negative', 'mz_min': 646.981384277344, 'mz_max': 656.978881835938, 'ms_raw_file': 181, 'ms_data_file': 522} Peak Groups: 0
{'id': 4205, 'msrun_sequence': 47, 'sample': 3807, 'polarity': 'negative', 'mz_min': 772.993591308594, 'mz_max': 782.973083496094, 'ms_raw_file': 181, 'ms_data_file': 528} Peak Groups: 0
{'id': 4200, 'msrun_sequence': 47, 'sample': 3812, 'polarity': 'negative', 'mz_min': 646.981323242188, 'mz_max': 656.976989746094, 'ms_raw_file': 183, 'ms_data_file': 523} Peak Groups: 0
{'id': 4206, 'msrun_sequence': 47, 'sample': 3812, 'polarity': 'negative', 'mz_min': 772.989685058594, 'mz_max': 782.961364746094, 'ms_raw_file': 183, 'ms_data_file': 529} Peak Groups: 0
{'id': 4201, 'msrun_sequence': 47, 'sample': 3817, 'polarity': 'negative', 'mz_min': 646.982788085938, 'mz_max': 656.977172851563, 'ms_raw_file': 185, 'ms_data_file': 524} Peak Groups: 0
{'id': 4207, 'msrun_sequence': 47, 'sample': 3817, 'polarity': 'negative', 'mz_min': 772.986877441406, 'mz_max': 782.968200683594, 'ms_raw_file': 185, 'ms_data_file': 530} Peak Groups: 0
{'id': 4202, 'msrun_sequence': 47, 'sample': 3822, 'polarity': 'negative', 'mz_min': 646.980712890625, 'mz_max': 656.976257324219, 'ms_raw_file': 187, 'ms_data_file': 525} Peak Groups: 0
{'id': 4208, 'msrun_sequence': 47, 'sample': 3822, 'polarity': 'negative', 'mz_min': 772.979675292969, 'mz_max': 782.974365234375, 'ms_raw_file': 187, 'ms_data_file': 531} Peak Groups: 0
{'id': 4203, 'msrun_sequence': 47, 'sample': 3827, 'polarity': 'negative', 'mz_min': 646.981506347656, 'mz_max': 656.975524902344, 'ms_raw_file': 189, 'ms_data_file': 526} Peak Groups: 0
{'id': 4209, 'msrun_sequence': 47, 'sample': 3827, 'polarity': 'negative', 'mz_min': 772.975830078125, 'mz_max': 782.97216796875, 'ms_raw_file': 189, 'ms_data_file': 532} Peak Groups: 0

In [20]: for rec in mqs.all():
    ...:     print(model_to_dict(rec), "Peak Groups:", rec.peak_groups.count())
    ...:     rec.delete()

In [21]: mqs = MSRunSample.objects.filter(ms_data_file__id__in=qs.values_list("id", flat=True))

In [22]: mqs.count()
Out[22]: 0

console:

5d581afd7cd3db7096d7d8b646e2a608  exp027f3_01_J3ITuAA.mzXML
9bf58a7b64d2c688c4eede6ba6de33e4  exp027f3_01.mzXML
5d581afd7cd3db7096d7d8b646e2a608  exp027f3_01_QbRBs5A.mzXML
1c5f311b5d838b5e6b647667074e4615  exp027f3_02.mzXML
d30548e37803bee2fed0ec21b539deb1  exp027f3_02_RtcaGhQ.mzXML
d30548e37803bee2fed0ec21b539deb1  exp027f3_02_xc5h6BI.mzXML
8dafc734c5e86d946c133edde1a59211  exp027f3_03_5faBFCl.mzXML
8dafc734c5e86d946c133edde1a59211  exp027f3_03_90qfGzH.mzXML
6e22b0873cedfa12d028c455b0f4f8c0  exp027f3_03.mzXML
a7d1a3c67767adf56093319ef63ef4ec  exp027f3_04_8e809Ru.mzXML
0861cba6b85861d62577e43c48c1b806  exp027f3_04.mzXML
a7d1a3c67767adf56093319ef63ef4ec  exp027f3_04_PRDjDYJ.mzXML
5b70aa3e978824f5ff03fc67e97187ed  exp027f3_05_IzC4W5i.mzXML
a4b9691d2467dfb2bc3a747948dbdb20  exp027f3_05.mzXML
5b70aa3e978824f5ff03fc67e97187ed  exp027f3_05_QHAGaAd.mzXML
7772a025d93cb471d919c1d70d7b3aef  exp027f3_06_hgMI0oo.mzXML
55148d19342b16fd0a5b5fa52a13137b  exp027f3_06.mzXML
7772a025d93cb471d919c1d70d7b3aef  exp027f3_06_ukEP2sG.mzXML

-rw-r--r--. 1 tracebase tracebase 823226 May 29 12:31 exp027f3_01_J3ITuAA.mzXML
-rw-r--r--. 1 tracebase tracebase 725829 May 29 12:24 exp027f3_01.mzXML
-rw-r--r--. 1 tracebase tracebase 823226 May 29 12:30 exp027f3_01_QbRBs5A.mzXML
-rw-r--r--. 1 tracebase tracebase 763276 May 29 12:24 exp027f3_02.mzXML
-rw-r--r--. 1 tracebase tracebase 846200 May 29 12:30 exp027f3_02_RtcaGhQ.mzXML
-rw-r--r--. 1 tracebase tracebase 846200 May 29 12:31 exp027f3_02_xc5h6BI.mzXML
-rw-r--r--. 1 tracebase tracebase 829891 May 29 12:31 exp027f3_03_5faBFCl.mzXML
-rw-r--r--. 1 tracebase tracebase 829891 May 29 12:30 exp027f3_03_90qfGzH.mzXML
-rw-r--r--. 1 tracebase tracebase 738030 May 29 12:24 exp027f3_03.mzXML
-rw-r--r--. 1 tracebase tracebase 762337 May 29 12:31 exp027f3_04_8e809Ru.mzXML
-rw-r--r--. 1 tracebase tracebase 684061 May 29 12:24 exp027f3_04.mzXML
-rw-r--r--. 1 tracebase tracebase 762337 May 29 12:31 exp027f3_04_PRDjDYJ.mzXML
-rw-r--r--. 1 tracebase tracebase 849755 May 29 12:31 exp027f3_05_IzC4W5i.mzXML
-rw-r--r--. 1 tracebase tracebase 751002 May 29 12:24 exp027f3_05.mzXML
-rw-r--r--. 1 tracebase tracebase 849755 May 29 12:31 exp027f3_05_QHAGaAd.mzXML
-rw-r--r--. 1 tracebase tracebase 814785 May 29 12:31 exp027f3_06_hgMI0oo.mzXML
-rw-r--r--. 1 tracebase tracebase 735417 May 29 12:24 exp027f3_06.mzXML
-rw-r--r--. 1 tracebase tracebase 814785 May 29 12:31 exp027f3_06_ukEP2sG.mzXML

ISSUE OWNER SECTION

Assumptions

  1. List of assumptions that the code will not explicitly address/check
  2. E.g. We will assume input is correct (explaining why there is no validation)

Limitations

  1. A list of things this work will specifically not do
  2. E.g. This feature will only handle the most frequent use case X

Affected Components

Requirements

DESIGN

Interface Change description

None provided

Code Change Description

None provided

Tests

hepcat72 commented 4 months ago

As I was working on this, I realized that this strategy of heading off the ArchiveFile creation by trying to front-load the code with anything that can go wrong so as to not create files on the file system is not a scalable strategy. Any exception after the creation of an ArchiveFile record, and subsequent rollback, will cause files to be left behind.

I did improve the msrun_sequence code, but the best solution IMO is to use a ~post_delete~ transaction.on_commit signal/method in the ArchiveFile model class.

lparsons commented 3 months ago

Just noting a specific case. If an mzXML file is specified on the command line and sample record does not exist, the archive file is copied over to the archive directory.

AggregatedErrors Summary (2 errors / 0 warnings):
        EXCEPTION1(ERROR): RecordDoesNotExist: Sample record matching the mzXML file's basename [blank1] does not exist.  Please identify the associated sample and add a row with it, the matching mzXML file name(s), and the Sequence Name to the Peak Annotation Details sheet/file.
        EXCEPTION2(ERROR): RecordDoesNotExist: Sample record matching the mzXML file's basename [blank2] does not exist.  Please identify the associated sample and add a row with it, the matching mzXML file name(s), and the Sequence Name to the Peak Annotation Details sheet/file.
Traceback (most recent call last):
  File "/var/www/tracebase/manage.py", line 22, in <module>
    main()
  File "/var/www/tracebase/manage.py", line 18, in main
    execute_from_command_line(sys.argv)
  File "/usr/local/tracebase/lib/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/usr/local/tracebase/lib/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/tracebase/lib/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/tracebase/lib/python3.9/site-packages/django/core/management/base.py", line 458, in execute
    output = self.handle(*args, **options)
  File "/var/www/tracebase/DataRepo/management/commands/load_table.py", line 398, in handle_wrapper
    raise self.saved_aes
  File "/var/www/tracebase/DataRepo/management/commands/load_table.py", line 382, in handle_wrapper
    retval = fn(self, *args, **options)
  File "/var/www/tracebase/DataRepo/management/commands/load_msruns.py", line 136, in handle
    self.load_data()
  File "/var/www/tracebase/DataRepo/management/commands/load_table.py", line 423, in load_data
    return self.loader.load_data(*args, **kwargs)
  File "/var/www/tracebase/DataRepo/loaders/table_loader.py", line 2015, in load_wrapper
    raise self.aggregated_errors_object
DataRepo.utils.exceptions.AggregatedErrors: 2 exceptions occurred, including type(s): [RecordDoesNotExist].
AggregatedErrors Summary (2 errors / 0 warnings):
        EXCEPTION1(ERROR): RecordDoesNotExist: Sample record matching the mzXML file's basename [blank1] does not exist.  Please identify the associated sample and add a row with it, the matching mzXML file name(s), and the Sequence Name to the Peak Annotation Details sheet/file.
        EXCEPTION2(ERROR): RecordDoesNotExist: Sample record matching the mzXML file's basename [blank2] does not exist.  Please identify the associated sample and add a row with it, the matching mzXML file name(s), and the Sequence Name to the Peak Annotation Details sheet/file.
Scroll up to see tracebacks for these exceptions printed as they were encountered.
$  ll /tracebasedev-archive/archive/archive_files/2024-06/ms_data/
total 172800
-rw-r--r--. 1 tracebase tracebase 75844569 Jun  4 16:21 blank1.mzXML
-rw-r--r--. 1 tracebase tracebase 77162873 Jun  4 16:21 blank2.mzXML
-rw-r--r--. 1 tracebase tracebase   729247 Jun  4 16:18 exp027f3_07.mzXML
-rw-r--r--. 1 tracebase tracebase   727344 Jun  4 16:18 exp027f3_08.mzXML
-rw-r--r--. 1 tracebase tracebase   712000 Jun  4 16:18 exp027f3_09.mzXML
-rw-r--r--. 1 tracebase tracebase   709946 Jun  4 16:18 exp027f3_10.mzXML
hepcat72 commented 3 months ago

Just noting a specific case. If an mzXML file is specified on the command line and sample record does not exist, the archive file is copied over to the archive directory.

This is taken care of in #983, which has not yet been merged. You could work off that branch? Branch: msruns_loader_rollback_cleanup.