Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License

Match mzXML files that use dashes (`-`) #991

Closed: lparsons closed this issue 1 month ago

lparsons commented 3 months ago

FEATURE REQUEST

Inspiration

It appears that Maven/El-Maven renames samples that include dashes (-) in the filenames and uses underscores (_) in the sample headers. This can be seen in tracebase-dev.princeton.edu:/tracebase-staging/incoming/col013a_perturbative_infusions/

Description

Matching filenames to samples using either dashes (-) or underscores (_) would simplify the loading process.
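
As a rough illustration of the request, dash/underscore-insensitive matching could look something like the sketch below. It is not TraceBase's loader code: the function names `normalize` and `match_headers_to_files` are hypothetical, and it assumes the only difference between a file's basename and its sample header is `-` versus `_`.

```python
from pathlib import Path


def normalize(name):
    """Treat dashes and underscores as equivalent by mapping '-' to '_'."""
    return name.replace("-", "_")


def match_headers_to_files(headers, mzxml_paths):
    """Pair sample headers with mzXML files, ignoring '-' vs '_' differences.

    Hypothetical helper (not part of TraceBase). Returns a tuple of:
      matches   - {header: path} for headers with exactly one candidate file
      unmatched - headers with no candidate file
      ambiguous - {header: [paths]} for headers with several candidate files
    """
    # Index files by their normalized basename (extension removed).
    by_key = {}
    for path in mzxml_paths:
        key = normalize(Path(path).stem)
        by_key.setdefault(key, []).append(path)

    matches, unmatched, ambiguous = {}, [], {}
    for header in headers:
        candidates = by_key.get(normalize(header), [])
        if len(candidates) == 1:
            matches[header] = candidates[0]
        elif candidates:
            # e.g. both "s-3.mzXML" and "s_3.mzXML" exist: do not guess.
            ambiguous[header] = candidates
        else:
            unmatched.append(header)
    return matches, unmatched, ambiguous
```

Reporting ambiguous headers separately, rather than picking one file, keeps the relaxed match from silently pairing a header with the wrong file when two basenames differ only by `-` and `_`.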

Alternatives

Creating a table to match samples to files explicitly is a reasonable workaround to this issue. I'm not sure which solution is preferable.

Dependencies

Comment

We don't necessarily need to change the code to address this problem, but we do need a documented process to handle the issue and get the datasets loaded.


ISSUE OWNER SECTION

Assumptions

  1. List of assumptions that the code will not explicitly address/check
  2. E.g. We will assume input is correct (explaining why there is no validation)

Limitations

  1. A list of things this work will specifically not do
  2. E.g. This feature will only handle the most frequent use case X

Affected Components

Requirements

DESIGN

Interface Change description

None provided

Code Change Description

None provided

Tests

lparsons commented 3 months ago

@hepcat72 Can you help me sort out what additional worksheet, etc. I could create to get these files loaded without having to search/replace underscores with dashes? Or perhaps it would just be simpler to patch the matching code?

hepcat72 commented 3 months ago

Interesting. I think fixing the matching code is a good ultimate solution. Right now, it pairs a header with a file only when the file's basename (without the extension) exactly matches the header. If you know the pairs, you can supply either a TSV or an Excel file (with a sheet named "Peak Annotation Details") that maps sample, header, and mzXML file. When the submission refactor is done, the sheet will be auto-populated (aside from the mzXML), so it's annoying at the moment, but it was written to handle resolving any case, and this is a supported case. When the header doesn't match the file's base name, it has 4 columns: sample name, header name, mzXML file name, and annotation file name. A 5th "sequence name" column might currently be necessary*.

It's probably best to just show you by example. Download this Excel file and look at the "Peak Annotation Details" sheet. That's what you would need. Then you just supply it as the `--infile` argument: `python manage.py msruns_loader --infile ****HERE**** --mzxml-files *.mzXML`.

* I'd like the command line defaults to be more tightly integrated with the "defaults sheet" so that you wouldn't have to enter the sequence name column.
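
For anyone drafting the TSV form of that mapping by hand, a minimal sketch follows. The column header strings in `COLUMNS` are placeholders taken from the description above, not the loader's confirmed spelling, so copy the real headers from the example sheet. The helper name `mapping_to_tsv` and the file names in the usage note are also hypothetical.

```python
import csv
import io

# Placeholder column headers (assumption): replace with the exact headers
# used in the example "Peak Annotation Details" sheet.
COLUMNS = [
    "Sample Name",
    "Header Name",
    "mzXML File Name",
    "Annotation File Name",
]


def mapping_to_tsv(rows):
    """Render (sample, header, mzxml_file, annotation_file) rows as TSV text.

    Hypothetical helper, not part of TraceBase.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for row in rows:
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(row)}: {row!r}")
        writer.writerow(row)
    return buffer.getvalue()
```

For example, `mapping_to_tsv([("s1", "s_1", "s-1.mzXML", "annot.xlsx")])` returns a header line followed by one tab-separated row, which can then be written to a `.tsv` file. If the "sequence name" column turns out to be required, it would need to be added to both `COLUMNS` and each row.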

hepcat72 commented 2 months ago

@lparsons - I was working on this today and was about to write a test for the new code when I noticed that the test mzXML files I already had contained dashes. So I double-checked the conclusion that Maven/El-Maven was swapping dashes for underscores, and found that the accucor files had dashes in the headers as well. I am now questioning whether it is Maven/El-Maven that is doing this. I noted that the example data you linked appears to be isocorr (or isoautocorr?). Could it be that software that did the dash swap?

hepcat72 commented 2 months ago

OK. It does appear to be isocorr that did this, not Maven/El-Maven.

lparsons commented 2 months ago

@hepcat72 I wasn't able to confirm which piece of software changes the sample names, and more than one of them may do so.