Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License

Javascript drag and drop interface for mzXML files that creates the md5checksum and populates a sheet #1267

Open hepcat72 opened 1 month ago

hepcat72 commented 1 month ago

FEATURE REQUEST

Inspiration

Issue #1263 implements a rather fragile and limited method of associating mzXML files with MSRunSequence records by checking along the path of each mzXML file for all peak annotation files in a directory on that path. If it finds any, it checks that they are associated with the same sequence, and if so, it creates an MSRunSample record using that sequence.

We don't enforce any directory structure, and there are multiple ways this can otherwise fail. Curators can re-organize the directory structure to accommodate it, but the problem remains that it is labor intensive.

Last Tuesday, I was speaking with Michael at the retreat and I asked him if users had access to the mzXML files during the compilation of the study doc, and he said they did, so I suggested the following way to make the association during the submission building process...

Description

Since a web form doesn't give us access to the directory paths we would need to correctly fill out the Peak Annotation Details sheet, we can instead use javascript to compute each mzXML file's md5 checksum and submit that along with the filename (without submitting the actual file, which would be too much for a form like this).

  1. When on the submission start page, each form row for a peak annotation file can have a drop-area for mzXML files (associated with the sequence metadata they entered).
  2. Users can drag and drop a bolus of files that are from the annotated sequence.
  3. Instead of attaching the file, a javascript method can calculate the md5 checksum for each file and record that (with the filename) for submission.
  4. Upon submission, we can autofill a sheet with this information so that, at load time, we can recompute the checksum of every mzXML file and use it to look up the associated sequence.
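
Steps 2 and 3 could be sketched roughly as follows. This is only an illustration: `md5Hex` and `checksumRecords` are hypothetical names, and the record shape (`filename`, `md5`) is an assumption. Note that the browser's Web Crypto `crypto.subtle.digest` does not provide MD5, so the page itself would need a third-party MD5 library; the sketch uses Node's built-in `crypto` module only so that it runs as-is.

```javascript
const { createHash } = require("node:crypto");

// Hex MD5 digest of an ArrayBuffer or Uint8Array.
function md5Hex(bytes) {
  return createHash("md5").update(Buffer.from(bytes)).digest("hex");
}

// Build one {filename, md5} record per dropped file. Only the name and the
// checksum are kept; the file content itself is never submitted.
// `files` is any iterable of File-like objects (a `name` and `arrayBuffer()`).
async function checksumRecords(files) {
  const records = [];
  for (const file of files) {
    const bytes = await file.arrayBuffer();
    records.push({ filename: file.name, md5: md5Hex(bytes) });
  }
  return records;
}
```

In a drop handler, `checksumRecords(event.dataTransfer.files)` would be the natural call, with the resulting records written into hidden form fields for the corresponding peak annotation row.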

I'm on the fence about whether to record this in the Peak Annotation Details sheet, or create a separate sheet. Either way, this info should be hidden (whether it's a column or a sheet).

We could probably best put it in the Peak Annotation Details sheet, because we created the sample name and we know what peak annotation file goes with it.

The sticky parts are...

  1. Where do we put this data if the file name doesn't match the sample name...? I guess we can create extra rows.
  2. What about these "empty" mzXML files? There can be multiple files with the same name due to those "empty" files with no scan tags. Maybe I can filter those out using javascript...
  3. If the user has organized the files as shown below, we won't be able to assign each mzXML file to the correct row (i.e. the correct peak annotation file), and the user wouldn't be able to easily drag the right files onto the form (it would involve lots of clicking and dragging):
    study/
    accucor_pos.xlsx
    accucor_neg_low.xlsx
    accucor_neg_high.xlsx
    mzxmls/
        sample1/
            sample1.mzXML (pos)
            scan2/
                sample1.mzXML (neg low)
            scan3/
                sample1.mzXML (neg high)
        sample2/
            ...
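
For sticky part 2, one possible javascript filter is sketched below. It assumes an "empty" file is simply one whose text contains no `<scan` element at all; `hasScanTags` and `partitionByScans` are hypothetical names, and each entry is assumed to be a plain object with `filename` and `text`.

```javascript
// True if the mzXML text contains at least one <scan ...> element.
// The pattern requires whitespace or '>' right after "scan", so elements
// such as <scanOrigin> are not counted as scans.
function hasScanTags(mzxmlText) {
  return /<scan[\s>]/.test(mzxmlText);
}

// Split {filename, text} entries into files with scans and "empty" files.
function partitionByScans(entries) {
  const withScans = [];
  const empty = [];
  for (const entry of entries) {
    (hasScanTags(entry.text) ? withScans : empty).push(entry);
  }
  return { withScans, empty };
}
```

Reading each file fully as text may be slow for large mzXML files, so a real implementation might stop at the first match or inspect only part of the file.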

I'm not sure yet how to deal with the sticky parts, but if all else fails (since we don't actually need that for the "bucket of files" associated with a sample), we can just create extra rows for just the mzXML files and the sequence.
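
The "extra rows" fallback might look something like this. It is a sketch under several assumptions: sheet rows are plain objects with hypothetical `sample`, `sequence`, `mzxml`, and `md5` fields, and a file is taken to match a sample when its name minus the `.mzXML` extension equals the sample name. `fillRows` is a hypothetical name.

```javascript
// Attach checksum records to rows by sample name. Files that match no
// sample, or whose sample row already has an mzXML, become extra rows that
// carry only the mzXML info and the sequence.
function fillRows(sampleRows, records, sequence) {
  const rows = sampleRows.map((row) => ({ ...row }));
  const bySample = new Map(rows.map((row) => [row.sample, row]));
  for (const rec of records) {
    const stem = rec.filename.replace(/\.mzxml$/i, "");
    const row = bySample.get(stem);
    if (row && row.mzxml === undefined) {
      row.mzxml = rec.filename;
      row.md5 = rec.md5;
    } else {
      rows.push({ sample: null, sequence, mzxml: rec.filename, md5: rec.md5 });
    }
  }
  return rows;
}
```

With this shape, same-named files (such as the three `sample1.mzXML` files in the layout above) do not overwrite each other: the first fills the sample's row and the rest land in extra rows tied to the sequence.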

Alternatives

Instead of computing the md5 of the entire mzXML file, we could possibly parse just the first 10 or 20 lines to get the checksum of the raw file that is recorded in the header. It just depends on which is faster (computing the md5 or parsing).
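
A sketch of the parsing alternative, assuming the header contains a `parentFile` element with a `fileSha1` attribute (which would make the recorded hash a SHA-1 rather than an md5) and that it falls within the first few kilobytes of the file. `rawFileSha1FromHeader`, `rawFileSha1`, and the `maxBytes` default are hypothetical.

```javascript
// Pull the raw file's recorded hash out of the start of an mzXML file.
// Returns null when no parentFile element with a fileSha1 attribute is found.
function rawFileSha1FromHeader(headerText) {
  const parent = headerText.match(/<parentFile\b[^>]*>/);
  if (!parent) return null;
  const sha1 = parent[0].match(/\bfileSha1="([0-9a-fA-F]+)"/);
  return sha1 ? sha1[1].toLowerCase() : null;
}

// Read only the first `maxBytes` of a File/Blob-like object before parsing,
// so the rest of a large file is never loaded.
async function rawFileSha1(file, maxBytes = 4096) {
  const headerText = await file.slice(0, maxBytes).text();
  return rawFileSha1FromHeader(headerText);
}
```

Because only a small slice is read, this should cost far less than hashing the whole file, but it identifies the raw file rather than the mzXML itself, so two conversions of the same raw file would share a value.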

Dependencies

None

Comment

None


ISSUE OWNER SECTION

Assumptions

  1. List of assumptions that the code will not explicitly address/check
  2. E.g. We will assume input is correct (explaining why there is no validation)

Limitations

  1. A list of things this work will specifically not do
  2. E.g. This feature will only handle the most frequent use case X

Affected Components

Requirements

DESIGN

Interface Change description

None provided

Code Change Description

None provided

Tests

hepcat72 commented 1 month ago

@mneinast - this is the idea I had that I mentioned to you at the retreat poster session. There are some things to work out in its design, but either way, I think this will be a more robust solution than relying on a specific directory structure that's not enforced...

mneinast commented 1 month ago

This is really clever! It makes me wonder if this solution could help with a more general file organization effort outside of tracebase. Any time you have large files in a central location and you want the user to definitively tag them with information, this could be a nice way to ingest that info.

sticky parts: 1) where do we put this if the filename doesn't match?

2) how to deal with "empty" mzXML?...this is tricky but low priority since they do not have data.

3) problematic file organization?