JavaScript drag-and-drop interface for mzXML files that computes the MD5 checksum and populates a sheet

hepcat72 commented 1 month ago

FEATURE REQUEST

Inspiration

Issue #1263 implements a rather fragile and limited method of associating mzXML files with MSRunSequence records by checking along the path of each mzXML file for all peak annotation files in a directory on that path. If it finds any, it checks that they are associated with the same sequence, and if so, it creates an MSRunSample record using that sequence.

We don't enforce any directory structure, and there are multiple ways this can otherwise fail. Curators can re-organize the directory structure to accommodate it, but the problem remains that it is labor intensive.

Last Tuesday, I was speaking with Michael at the retreat and I asked him if users had access to the mzXML files during the compilation of the study doc, and he said they did, so I suggested the following way to make the association during the submission building process...

Description

Since a web form doesn't give us access to directory paths, we can't rely on paths to correctly fill out the Peak Annotation Details sheet. Instead, we can use JavaScript to compute each mzXML file's MD5 checksum and submit that along with the filename (without submitting the actual file, which would be too much for a form like this).

When on the submission start page, each form row for a peak annotation file can have a drop-area for mzXML files (associated with the sequence metadata they entered).
Users can drag and drop a bolus of files that are from the annotated sequence.
Instead of attaching the file, a JavaScript method can calculate the MD5 checksum for each file and record that (with the filename) for submission.
Upon submission, we can autofill a sheet with this information so that, at load time, we can recompute the checksum of every mzXML file and associate it with a sequence.
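
The steps above could be sketched roughly as follows. This is a minimal sketch under assumptions, not the tracebase implementation: `md5Hex` and `buildChecksumRecords` are hypothetical names, and the example uses Node's built-in `crypto` module so that it runs standalone. In a browser, the bytes would come from each dropped `File` (for example via `file.arrayBuffer()` in a `drop` handler), and since the Web Crypto `digest` method does not offer MD5, a JavaScript MD5 library would have to stand in for `crypto.createHash`.

```javascript
const crypto = require('crypto');

// Compute the hex MD5 digest of a file's bytes (Buffer or Uint8Array).
function md5Hex(bytes) {
  return crypto.createHash('md5').update(bytes).digest('hex');
}

// Hypothetical helper: turn dropped files into the rows we would submit.
// Each input is {name, bytes}; only the name and checksum are recorded,
// never the file content itself.
function buildChecksumRecords(files, peakAnnotationFile) {
  return files.map(({ name, bytes }) => ({
    peakAnnotationFile,
    mzxmlName: name,
    md5: md5Hex(bytes),
  }));
}

// Usage: one fake "file" dropped on the row for accucor_pos.xlsx.
const records = buildChecksumRecords(
  [{ name: 'sample1.mzXML', bytes: Buffer.from('abc') }],
  'accucor_pos.xlsx'
);
console.log(records);
```

For large mzXML files, the hash could be fed in chunks (the `update` method can be called repeatedly) rather than holding a whole file in memory.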

I'm on the fence about whether to record this in the Peak Annotation Details sheet, or create a separate sheet. Either way, this info should be hidden (whether it's a column or a sheet).

We could probably best put it in the Peak Annotation Details sheet, because we created the sample name and we know what peak annotation file goes with it.

The sticky parts are...

Where do we put this data if the file name doesn't match the sample name...? I guess we can create extra rows.
What about these "empty" mzXML files? There can be multiple files with the same name due to those "empty" files with no scan tags. Maybe I can filter those out using javascript...
If the user has organized the files like this, we won't be able to assign the correct mzXML to the correct row(/peak annotation file) and the user wouldn't be able to easily drag the right files to the form (which would involve lots of clicking and dragging):
```
study/
    accucor_pos.xlsx
    accucor_neg_low.xlsx
    accucor_neg_high.xlsx
    mzxmls/
        sample1/
            sample1.mzXML (pos)
            scan2/
                sample1.mzXML (neg low)
            scan3/
                sample1.mzXML (neg high)
        sample2/
            ...
```

I'm not sure yet how to deal with the sticky parts, but if all else fails (since we don't actually need that for the "bucket of files" associated with a sample), we can just create extra rows for just the mzXML files and the sequence.
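
Two of the sticky parts (the "empty" files and the same-named files) could be detected in the browser before anything is recorded. The following is a rough sketch under assumptions: `hasScans` and `findAmbiguousNames` are hypothetical names, each file is represented as a plain `{name, text}` object, and "empty" is taken to mean "contains no `<scan ...>` element". Scanning the full text of a large file may be too slow in practice.

```javascript
// Treat a file as "empty" when its text contains no <scan ...> element.
// The [\s>] guard avoids matching other tags that merely start with "scan".
function hasScans(mzxmlText) {
  return /<scan[\s>]/.test(mzxmlText);
}

// Group the non-empty files by name and return the names that occur more
// than once, i.e. the ones we cannot assign to a row by name alone.
function findAmbiguousNames(files) {
  const byName = new Map();
  for (const file of files) {
    if (!hasScans(file.text)) continue; // skip "empty" files
    const group = byName.get(file.name) || [];
    group.push(file);
    byName.set(file.name, group);
  }
  return [...byName]
    .filter(([, group]) => group.length > 1)
    .map(([name]) => name);
}

// Usage: sample1 has one real file and one empty one; sample2 has two real files.
const dropped = [
  { name: 'sample1.mzXML', text: '<mzXML><msRun><scan num="1"></scan></msRun></mzXML>' },
  { name: 'sample1.mzXML', text: '<mzXML><msRun></msRun></mzXML>' },
  { name: 'sample2.mzXML', text: '<mzXML><msRun><scan num="1"></scan></msRun></mzXML>' },
  { name: 'sample2.mzXML', text: '<mzXML><msRun><scan num="7"></scan></msRun></mzXML>' },
];
console.log(findAmbiguousNames(dropped));
```

Names that come back from `findAmbiguousNames` are the cases where extra rows (or a prompt to the user) would still be needed.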

Alternatives

Instead of computing the md5 of the whole mzXML file, we could possibly parse just the first 10 or 20 lines to get the md5 of the raw file. It just depends on which is faster (md5 or parsing).
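
A header-parsing version of this alternative might look like the sketch below. It is only an illustration under stated assumptions: `rawFileChecksum` is a hypothetical name, and the sketch assumes the header contains a `parentFile` element that carries the raw file's checksum in an attribute. The default attribute name `fileSha1` is an assumption that should be checked against real files (it would mean the stored value is a SHA-1 rather than an md5). In a browser, `file.slice(0, n)` could be used to read only the first bytes of each file.

```javascript
// Return the value of `attr` on the first <parentFile> element found within
// the first `maxLines` lines of the header text, or null if there is none.
// ASSUMPTION: the attribute name ('fileSha1' by default) must be verified.
function rawFileChecksum(headerText, attr = 'fileSha1', maxLines = 20) {
  const head = headerText.split(/\r?\n/).slice(0, maxLines).join('\n');
  const tag = head.match(/<parentFile\b[^>]*>/);
  if (!tag) return null;
  const value = tag[0].match(new RegExp(`\\b${attr}="([^"]*)"`));
  return value ? value[1] : null;
}

// Usage with a made-up header (the checksum value is a placeholder).
const header = [
  '<?xml version="1.0" encoding="ISO-8859-1"?>',
  '<mzXML>',
  ' <msRun scanCount="0">',
  '  <parentFile fileName="sample1.raw" fileType="RAWData" fileSha1="0123abcd"/>',
].join('\n');
console.log(rawFileChecksum(header));
```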

Dependencies

None

Comment

None

ISSUE OWNER SECTION

Assumptions

List of assumptions that the code will not explicitly address/check
E.g. We will assume input is correct (explaining why there is no validation)

Limitations

A list of things this work will specifically not do
E.g. This feature will only handle the most frequent use case X

Affected Components

change: File path or DB table ...
add: Environment variable or server setting
delete: External executable or cron job

Requirements

[ ] 1. List of numbered conditions to be met for the feature
[ ] 2. E.g. Every column/row must display a value, i.e. cannot be empty
[ ] 3. Numbers for reference & checkboxes for progress tracking

DESIGN

Interface Change description

None provided

Code Change Description

None provided

Tests

[ ] 1. A description of at least one test for each requirement above.
[ ] 2. E.g. Test for req 2 that there's an exception when display value is ''
[ ] 3. Numbers for reference & checkboxes for progress tracking

hepcat72 commented 1 month ago

@mneinast - this is the idea I had that I mentioned to you at the retreat poster session. There are some things to work out in its design, but either way, I think this will be a more robust solution than relying on a specific directory structure that's not enforced...

mneinast commented 1 month ago

This is really clever! It makes me wonder if this solution could help with a more general file organization effort outside of tracebase. Any time you have large files in a central location and you want the user to definitively tag them with information, this could be a nice way to ingest that info.

sticky parts: 1) where do we put this if the filename doesn't match?

could this be a sign of a mistake / could we give an error?

2) how to deal with "empty" mzXML?...this is tricky but low priority since they do not have data.

3) problematic file organization?

this is how I've organized a lot of my files in the past, and I suspect it's pretty common. I guess there isn't much difference between reorganizing them on msdata and doing the drag/drop trick? One advantage could be that we allow the user to actively make this connection while they are submitting the data, rather than waiting on a dev to load the study to confirm the connection. It's hard to say how much clicking / how annoying this would be without testing it out...will need to consider / discuss more.
