Open hepcat72 opened 2 weeks ago
@mneinast - this is the idea I had that I mentioned to you at the retreat poster session. There are some things to work out in its design, but either way, I think this will be a more robust solution than relying on a specific directory structure that's not enforced...
This is really clever! It makes me wonder if this solution could help with a more general file organization effort outside of tracebase. Any time you have large files in a central location and you want the user to definitively tag them with information, this could be a nice way to ingest that info.
sticky parts: 1) where do we put this if the filename doesn't match?
2) how to deal with "empty" mzXML?...this is tricky but low priority since they do not have data.
3) problematic file organization?
FEATURE REQUEST
Inspiration
Issue #1263 implements a rather fragile and limited method of associating mzXML files with MSRunSequence records by checking along the path of each mzXML file for all peak annotation files in a directory on that path. If it finds any, it checks that they are associated with the same sequence, and if so, it creates an MSRunSample record using that sequence.
We don't enforce any directory structure, and there are multiple ways this can otherwise fail. Curators can re-organize the directory structure to accommodate it, but the problem remains that it is labor intensive.
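To make the fragility concrete, here is a rough sketch of that kind of path-based inference. It is not the code from #1263: the function name `inferSequence`, the `sequencesByDir` map (directory to the sequence names of the peak annotation files found in it), and the choice to stop at the nearest ancestor directory that has any peak annotation files are all assumptions made for illustration.

```javascript
const path = require('path');

// Hypothetical helper, not the #1263 implementation.
// sequencesByDir: Map of directory -> array of sequence names, one entry per
// peak annotation file found directly in that directory (assumed input).
function inferSequence(mzxmlPath, sequencesByDir) {
  let dir = path.posix.dirname(mzxmlPath);
  while (true) {
    const sequences = sequencesByDir.get(dir);
    if (sequences && sequences.length > 0) {
      // The association is only usable when every peak annotation file
      // in this directory points at the same sequence.
      const unique = [...new Set(sequences)];
      return unique.length === 1 ? unique[0] : null;
    }
    const parent = path.posix.dirname(dir);
    if (parent === dir) {
      return null; // walked to the top without finding a peak annotation file
    }
    dir = parent;
  }
}

// Example layout (made up): one clean directory, one with mixed sequences.
const sequencesByDir = new Map([
  ['study/seqA', ['seq1', 'seq1']],
  ['study/mixed', ['seq1', 'seq2']],
]);
console.log(inferSequence('study/seqA/neg/s1.mzXML', sequencesByDir));
console.log(inferSequence('study/mixed/s2.mzXML', sequencesByDir));
console.log(inferSequence('elsewhere/s3.mzXML', sequencesByDir));
```

Only the first call resolves to a sequence. The other two show the failure modes: mixed sequences in one directory, and an mzXML file that simply sits outside any directory holding peak annotation files.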
Last Tuesday, I was speaking with Michael at the retreat and I asked him if users had access to the mzXML files during the compilation of the study doc, and he said they did, so I suggested the following way to make the association during the submission building process...
Description
Since a web form doesn't give us access to the directory paths needed to correctly fill out the Peak Annotation Details sheet, we can use JavaScript to compute the md5 checksum of each mzXML file and submit that along with the filename (without submitting the actual file, which would be too much for a form like this).
Given the mzXML file, we can recompute the checksum and associate it with a sequence.
I'm on the fence about whether to record this in the Peak Annotation Details sheet, or create a separate sheet. Either way, this info should be hidden (whether it's a column or a sheet).
We could probably best put it in the Peak Annotation Details sheet, because we created the sample name and we know what peak annotation file goes with it.
The sticky parts are...
mzXML to the correct row (/peak annotation file) and the user wouldn't be able to easily drag the right files to the form (which would involve lots of clicking and dragging):
I'm not sure yet how to deal with the sticky parts, but if all else fails (since we don't actually need that for the "bucket of files" associated with a sample), we can just create extra rows for just the mzXML files and the sequence.
Alternatives
Instead of computing the md5 of the whole mzXML file, we could possibly parse just the first 10 or 20 lines to get the checksum of the raw file that is recorded in the header. It just depends on which is faster (md5 or parsing).
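A rough sketch of both fingerprinting options, plus the checksum lookup described above, might look like the following. It is written for Node so it can be run as-is; in the browser form, `crypto.createHash` is not available, and as far as I know the Web Crypto `SubtleCrypto.digest` API does not offer MD5, so the form would need a JavaScript MD5 library or a different hash. The function names and the `{ filename, md5, sequence }` row shape are hypothetical. The header parse assumes the raw file's hash appears as a `fileSha1` attribute on a `parentFile` element near the top of the mzXML (which would make it a SHA-1 rather than an md5); that should be checked against real files.

```javascript
const crypto = require('crypto');

// Option 1: md5 of the full file content. This is the value the form would
// submit together with the filename.
function md5Of(content) {
  return crypto.createHash('md5').update(content).digest('hex');
}

// Option 2: read only the first few lines and pull the raw file's hash out
// of the header. Assumes a <parentFile ... fileSha1="..."/> element.
function parentFileHash(content, maxLines = 20) {
  const head = content.split('\n', maxLines).join('\n');
  const match = head.match(/<parentFile\b[^>]*\bfileSha1="([0-9a-fA-F]+)"/);
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical rows as they might be stored (hidden) in the study doc.
function buildChecksumIndex(rows) {
  return new Map(rows.map((row) => [row.md5, row]));
}

// At load time: recompute the checksum and look up the sequence.
// Returns null when the file was never registered through the form.
function sequenceFor(content, index) {
  const row = index.get(md5Of(content));
  return row ? row.sequence : null;
}

// Made-up mzXML header, for illustration only.
const sample = [
  '<?xml version="1.0"?>',
  '<mzXML>',
  '<msRun>',
  '<parentFile fileName="s1.raw" fileType="RAWData" fileSha1="ABCDEF0123"/>',
  '</msRun>',
  '</mzXML>',
].join('\n');

const index = buildChecksumIndex([
  { filename: 's1.mzXML', md5: md5Of(sample), sequence: 'seq1' },
]);
console.log(sequenceFor(sample, index));
console.log(parentFileHash(sample));
```

Because the lookup is keyed on file content rather than on the path, it keeps working if the files are moved or renamed. One thing this sketch leaves open is the "empty" mzXML case: files with identical content would share a checksum, so the filename may still be needed to tell them apart.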
Dependencies
None
Comment
None
ISSUE OWNER SECTION
Assumptions
Limitations
Affected Components
Requirements
DESIGN
Interface Change description
None provided
Code Change Description
None provided
Tests