Document process to load studies with mzXML files

lparsons commented 5 months ago

FEATURE REQUEST

Inspiration

mzXML files are too large to place into a GitHub repository, as we have done with previous study data. They are also cumbersome to move around, and loading a study has become more complex.

Description

We need a documented process that multiple people understand and can follow to process and eventually load newly submitted studies that include mzXML files.

hepcat72 commented 5 months ago

Adding documentation is fair, though I would say that it really hasn't changed that much. In fact I would argue that the only additional complexity is "How do I supply the mzXML files?". And that's got a really straightforward answer: the --mzxml-files option or the mzxml_files yaml item.

And there is documentation in the form of the yaml schema and the commands' help output, which honestly, I think should largely suffice, so a perceived lack of documentation shouldn't really hamper any progress. I have mentioned on a few occasions that, with the exception of optionally supplying mzXML files, none of the changes implemented require any different usage of the commands unless you get an error - and the errors pretty much tell you what you need to do. Each command's usage contains sufficient information to add context to the yaml settings, which can inform a user how to supply the files.

Script by script, we have:

load_study_set - The CLI has not changed.
load_study - The CLI has not changed.

load_animals_and_samples - The CLI has a single new optional option, though it is unnecessary unless you get an error. Setting the lcms_file in the yaml will automatically provide it to both the animal/sample script and the accucor script, and again, you will not need to provide that file in the most common cases (unless you get an error).

$ python manage.py load_animals_and_samples -h
...
--lcms-file LCMS_FILE
                    Excel or tab-delimited file containing metadata associated with the liquid chromatography and mass spec instrument run, (e.g. DataRepo/data/tests/small_obob_lcms_metadata/glucose.xlsx). If an excel file is used, it will use the sheet named 'LCMS Metadata' or
                    the first sheet.

The CLI for load_accucor_msruns has 6 new options that are fairly straightforward, but only the --mzxml-files option is necessary to add mzXML files. Each option has a representation in the yaml schema. The mzXML files will get automatically matched to the sample data headers in the accucor file.

$ python manage.py load_accucor_msruns -h
...
--lcms-file LCMS_FILE
                    Filepath of either an xlsx or csv file containing metadata associated with the liquid chromatography and mass spec instrument run.
--mzxml-files [MZXML_FILES ...]
                    Filepaths of mzXML files containing instrument run data.
--lc-protocol-name LC_PROTOCOL_NAME
                    Default LCMethod.name of the liquid chromatography protocol used. Used if --lcms-file is not supplied, or specifies no LC info for a sample.
--instrument INSTRUMENT
                    Default name of the LCMS instrument that analyzed the samples. Used if --lcms-file is not supplied, or specifies no instrument for a sample.
--polarity POLARITY   Default ion mode of the LCMS instrument that analyzed the samples. Used if --lcms-file is not supplied, or specifies no polarity for a sample.
--mz-min MZ_MIN       Default unsigned minimum charge of the MSRun scan range. Only required if a study contains multiple MSRuns with the same polarity. Automatically parsed from mzXML. If unavailable, the minimum medMz value from the accucor/isocorr file is acceptable.
--mz-max MZ_MAX       Default unsigned maximum charge of the MSRun scan range. Only required if a study contains multiple MSRuns with the same polarity. Automatically parsed from mzXML. If unavailable, the maximum medMz value from the accucor/isocorr file is acceptable.
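Since the notes in this thread suggest asking for all mzXML files in one directory, here is a minimal sketch of collecting them into arguments for the --mzxml-files option. The helper name build_mzxml_args and the directory layout are assumptions for illustration, not part of tracebase. The other arguments load_accucor_msruns needs (such as the accucor file itself) are not shown in this thread, so they are left out.

```python
from pathlib import Path


def build_mzxml_args(mzxml_dir):
    """Return CLI arguments passing every mzXML file in mzxml_dir via --mzxml-files.

    Hypothetical helper: only the --mzxml-files option name comes from the
    load_accucor_msruns help output; everything else is an assumption.
    """
    files = sorted(
        str(p)
        for p in Path(mzxml_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".mzxml"
    )
    if not files:
        raise FileNotFoundError(f"No mzXML files found in {mzxml_dir}")
    # --mzxml-files accepts multiple values, so the paths follow the flag.
    return ["--mzxml-files", *files]
```

The returned list can be appended to whatever command line is already used for load_accucor_msruns, for example when building a subprocess call in a staging script.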

The other loaders simply had their input file option names changed to --infile.

Let me know what additional information I can provide.

lparsons commented 5 months ago

Thanks for the info on the options. That will be helpful when putting together the documentation. To close this issue, we need to document a process for handling incoming submissions, staging the data, communicating with the researcher, and finally loading into production. Basically, updating our internal docs here: https://nplcadmindocs.princeton.edu/index.php/TraceBase#Processing_TraceBase_Study_Submissions

lparsons commented 5 months ago

Putting some notes here:

Ideal to ask for sample sheet, a single accucor file, and all mzXML files in one directory.
Download files to staging area on tracebase-dev
Load sample sheet, fix errors, reload until success
Use dry-run to load accucor file and mzXML files (should not be copied during dry run), fix errors, repeat until all issues resolved
Compile YAML file with necessary options, test loading in dev (can this be done after it's already loaded?), mark as ready to load
Use YAML file to load into production from staging area, mark as loaded
Script to clean up marked directories and notify about old ones (over 1 week?)
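A rough sketch of what the cleanup/notify script from the last note might look like. Everything here is an assumption: the thread does not say how directories get "marked", so this uses a hypothetical marker file named LOADED, treats each study as one subdirectory of the staging area, and uses the directory's modification time with a one-week threshold.

```python
import time
from pathlib import Path

LOADED_MARKER = "LOADED"  # hypothetical marker file name, not from tracebase
MAX_AGE_SECONDS = 7 * 24 * 60 * 60  # "over 1 week?" from the notes above


def classify_staging_dirs(staging_root, now=None, max_age=MAX_AGE_SECONDS):
    """Split study directories into (to_clean, to_notify).

    to_clean:  directories containing the LOADED marker file.
    to_notify: unmarked directories last modified more than max_age seconds ago.
    """
    now = time.time() if now is None else now
    to_clean, to_notify = [], []
    for d in sorted(Path(staging_root).iterdir()):
        if not d.is_dir():
            continue
        if (d / LOADED_MARKER).exists():
            to_clean.append(d)
        elif now - d.stat().st_mtime > max_age:
            to_notify.append(d)
    return to_clean, to_notify
```

The function only reports what it finds. Actually deleting directories or sending notifications is deliberately left to the caller so the listing can be reviewed first.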

lparsons commented 5 months ago

Update process ready for review/testing at https://nplcadmindocs.princeton.edu/index.php/TraceBase#Processing_TraceBase_Study_Submissions
