MannLabs / alphapept

A modular, python-based framework for mass spectrometry. Powered by nbdev.
https://mannlabs.github.io/alphapept/
Apache License 2.0

Documentation: list of the separate alphapept.interface functions needed to run the complete workflow #496

Closed eliseneedham closed 2 years ago

eliseneedham commented 2 years ago

Hi,

Thanks for creating AlphaPept!

I'd like to run the modules separately as functions rather than through alphapept.interface.run_complete_workflow(), as I may create an alternate version of one of the modules. I thought this might be simpler than running each of the Python scripts, but no worries if that is how I will need to do it.

I just can't figure out which alphapept.interface functions to run to replicate the workflow, i.e. I can't find all of the functions that match each module (outlined in the python scripts and jupyter notebook). For example, there is alphapept.interface.feature_finding() which I can see links up with the feature_finding.py module, but I am not sure which function to use for the chem.py module.

Could you please add a list of the separate functions that are needed to replicate the entire workflow and their order to the documentation?

Sorry if this information is already published and I have missed it.

Best wishes, Elise

straussmaximilian commented 2 years ago

Hi Elise,

The interface is centered around the settings, i.e., a dictionary with the parameters for your run. run_complete_workflow() runs the different steps defined here. Conceptually, each step writes to an HDF container, so we can call one function after another, and you could inject your customized workflow by defining a step. If you want, I could prepare a little notebook to showcase this. Could you elaborate a bit more on what you would like to achieve so I could come up with a meaningful example?
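To make the idea of settings-driven steps concrete, here is a minimal sketch. It does not use AlphaPept's actual API: the step functions, the `workflow` settings key, and the in-memory `container` dictionary (standing in for the HDF container) are all hypothetical and only illustrate how one step could be swapped for a custom one.

```python
from typing import Callable, Dict

# Hypothetical stand-in for the HDF container that each step writes to.
Container = Dict[str, object]
Step = Callable[[dict, Container], None]


def feature_finding(settings: dict, container: Container) -> None:
    # Placeholder: a real step would detect features and store a table.
    container["feature_table"] = [{"mz": 500.25, "rt_apex": 12.3}]


def default_alignment(settings: dict, container: Container) -> None:
    # Placeholder for a built-in retention time alignment.
    container["alignment"] = "default"


def custom_alignment(settings: dict, container: Container) -> None:
    # Placeholder for a user-supplied replacement step.
    container["alignment"] = "custom"


def run_steps(settings: dict, steps: Dict[str, Step]) -> Container:
    """Run the listed steps in order; each one reads and writes the container."""
    container: Container = {}
    for name in settings["workflow"]:
        steps[name](settings, container)
    return container


settings = {"workflow": ["feature_finding", "alignment"]}
steps = {"feature_finding": feature_finding, "alignment": default_alignment}

# Injecting a customized step only means swapping one entry in the registry.
steps["alignment"] = custom_alignment
result = run_steps(settings, steps)
print(result["alignment"])
```

Because every step shares the same signature and only communicates through the container, replacing one step leaves the others untouched.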

eliseneedham commented 2 years ago

Hi Maximilian,

Thanks for getting back so quickly!

I am hoping to replace the retention time alignment module with an approach that can handle files containing only MS1 scans (not ideal, but I have been given a dataset with hundreds of MS1-only files to match into a library with MS1 and MS2). I was looking to try using the pyopenms function MapAlignmentAlgorithmPoseClustering.

Please let me know if you need any other info.

Thank you!

Best wishes, Elise

straussmaximilian commented 2 years ago

Hi, so I checked a bit more in-depth. In principle, you could do the following:

One potential challenge could be how to load the data into pyopenms. I found a tutorial here. The main question is how to get the data in there, as pyopenms seems to rely on the XML format. Within AlphaPept, most of the data is stored in a tabular format. So you could try loading the XML format or maybe even re-implement the alignment algorithm.
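As a rough sketch of getting tabular data into pyopenms without going through XML, the rows of a feature table could be copied into a `FeatureMap` one feature at a time. The column names `mz`, `rt_apex`, and `charge`, and the assumption that retention times are stored in minutes, are illustrative guesses rather than confirmed AlphaPept output. The pyopenms import is kept inside the function so the row conversion can be checked without pyopenms installed.

```python
from typing import Dict, List


def rows_to_records(
    rows: List[Dict[str, float]], intensity_col: str = "ms1_int_sum_apex"
) -> List[Dict[str, float]]:
    """Normalize tabular rows; RT is converted from (assumed) minutes to seconds."""
    return [
        {
            "rt": row["rt_apex"] * 60.0,
            "mz": row["mz"],
            "intensity": row[intensity_col],
            "charge": int(row["charge"]),
        }
        for row in rows
    ]


def to_feature_map(records: List[Dict[str, float]]):
    """Build a pyopenms FeatureMap from normalized records (requires pyopenms)."""
    import pyopenms as oms

    feature_map = oms.FeatureMap()
    for rec in records:
        feature = oms.Feature()
        feature.setRT(rec["rt"])
        feature.setMZ(rec["mz"])
        feature.setIntensity(rec["intensity"])
        feature.setCharge(rec["charge"])
        feature_map.push_back(feature)
    return feature_map


# Hypothetical example row in the assumed column layout.
rows = [{"mz": 500.25, "rt_apex": 2.0, "charge": 2, "ms1_int_sum_apex": 1.0e6}]
records = rows_to_records(rows)
```

A map built this way for each file could then be handed to the alignment algorithm in the same way the tutorial does for maps loaded from XML.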

eliseneedham commented 2 years ago

Thanks! That advice is very helpful.

Since the naming conventions for the features are slightly different between the PyOpenMS input and the AlphaPept output (from the search without alignment, matching and LFQ quantification you suggested), I'd just like to check with you that I am using the right input from AlphaPept. Specifically, are there equivalent datasets from the feature_table hdf group to match:

  1. MZstart
  2. MZend
  3. quality
  4. intensity

If it is easier to conceptualise, I have attached an excel sheet comparing the columns for the PyOpenMS input with the datasets in the AlphaPept output.

Thank you!

Best wishes, Elise

AlphaPept_input_for_PyOpenMS.xlsx

straussmaximilian commented 2 years ago

Hi, partly.

1 + 2: For Bruker files, we use a feature finder from Bruker. It reports MZ_lower and MZ_upper, which would correspond to MZstart and MZend. For Thermo, we don't report MZstart and MZend. If this would be useful, it would potentially be a small patch to add them to the report.
3: In general, we don't have a quality estimate.
4: The intensity would be ms1_int_[...]; which one is used can be changed depending on the setting you use in the search. The default should be ms1_int_sum_apex. Below is some text from the docs about this:


Lastly, we report three estimates for the intensity:

ms1_int_sum_apex: The intensity at the peak of the summed signal.
ms1_int_sum_area: The area of the summed signal
ms1_int_max_apex: The intensity at the peak of the most intense isotope envelope
ms1_int_max_area: The area of the most intense isotope envelope

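The correspondence described in this reply can be written down as a small lookup table. This is only a sketch: the PyOpenMS-side field names are taken from the question, `MZ_lower` and `MZ_upper` apply to Bruker data only, `None` marks fields without an AlphaPept equivalent, and the `column_mapping` helper itself is hypothetical.

```python
from typing import Dict, Optional

# The intensity estimates listed in the docs excerpt.
INTENSITY_COLUMNS = (
    "ms1_int_sum_apex",
    "ms1_int_sum_area",
    "ms1_int_max_apex",
    "ms1_int_max_area",
)


def column_mapping(
    vendor: str, intensity: str = "ms1_int_sum_apex"
) -> Dict[str, Optional[str]]:
    """Map PyOpenMS input fields to AlphaPept columns; None means not reported."""
    if intensity not in INTENSITY_COLUMNS:
        raise ValueError(f"unknown intensity column: {intensity}")
    is_bruker = vendor.lower() == "bruker"
    return {
        "MZstart": "MZ_lower" if is_bruker else None,
        "MZend": "MZ_upper" if is_bruker else None,
        "quality": None,  # no quality estimate is reported
        "intensity": intensity,
    }


bruker = column_mapping("bruker")
thermo = column_mapping("thermo", intensity="ms1_int_max_area")
```

Fields that map to `None` would need a placeholder value or a patch to the report before they can be passed on.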
straussmaximilian commented 2 years ago

I'm closing this due to inactivity. Feel free to reopen if needed.