fburic / candia

Canonical Decomposition of Data-Independent-Acquired Spectra
Other
5 stars 2 forks source link

Identification and quantitation per sample #11

Closed MiguelCos closed 3 years ago

MiguelCos commented 3 years ago

Hello Filip,

I wanted to open a different issue for this question.

We are very happy with candia's capabilities so far and would probably be testing it soon with a bigger cohort of samples.

What I find interesting about this spectral decomposition is the ability to run 'classical' searches on the decomposed spectra.

One of the things I am interested in the detection of sequence variants, either by database search or by detecting non-annotated point mutations via xtandem or similar.

Since the candia's output is a single mzXML file, how do you think it would be possible to assign the identification (and therefore, potential quantitation) of a peptide to a particular sample/condition based on this unique mzXML file?

I understand that the quantitation can be performed via DIANN, but probably my ignorance regarding its use (I have not used it extensively) is not allowing me to understand how to differentiate identification between samples after having the decomposed spectra.

Would you have any ideas on how to go from the decomposed candia's spectra into xtandem for the identification of point-mutations not found in the fasta file, and then use this information for differential quatitation between samples?

As always, many thanks for your input.

Best wishes, Miguel

fburic commented 3 years ago

Hi Miguel, I'm very happy you're finding candia useful!

Currently, the single output mzXML from the pipeline does not contain direct information on assigning spectra to input sample scans, but this can be retrieved using the sample mode of the decomposition. The PARAFAC decomposition "extracts" all deconvolved spectra that appear across all input scans. This is the "mass mode". The "sample mode" of the decomposition captures the relative contribution of a spectrum to each input scan (this mode is a single vector per spectrum, containing num_scans values). Thus, the sample mode should record a zero for input samples where the analyte was not present. See the red component in Fig 1B of the paper, for example: it has zero and near-zero abundance in samples 2 and 4, respectively, meaning that analyte would be assumed to be absent. More details in the Supplemental Information (page 18)

In practice, this proved to be too imprecise for high-quality quantification, however. This is why we left the job of finding these deconvolved spectra within each input scan to the classical searching software. But the sample mode should pick up very large differences, including present / absent. We haven't looked at data specifically for present/absent scenarios, but it's worth trying out. Unfortunately, this side of the pipeline is more for development, so requires a little extra effort in reading and matching the sample modes. I wrote the steps for this below, under CANDIA Method.

We stopped using xtandem early in development so I'm not sure exactly how to use it for this purpose, but I think most search software will report the "scan" value (the spectrum unique ID in the mzXML) for a peptide-spectrum match. This can be used to inspect the sample mode (abundance) values for those spectra that are found.

Alternatively, DIA-NN can be used for this as well, with a CANDIA library (produced from the single mzXML outptut and a FASTA), under the same assumption that the library contains all expected entries, but which may not appear in every input sample. So then DIA-NN should simply fail to find e.g. sequence variances in the input sample scans where they're not present.

CANDIA Method

There is a script for extracting the time modes of the best models, though the output requires programmatic access currently. The script is run as:

singularity exec candia.sif python scripts/quantification/collect_sample_modes.py --config ${configfile}

This will write a file called sample_modes_best_models.feather to the experiment directory. The feather format (a table) can be read with the Python pyarrow library or the R arrow library

To match the spectra and sample modes, you can use the spectrum_index.feather table automatically created by CANDIA in an earlier step to track all spectra, across all models. Here is Python code for this, though the R code looks very similar:

In [1]: import pyarrow.feather as feather
In [2]: import pandas as pd

In [3]: sample_modes = feather.read_feather('sample_modes_best_models.feather')
In [4]: sample_modes.head(4)
Out[4]:
   sample_num  comp_num     abundance  model_id  cv_sample_mode
0           0         0  3.575857e-02        81        1.616704
1           1         0  2.327982e-02        81        1.616704
2           2         0  9.809089e-45        81        1.616704
3           3         0  4.203895e-45        81        1.616704

In [5]: spectrum_index = feather.read_feather('spectrum_index.feather')
In [6]: spectrum_index.head(4)
Out[6]:
   swath_start  rt_window  ncomp  model_id  spectrum_num  scan
0        40000          0     10         0             0     0
1        40000          0     10         0             1     1
2        40000          0     10         0             2     2
3        40000          0     10         0             3     3

Note that scan is a globally unique ID given by CANDIA for the sake of downstream analysis software. It's the ID that appears in the mzXML file as <scan>. Some search software also list this ID. The spectrum_num in the spectrum_index.feather is actually just the component number within the model_id for that particular spectrum, so we can make a table join like:

In [14]: spectra_with_samples = pd.merge(sample_modes, spectrum_index, left_on=['model_id', 'comp_num'], right_on=['model_id', 'spectrum_num'])

In [15]: spectra_with_samples.head(4)
Out[15]:
   sample_num  comp_num     abundance  model_id  cv_sample_mode  swath_start  rt_window  ncomp  spectrum_num  scan
0           0         0  3.575857e-02        81        1.616704        40000          0     91             0  4050
1           1         0  2.327982e-02        81        1.616704        40000          0     91             0  4050
2           2         0  9.809089e-45        81        1.616704        40000          0     91             0  4050
3           3         0  4.203895e-45        81        1.616704        40000          0     91             0  4050

Now we have a single table tracing scan (the spectrum ID in the mzXML file) to sample_num and the correspondingabundance. These 3 columns can be saved to a CSV:

spectra_with_samples[['scan', 'sample_num', 'abundance']].to_csv('spectra_with_sample_abundance.csv', index=False)

Hope this helps but I can also look into this more closely sometime next week. Also, I can implement a specific use-case (i.e. not very generic) fairly quickly, if it looks useful for you (like e.g. automating the above and similar).

Best, Filip

MiguelCos commented 3 years ago

Hello Filip,

Wow, truly many thanks for this thorough explanation! It is now quite clear to me that it is possible to use the scan numbers from the PSMs from any search against the mzXML from Candia in order to map peptide IDs to their corresponding samples using the sample modes that can be extracted from within the candia pipeline. I am more experienced in R but this mapping should be easily done there as well.

A quick question(s): at which stage of the pipeline would these sample models be able for extraction? Is this already implemented? I would definitely test it.

I would say it would be awesome if the pipeline automatically throws this spectra_with_samples table. Then going from this to the PSM list with the scan numbers per sample should be even easier. So, yes, I find it would be definitely useful if you have time to implement it.

Best wishes, Miguel

fburic commented 3 years ago

Hi Miguel,

You're very welcome! Yes, exactly! And the R code (at least with dplyr/tidyverse) is even simpler, I just got biased by the source code :)

The sample modes are created already at the decompositions stage (step 5), along with the mass modes, and the time modes (these last ones give the elution profile and time for each deconvolved component, though this is not saved in the mzXML as it turned out to not be needed for searching). All these are stored in the _F*.pt model files generated in step 5. The script I mentioned is used to extract the data from the *.pt files

The sample_index.feather is created in step 6.

Cool, ok! I opened #13 I'll try to find some time this week.

Best wishes, Filip

fburic commented 3 years ago

Hi Miguel,

Sorry for the delay, some urgent matters came up last week.

I have implemented my suggestion above as #13 and documented the new functionality as the new step 8: https://github.com/fburic/candia#8-optional-collect-the-abundance-sample-mode-value-of-decomposed-spectra

I have integrated the CSV export into the collect_sample_modes.py script I mentioned above and added this step to the candia main script, so it will be run as part of the pipeline (the step is marked as "optional" only for step-by-step operation).

See the documentation for running it, but it is the same as above, except no need for manual table joining :) Note the new pipeline config parameter spectra_with_sample_abundance_file that specifies the output CSV file name. I updated the test experiment config file with this as well.

Please let me know if you have any trouble with it. I used some of my old result files to test, since I did not have time to improve the test data in this repo.

Best wishes, Filip

MiguelCos commented 3 years ago

Hello Filip,

Many many thanks for adding this feature!

We will be testing it in the next few days, probably with a bigger cohort and let you know how it goes.

I am closing this issue now but would open it again if there is anything new regarding it.

Best wishes, Miguel