Mismatch in peptide and protein data in ProteomicsLFQ mzTab output

jackrogan commented 1 year ago

Description of the bug

Output mzTab has mismatches between the peptide and protein sections of the data. For example, protein intensities listed for study-variable/protein combinations that have no peptide intensities at all, and vice-versa. The MSstats stage of the pipeline works fine, and the output matches the peptide table. I've seen this only in datasets for fractionated samples. I haven't seen it in non-fractionated datasets.

Command used and terminal output

No response

Relevant files

I have an example .mzTab I can share - unfortunately it is too big to upload in a .zip archive here, but I can supply it!

System information

No response

jpfeuffer commented 1 year ago

Yes sure. How big is it? Do you want to send it via email or use a file drop service? Can you add the consensusXML that usually is located in the same output folder and maybe the design? Probably something goes wrong when trying to "translate" into mzTab.

jackrogan commented 1 year ago

I can do either if it's convenient! The mzTab .zip is 36 MB. 104 MB for mzTab and consensusXML.

jackrogan commented 1 year ago

.zip for mzTab/consensusXML:

https://icenitherapeutics-my.sharepoint.com/:u:/g/personal/jack_rogan_stormtherapeutics_com/ESX7Dy9CClFOvAma8UKdZgsBxkf3bXUFAzQbGdMC1YqYrA?e=auSImo

jpfeuffer commented 1 year ago

Perfect.

I was about to say: Do the drop service, so every dev can check it.

jpfeuffer commented 1 year ago

Can you name an example mismatch if you have any?

jackrogan commented 1 year ago

Try EIF1A2 (Q05639) or EIF3F (O00303) - both have a mismatch in the missing data, I think.

hendrikweisser commented 1 year ago

@jackrogan, can you also share the experimental design file, please?

jackrogan commented 1 year ago

Sure thing: 20230831a_JR_IF_Protein_Expression_experimental_design_mzML.txt I changed the extension to .txt to allow it to upload.

hendrikweisser commented 1 year ago

Below is a nice plot that Jack shared with me illustrating the problem. We would expect to see the protein abundances line up with the peptide abundances (samples on the x axis).

ypriverol commented 1 year ago

@hendrikweisser @jackrogan can you check if the same problem is observed when using the Msstats input, which is another output of proteomicsLFQ.

hendrikweisser commented 1 year ago

can you check if the same problem is observed when using the Msstats input

The MSstats input contains only peptide-level abundances, so the problem isn't apparent there. Only in the mzTab.

timosachsenberg commented 1 year ago

picking out: IGYNPATVPFVPISGWHGDNMLEPSPNMPWFK and posting some notes mainly for myself...

According to the PSM section is it found in fraction 4 of the first three study variables corresponding to files

PSM     IGYNPATVPFVPISGWHGDNMLEPSPNMPWFK        79876   sp|Q05639|EF1A2_HUMAN   1       Human_reference_proteome_decoy null     [, , Percolator, 3.05]  4.096895298253978e-05   21-UNIMOD:35,28-UNIMOD:35       2171.528915913588662    3      1210.90625       1210.911009505437733    ms_run[38]:controllerType=0 controllerNumber=1 scan=21815       K       G      181      212     1.87686e-03     1.53633e-04     0       IGYNPATVPFVPISGWHGDNM(Oxidation)LEPSPNM(Oxidation)PWFK
PSM     IGYNPATVPFVPISGWHGDNMLEPSPNMPWFK        516877  sp|Q05639|EF1A2_HUMAN   1       Human_reference_proteome_decoy null     [, , Percolator, 3.05]  1.048975151277202e-04   21-UNIMOD:35,28-UNIMOD:35       2173.560448569738583    3      1210.90869140625 1210.911009505437733    ms_run[37]:controllerType=0 controllerNumber=1 scan=21574       K       G      181      212     6.674259999999999e-03   4.373180000000001e-04   0       IGYNPATVPFVPISGWHGDNM(Oxidation)LEPSPNM(Oxidation)PWFK
PSM     IGYNPATVPFVPISGWHGDNMLEPSPNMPWFK        516878  sp|Q05639|EF1A2_HUMAN   1       Human_reference_proteome_decoy null     [, , Percolator, 3.05]  1.867936863734006e-04   21-UNIMOD:35,28-UNIMOD:35       2170.061681417906584    3      1210.910400390625        1210.911009505437733    ms_run[39]:controllerType=0 controllerNumber=1 scan=21863       K      G181     212     0.0102437       7.62893e-04     0       IGYNPATVPFVPISGWHGDNM(Oxidation)LEPSPNM(Oxidation)PWFK

run 37,38,39 correspond to following files:

1   4   /mnt/MS_Data/Jack/Proteomics/20230816b_JR_IF_Protein_Expression/20230721b_JR_IF_Protein_Expression_10.raw   1   1
2   4   /mnt/MS_Data/Jack/Proteomics/20230816b_JR_IF_Protein_Expression/20230721b_JR_IF_Protein_Expression_11.raw   1   2
3   4   /mnt/MS_Data/Jack/Proteomics/20230816b_JR_IF_Protein_Expression/20230721b_JR_IF_Protein_Expression_12.raw   1   3

It also makes sense that only these files have peptide intensities assigned:

PEH     sequence        accession       unique  database        database_version        search_engine modifications    retention_time  retention_time_window   charge  mass_to_charge  spectra_ref     peptide_abundance_study_variable[1] 
  peptide_abundance_study_variable[2]    peptide_abundance_study_variable[3] ... 
peptide_abundance_study_variable[12]    opt_global_cv_MS:1002217_decoy_peptide
PEP     IGYNPATVPFVPISGWHGDNMLEPSPNMPWFK        sp|Q05639|EF1A2_HUMAN   21-UNIMOD:35,28-UNIMOD:35       2176.172193648107623    null    3      1210.911009505437733     ms_run[38]:controllerType=0 controllerNumber=1 scan=21815       1.23646728e08   1.242208e08      1.14578376e08   null    ...    null    17444314298292841008    0       IGYNPATVPFVPISGWHGDNM(Oxidation)LEPSPNM(Oxidation)PWFK

the corresponding protein entries look like the plot and have zero intensities for e.g. study variable 1-3:

PRH accession   description taxid   species database    database_version    search_engine   best_search_engine_score[1] ambiguity_members   modifications   protein_coverage    protein_abundance_assay[1]  protein_abundance_assay[2]  protein_abundance_assay[3]  protein_abundance_assay[4]  protein_abundance_assay[5]  protein_abundance_assay[6]  protein_abundance_assay[7]  protein_abundance_assay[8]  protein_abundance_assay[9]  protein_abundance_assay[10] protein_abundance_assay[11] protein_abundance_assay[12] protein_abundance_study_variable[1] protein_abundance_stdev_study_variable[1]   protein_abundance_std_error_study_variable[1]   protein_abundance_study_variable[2] protein_abundance_stdev_study_variable[2]   protein_abundance_std_error_study_variable[2]   protein_abundance_study_variable[3] protein_abundance_stdev_study_variable[3]   protein_abundance_std_error_study_variable[3]   protein_abundance_study_variable[4] protein_abundance_stdev_study_variable[4]   protein_abundance_std_error_study_variable[4]   protein_abundance_study_variable[5] protein_abundance_stdev_study_variable[5]   protein_abundance_std_error_study_variable[5]   protein_abundance_study_variable[6] protein_abundance_stdev_study_variable[6]   protein_abundance_std_error_study_variable[6]   protein_abundance_study_variable[7] protein_abundance_stdev_study_variable[7]   protein_abundance_std_error_study_variable[7]   protein_abundance_study_variable[8] protein_abundance_stdev_study_variable[8]   protein_abundance_std_error_study_variable[8]   protein_abundance_study_variable[9] protein_abundance_stdev_study_variable[9]   protein_abundance_std_error_study_variable[9]   protein_abundance_study_variable[10]    protein_abundance_stdev_study_variable[10]  protein_abundance_std_error_study_variable[10]  protein_abundance_study_variable[11]    protein_abundance_stdev_study_variable[11]  protein_abundance_std_error_study_variable[11]  protein_abundance_study_variable[12]    protein_abundance_stdev_study_variable[12]  protein_abundance_std_error_study_variable[12]  opt_global_Posterior_Probability_score  opt_global_nr_found_peptides    opt_global_cv_PRIDE:0000303_decoy_hit   opt_global_result_type
PRT     sp|Q05639|EF1A2_HUMAN   Elongation factor 1-alpha 2 OS=Homo sapiens GN=EEF1A2 PE=1 SV=1 null    null    Human_reference_proteome_decoy    null     null    9.578544061302683e-05   sp|Q05639|EF1A2_HUMAN   410-UNIMOD:35   0.505399568034557       0.0     0.0    0.0  0.0      0.0     0.0     5.24809152e08   5.5484896e08    0.0     9.1475168e08    7.48456704e08   4.44111296e08   0.0    null   null     0.0     null    null    0.0     null    null    0.0     null    null    0.0     null    null    0.0     null   null   5.24809152e08    null    null    5.5484896e08    null    null    0.0     null    null    9.1475168e08    null    null   7.48456704e08  null     null    4.44111296e08   null    null    null    null    null    single_protein

Maybe because all these peptides were modified "IGYNPATVPFVPISGWHGDNM(Oxidation)LEPSPNM(Oxidation)PWFK"? Can you double check if that could be the case?

timosachsenberg commented 1 year ago

lets take a look at SV 12. Corresponds to these runs:

12  1   /mnt/MS_Data/Jack/Proteomics/20230816b_JR_IF_Protein_Expression/20230816b_JR_IF_Protein_Expression_15.raw   1   12
12  2   /mnt/MS_Data/Jack/Proteomics/20230816b_JR_IF_Protein_Expression/20230816b_JR_IF_Protein_Expression_18.raw   1   12
12  3   /mnt/MS_Data/Jack/Proteomics/20230816b_JR_IF_Protein_Expression/20230816b_JR_IF_Protein_Expression_21.raw   1   12
12  4   /mnt/MS_Data/Jack/Proteomics/20230816b_JR_IF_Protein_Expression/20230816b_JR_IF_Protein_Expression_24.raw   1   12

MTD assay[12]-ms_run_ref    ms_run[12],ms_run[24],ms_run[36],ms_run[48]

which are properly annotated in the mzTab.

But: there are no unique PSMs for the protein present in these runs (only shared peptides).

@jpfeuffer do we also quantify based on shared peptides? Looks to me like we quantify based on all peptides https://github.com/OpenMS/OpenMS/blob/develop/src/topp/ProteomicsLFQ.cpp#L1761C35-L1761C35

jpfeuffer commented 1 year ago

Only if shared across an indistinguishable group-only and only if the option is set.

timosachsenberg commented 1 year ago

Looking at the consensusXML it indeed looks like some unique peptides are not quantified in e.g., SV12.

        <consensusElement id="e_6379324141785154685" quality="0.803306" charge="4">
                        <centroid rt="1880.265226511645551" mz="643.611085131962" it="4.648045e07"/>
                        <groupedElementList>
SV1,fraction 4:                 <element map="36" id="18434921922409072992" rt="1873.627987067993672" mz="643.610828454396142" it="9.943691e07" charge="4"/>
SV2,fraction 4:                 <element map="37" id="3903738955282815979" rt="1873.186935449301018" mz="643.610828454396142" it="9.476169e07" charge="4"/>
SV3,fraction 4:                 <element map="38" id="8771180694178519987" rt="1873.84276088181673" mz="643.610828454396142" it="8.270501e07" charge="4"/>
SV4,fraction 4:                 <element map="39" id="15324052864196123987" rt="1872.84315779356939" mz="643.610828454396142" it="6.915728e05" charge="4"/>
SV5,fraction 4:                 <element map="40" id="9363500607376664293" rt="1872.002199846342592" mz="643.610828454396142" it="7.715933e05" charge="4"/>
SV7,fraction 4:                 <element map="42" id="6214631694926889497" rt="1892.193955495920591" mz="643.610828454396142" it="5.904072e07" charge="4"/>
SV8,fraction 4:                 <element map="43" id="13131907856929658537" rt="1891.762716804133788" mz="643.613395230054152" it="6.694387e07" charge="4"/>
SV9,fraction 4:                 <element map="44" id="5243948147904516742" rt="1889.77487967837601" mz="643.610828454396142" it="6.043696e07" charge="4"/>
SV10,raction 4:                 <element map="45" id="4055752267570037793" rt="1883.014810463561844" mz="643.610828454396142" it="1.159503e04" charge="4"/>
SV11,fction 4:                 <element map="46" id="16076938908342281004" rt="1880.402861635437148" mz="643.610828454396142" it="4594.880371" charge="4"/>
                        </groupedElementList>
                        <PeptideIdentification identification_run_ref="PI_0" score_type="q-value" higher_score_better="false" significance_threshold="0" MZ="643.61279296875" RT="1872.37237016588" spectrum_reference="controllerType=0 controllerNumber=1 scan=16981" >
                                <PeptideHit score="9.01599437401951e-06" sequence="VETGILRPGMVVTFAPVNITTEVK" charge="4" aa_before="R" aa_after="S" start="266" end="289" protein_refs="PH_112">
                                        <UserParam type="string" name="target_decoy" value="target"/>
                                        <UserParam type="float" name="q-value" value="1.68435e-04"/>
                                        <UserParam type="float" name="MS:1001493_score" value="7.74354e-04"/>
                                </PeptideHit>
                                <UserParam type="string" name="MS:1001115" value="16981"/>
                                <UserParam type="string" name="FFId_category" value="internal"/>
                                <UserParam type="string" name="feature_id" value="6379324141785154685"/>
                                <UserParam type="int" name="map_index" value="38"/>
                                <UserParam type="int" name="id_merge_index" value="38"/>
                        </PeptideIdentification>
                        <UserParam type="string" name="feature_id" value="6379324141785154685"/>
                </consensusElement>

Example of not quantified:

     <UnassignedPeptideIdentification identification_run_ref="PI_0" score_type="q-value" higher_score_better="false" significance_threshold="0" MZ="643.61279296875" RT="1882.39501337366" spectrum_reference="controllerType=0 controllerNumber=1 scan=16981" >
                <PeptideHit score="9.01599437401951e-06" sequence="VETGILRPGMVVTFAPVNITTEVK" charge="4" aa_before="R" aa_after="S" start="266" end="289" protein_refs="PH_112">
                        <UserParam type="string" name="target_decoy" value="target"/>
                        <UserParam type="float" name="q-value" value="1.68435e-04"/>
                        <UserParam type="float" name="MS:1001493_score" value="7.74354e-04"/>
                </PeptideHit>
                <UserParam type="string" name="MS:1001115" value="16981"/>
                <UserParam type="string" name="FFId_category" value="internal"/>
                <UserParam type="string" name="feature_id" value="not mapped"/>
                <UserParam type="int" name="map_index" value="47"/>
                <UserParam type="int" name="id_merge_index" value="47"/>
        </UnassignedPeptideIdentification>

which is from SV 12, Fraction 4

timosachsenberg commented 1 year ago

ok I think I can reproduce something with: ProteinQuantifier -in 20230831a_JR_IF_Protein_Expression_experimental_design_mzML_openms_design_openms.consensusXML -out test.csv -top:N 0 -top:include_all

timosachsenberg commented 1 year ago

I think I was able to track down the issue but I need to investigate it a bit more. Protein abundances look much better though with a quick fix.

ypriverol commented 1 year ago

@timosachsenberg is this affecting also Msstats output?

timosachsenberg commented 1 year ago

At first glance, it seems not affected. The issue was only in protein quantifier.

ypriverol commented 1 year ago

Another small, please double-check if it can be also affecting the TMT, because protein quantifier is used in the TMT workflow.

timosachsenberg commented 1 year ago

I think TMT could have this bug, but maybe we are lucky. If the bug appears depends on the order how files are merged. I will take this discussion to slack from here...

ypriverol commented 1 year ago

@jackrogan @hendrikweisser in the recent version of quantms bigbio dev branch, your bug has been fixed. Can you take a look and check if is fixed.?

jackrogan commented 1 year ago

I've got the same sample set running again on the bigbio/dev version as I type - I'll update as soon as the results are in!

jackrogan commented 1 year ago

Unfortunately, running the same command with the dev version has resulted in this error:

Process NFCORE_QUANTMS:QUANTMS:LFQ:PROTEOMICSLFQ (20230831a_JR_IF_Protein_Expression_experimental_design_mzML_openms_design) terminated with an error exit status (6)

Is there anything I need to change when running the latest version dev branch?

timosachsenberg commented 1 year ago

can you provide the ProteomicsLFQ log file?

jackrogan commented 1 year ago

proteomicslfq.log

timosachsenberg commented 1 year ago

it complains about an unknown "-id_transfer_threshold" parameter which should be part of the dev version. right @ypriverol

ypriverol commented 1 year ago

yes, but it has a default value of 0.5, let me check. Probably you pull a version that was not working properly in dev. Can you pull @jackrogan the current version in dev. It must be working fine.

jackrogan commented 1 year ago

The version I pulled on Friday says nf-core/quantms v1.2.0dev-g8db0645 - is there a later version?

ypriverol commented 1 year ago

This is the latest version bigbio/quantms dev-8db0645

What command did you run with your pipeline.

jackrogan commented 1 year ago

nextflow run bigbio/quantms -r dev -profile docker --input 20230831a_JR_IF_Protein_Expression_experimental_design_mzML.tsv --database /home/jack.rogan/Proteomics/Human_reference_proteome.fasta --add_decoys --search_engines msgf --enzyme Trypsin --precursor_mass_tolerance 10 --precursor_mass_tolerance_unit ppm --fragment_mass_tolerance 10 --fragment_mass_tolerance_unit ppm --variable_mods "Oxidation (M),Acetyl (Protein N-term)" --max_precursor_charge 5 --min_peptide_length 7 --FDR_level psm-level-fdrs --max_memory 48.GB --outdir 20230929a_JR_IF_Protein_Expression_msgf_allow_single_feat --labelling_type "label free sample" --acquisition_method dda --normalize true --transfer_ids mean --targeted_only false --msstats_remove_one_feat_prot false --msstatslfq_removeFewMeasurements false

ypriverol commented 1 year ago

The command looks ok to me. Can you double check that the container you are running is the latest version of openms? Can you update the following container ghcr.io/openms/openms-executables:latest

jackrogan commented 1 year ago

I updated openms, and this time it worked great! It looks like proteomicsLFQ is fixed!

ypriverol commented 1 year ago

Fantastic !!!! Thanks to @timosachsenberg for fixing this bug. I will close the issue.

jpfeuffer commented 1 year ago

Thank you @timosachsenberg

jackrogan commented 1 year ago

Many thanks!

hendrikweisser commented 1 year ago

Thanks for your help, everyone (especially Timo)! When will the fix appear in the nf-core version of the workflow?

ypriverol commented 1 year ago

This is a good question. We are now working in three major problems:

Full support for SAGE (now fixing bugs)
Support in DIA for Bruker (now fixing bugs)
FDR control for MBR (developing of the algorithm)

I hope we can release mid-October.

bigbio / quantms