Closed jackrogan closed 1 year ago
Yes sure. How big is it? Do you want to send it via email or use a file drop service? Can you add the consensusXML that usually is located in the same output folder and maybe the design? Probably something goes wrong when trying to "translate" into mzTab.
I can do either if it's convenient! The mzTab .zip is 36 MB. 104 MB for mzTab and consensusXML.
Perfect.
I was about to say: Do the drop service, so every dev can check it.
Can you name an example mismatch if you have any?
Try EIF1A2 (Q05639) or EIF3F (O00303) - both have a mismatch in the missing data, I think.
@jackrogan, can you also share the experimental design file, please?
Sure thing: 20230831a_JR_IF_Protein_Expression_experimental_design_mzML.txt I changed the extension to .txt to allow it to upload.
Below is a nice plot that Jack shared with me illustrating the problem. We would expect to see the protein abundances line up with the peptide abundances (samples on the x axis).
@hendrikweisser @jackrogan can you check if the same problem is observed when using the Msstats input, which is another output of proteomicsLFQ.
can you check if the same problem is observed when using the Msstats input
The MSstats input contains only peptide-level abundances, so the problem isn't apparent there. Only in the mzTab.
picking out: IGYNPATVPFVPISGWHGDNMLEPSPNMPWFK and posting some notes mainly for myself...
According to the PSM section is it found in fraction 4 of the first three study variables corresponding to files
PSM IGYNPATVPFVPISGWHGDNMLEPSPNMPWFK 79876 sp|Q05639|EF1A2_HUMAN 1 Human_reference_proteome_decoy null [, , Percolator, 3.05] 4.096895298253978e-05 21-UNIMOD:35,28-UNIMOD:35 2171.528915913588662 3 1210.90625 1210.911009505437733 ms_run[38]:controllerType=0 controllerNumber=1 scan=21815 K G 181 212 1.87686e-03 1.53633e-04 0 IGYNPATVPFVPISGWHGDNM(Oxidation)LEPSPNM(Oxidation)PWFK
PSM IGYNPATVPFVPISGWHGDNMLEPSPNMPWFK 516877 sp|Q05639|EF1A2_HUMAN 1 Human_reference_proteome_decoy null [, , Percolator, 3.05] 1.048975151277202e-04 21-UNIMOD:35,28-UNIMOD:35 2173.560448569738583 3 1210.90869140625 1210.911009505437733 ms_run[37]:controllerType=0 controllerNumber=1 scan=21574 K G 181 212 6.674259999999999e-03 4.373180000000001e-04 0 IGYNPATVPFVPISGWHGDNM(Oxidation)LEPSPNM(Oxidation)PWFK
PSM IGYNPATVPFVPISGWHGDNMLEPSPNMPWFK 516878 sp|Q05639|EF1A2_HUMAN 1 Human_reference_proteome_decoy null [, , Percolator, 3.05] 1.867936863734006e-04 21-UNIMOD:35,28-UNIMOD:35 2170.061681417906584 3 1210.910400390625 1210.911009505437733 ms_run[39]:controllerType=0 controllerNumber=1 scan=21863 K G181 212 0.0102437 7.62893e-04 0 IGYNPATVPFVPISGWHGDNM(Oxidation)LEPSPNM(Oxidation)PWFK
run 37,38,39 correspond to following files:
1 4 /mnt/MS_Data/Jack/Proteomics/20230816b_JR_IF_Protein_Expression/20230721b_JR_IF_Protein_Expression_10.raw 1 1
2 4 /mnt/MS_Data/Jack/Proteomics/20230816b_JR_IF_Protein_Expression/20230721b_JR_IF_Protein_Expression_11.raw 1 2
3 4 /mnt/MS_Data/Jack/Proteomics/20230816b_JR_IF_Protein_Expression/20230721b_JR_IF_Protein_Expression_12.raw 1 3
It also makes sense that only these files have peptide intensities assigned:
PEH sequence accession unique database database_version search_engine modifications retention_time retention_time_window charge mass_to_charge spectra_ref peptide_abundance_study_variable[1]
peptide_abundance_study_variable[2] peptide_abundance_study_variable[3] ...
peptide_abundance_study_variable[12] opt_global_cv_MS:1002217_decoy_peptide
PEP IGYNPATVPFVPISGWHGDNMLEPSPNMPWFK sp|Q05639|EF1A2_HUMAN 21-UNIMOD:35,28-UNIMOD:35 2176.172193648107623 null 3 1210.911009505437733 ms_run[38]:controllerType=0 controllerNumber=1 scan=21815 1.23646728e08 1.242208e08 1.14578376e08 null ... null 17444314298292841008 0 IGYNPATVPFVPISGWHGDNM(Oxidation)LEPSPNM(Oxidation)PWFK
the corresponding protein entries look like the plot and have zero intensities for e.g. study variable 1-3:
PRH accession description taxid species database database_version search_engine best_search_engine_score[1] ambiguity_members modifications protein_coverage protein_abundance_assay[1] protein_abundance_assay[2] protein_abundance_assay[3] protein_abundance_assay[4] protein_abundance_assay[5] protein_abundance_assay[6] protein_abundance_assay[7] protein_abundance_assay[8] protein_abundance_assay[9] protein_abundance_assay[10] protein_abundance_assay[11] protein_abundance_assay[12] protein_abundance_study_variable[1] protein_abundance_stdev_study_variable[1] protein_abundance_std_error_study_variable[1] protein_abundance_study_variable[2] protein_abundance_stdev_study_variable[2] protein_abundance_std_error_study_variable[2] protein_abundance_study_variable[3] protein_abundance_stdev_study_variable[3] protein_abundance_std_error_study_variable[3] protein_abundance_study_variable[4] protein_abundance_stdev_study_variable[4] protein_abundance_std_error_study_variable[4] protein_abundance_study_variable[5] protein_abundance_stdev_study_variable[5] protein_abundance_std_error_study_variable[5] protein_abundance_study_variable[6] protein_abundance_stdev_study_variable[6] protein_abundance_std_error_study_variable[6] protein_abundance_study_variable[7] protein_abundance_stdev_study_variable[7] protein_abundance_std_error_study_variable[7] protein_abundance_study_variable[8] protein_abundance_stdev_study_variable[8] protein_abundance_std_error_study_variable[8] protein_abundance_study_variable[9] protein_abundance_stdev_study_variable[9] protein_abundance_std_error_study_variable[9] protein_abundance_study_variable[10] protein_abundance_stdev_study_variable[10] protein_abundance_std_error_study_variable[10] protein_abundance_study_variable[11] protein_abundance_stdev_study_variable[11] protein_abundance_std_error_study_variable[11] protein_abundance_study_variable[12] protein_abundance_stdev_study_variable[12] protein_abundance_std_error_study_variable[12] opt_global_Posterior_Probability_score opt_global_nr_found_peptides opt_global_cv_PRIDE:0000303_decoy_hit opt_global_result_type
PRT sp|Q05639|EF1A2_HUMAN Elongation factor 1-alpha 2 OS=Homo sapiens GN=EEF1A2 PE=1 SV=1 null null Human_reference_proteome_decoy null null 9.578544061302683e-05 sp|Q05639|EF1A2_HUMAN 410-UNIMOD:35 0.505399568034557 0.0 0.0 0.0 0.0 0.0 0.0 5.24809152e08 5.5484896e08 0.0 9.1475168e08 7.48456704e08 4.44111296e08 0.0 null null 0.0 null null 0.0 null null 0.0 null null 0.0 null null 0.0 null null 5.24809152e08 null null 5.5484896e08 null null 0.0 null null 9.1475168e08 null null 7.48456704e08 null null 4.44111296e08 null null null null null single_protein
Maybe because all these peptides were modified "IGYNPATVPFVPISGWHGDNM(Oxidation)LEPSPNM(Oxidation)PWFK"? Can you double check if that could be the case?
lets take a look at SV 12. Corresponds to these runs:
12 1 /mnt/MS_Data/Jack/Proteomics/20230816b_JR_IF_Protein_Expression/20230816b_JR_IF_Protein_Expression_15.raw 1 12
12 2 /mnt/MS_Data/Jack/Proteomics/20230816b_JR_IF_Protein_Expression/20230816b_JR_IF_Protein_Expression_18.raw 1 12
12 3 /mnt/MS_Data/Jack/Proteomics/20230816b_JR_IF_Protein_Expression/20230816b_JR_IF_Protein_Expression_21.raw 1 12
12 4 /mnt/MS_Data/Jack/Proteomics/20230816b_JR_IF_Protein_Expression/20230816b_JR_IF_Protein_Expression_24.raw 1 12
MTD assay[12]-ms_run_ref ms_run[12],ms_run[24],ms_run[36],ms_run[48]
which are properly annotated in the mzTab.
But: there are no unique PSMs for the protein present in these runs (only shared peptides).
@jpfeuffer do we also quantify based on shared peptides? Looks to me like we quantify based on all peptides https://github.com/OpenMS/OpenMS/blob/develop/src/topp/ProteomicsLFQ.cpp#L1761C35-L1761C35
Only if shared across an indistinguishable group-only and only if the option is set.
Looking at the consensusXML it indeed looks like some unique peptides are not quantified in e.g., SV12.
<consensusElement id="e_6379324141785154685" quality="0.803306" charge="4">
<centroid rt="1880.265226511645551" mz="643.611085131962" it="4.648045e07"/>
<groupedElementList>
SV1,fraction 4: <element map="36" id="18434921922409072992" rt="1873.627987067993672" mz="643.610828454396142" it="9.943691e07" charge="4"/>
SV2,fraction 4: <element map="37" id="3903738955282815979" rt="1873.186935449301018" mz="643.610828454396142" it="9.476169e07" charge="4"/>
SV3,fraction 4: <element map="38" id="8771180694178519987" rt="1873.84276088181673" mz="643.610828454396142" it="8.270501e07" charge="4"/>
SV4,fraction 4: <element map="39" id="15324052864196123987" rt="1872.84315779356939" mz="643.610828454396142" it="6.915728e05" charge="4"/>
SV5,fraction 4: <element map="40" id="9363500607376664293" rt="1872.002199846342592" mz="643.610828454396142" it="7.715933e05" charge="4"/>
SV7,fraction 4: <element map="42" id="6214631694926889497" rt="1892.193955495920591" mz="643.610828454396142" it="5.904072e07" charge="4"/>
SV8,fraction 4: <element map="43" id="13131907856929658537" rt="1891.762716804133788" mz="643.613395230054152" it="6.694387e07" charge="4"/>
SV9,fraction 4: <element map="44" id="5243948147904516742" rt="1889.77487967837601" mz="643.610828454396142" it="6.043696e07" charge="4"/>
SV10,raction 4: <element map="45" id="4055752267570037793" rt="1883.014810463561844" mz="643.610828454396142" it="1.159503e04" charge="4"/>
SV11,fction 4: <element map="46" id="16076938908342281004" rt="1880.402861635437148" mz="643.610828454396142" it="4594.880371" charge="4"/>
</groupedElementList>
<PeptideIdentification identification_run_ref="PI_0" score_type="q-value" higher_score_better="false" significance_threshold="0" MZ="643.61279296875" RT="1872.37237016588" spectrum_reference="controllerType=0 controllerNumber=1 scan=16981" >
<PeptideHit score="9.01599437401951e-06" sequence="VETGILRPGMVVTFAPVNITTEVK" charge="4" aa_before="R" aa_after="S" start="266" end="289" protein_refs="PH_112">
<UserParam type="string" name="target_decoy" value="target"/>
<UserParam type="float" name="q-value" value="1.68435e-04"/>
<UserParam type="float" name="MS:1001493_score" value="7.74354e-04"/>
</PeptideHit>
<UserParam type="string" name="MS:1001115" value="16981"/>
<UserParam type="string" name="FFId_category" value="internal"/>
<UserParam type="string" name="feature_id" value="6379324141785154685"/>
<UserParam type="int" name="map_index" value="38"/>
<UserParam type="int" name="id_merge_index" value="38"/>
</PeptideIdentification>
<UserParam type="string" name="feature_id" value="6379324141785154685"/>
</consensusElement>
Example of not quantified:
<UnassignedPeptideIdentification identification_run_ref="PI_0" score_type="q-value" higher_score_better="false" significance_threshold="0" MZ="643.61279296875" RT="1882.39501337366" spectrum_reference="controllerType=0 controllerNumber=1 scan=16981" >
<PeptideHit score="9.01599437401951e-06" sequence="VETGILRPGMVVTFAPVNITTEVK" charge="4" aa_before="R" aa_after="S" start="266" end="289" protein_refs="PH_112">
<UserParam type="string" name="target_decoy" value="target"/>
<UserParam type="float" name="q-value" value="1.68435e-04"/>
<UserParam type="float" name="MS:1001493_score" value="7.74354e-04"/>
</PeptideHit>
<UserParam type="string" name="MS:1001115" value="16981"/>
<UserParam type="string" name="FFId_category" value="internal"/>
<UserParam type="string" name="feature_id" value="not mapped"/>
<UserParam type="int" name="map_index" value="47"/>
<UserParam type="int" name="id_merge_index" value="47"/>
</UnassignedPeptideIdentification>
which is from SV 12, Fraction 4
ok I think I can reproduce something with: ProteinQuantifier -in 20230831a_JR_IF_Protein_Expression_experimental_design_mzML_openms_design_openms.consensusXML -out test.csv -top:N 0 -top:include_all
I think I was able to track down the issue but I need to investigate it a bit more. Protein abundances look much better though with a quick fix.
@timosachsenberg is this affecting also Msstats output?
At first glance, it seems not affected. The issue was only in protein quantifier.
Another small, please double-check if it can be also affecting the TMT, because protein quantifier is used in the TMT workflow.
I think TMT could have this bug, but maybe we are lucky. If the bug appears depends on the order how files are merged. I will take this discussion to slack from here...
@jackrogan @hendrikweisser in the recent version of quantms bigbio dev branch, your bug has been fixed. Can you take a look and check if is fixed.?
I've got the same sample set running again on the bigbio/dev version as I type - I'll update as soon as the results are in!
Unfortunately, running the same command with the dev version has resulted in this error:
Process NFCORE_QUANTMS:QUANTMS:LFQ:PROTEOMICSLFQ (20230831a_JR_IF_Protein_Expression_experimental_design_mzML_openms_design)
terminated with an error exit status (6)
Is there anything I need to change when running the latest version dev branch?
can you provide the ProteomicsLFQ log file?
it complains about an unknown "-id_transfer_threshold" parameter which should be part of the dev version. right @ypriverol
yes, but it has a default value of 0.5, let me check. Probably you pull a version that was not working properly in dev. Can you pull @jackrogan the current version in dev. It must be working fine.
The version I pulled on Friday says nf-core/quantms v1.2.0dev-g8db0645 - is there a later version?
This is the latest version bigbio/quantms dev-8db0645
What command did you run with your pipeline.
nextflow run bigbio/quantms -r dev -profile docker --input 20230831a_JR_IF_Protein_Expression_experimental_design_mzML.tsv --database /home/jack.rogan/Proteomics/Human_reference_proteome.fasta --add_decoys --search_engines msgf --enzyme Trypsin --precursor_mass_tolerance 10 --precursor_mass_tolerance_unit ppm --fragment_mass_tolerance 10 --fragment_mass_tolerance_unit ppm --variable_mods "Oxidation (M),Acetyl (Protein N-term)" --max_precursor_charge 5 --min_peptide_length 7 --FDR_level psm-level-fdrs --max_memory 48.GB --outdir 20230929a_JR_IF_Protein_Expression_msgf_allow_single_feat --labelling_type "label free sample" --acquisition_method dda --normalize true --transfer_ids mean --targeted_only false --msstats_remove_one_feat_prot false --msstatslfq_removeFewMeasurements false
The command looks ok to me. Can you double check that the container you are running is the latest version of openms? Can you update the following container ghcr.io/openms/openms-executables:latest
I updated openms, and this time it worked great! It looks like proteomicsLFQ is fixed!
Fantastic !!!! Thanks to @timosachsenberg for fixing this bug. I will close the issue.
Thank you @timosachsenberg
Many thanks!
Thanks for your help, everyone (especially Timo)! When will the fix appear in the nf-core version of the workflow?
This is a good question. We are now working in three major problems:
I hope we can release mid-October.
Description of the bug
Output mzTab has mismatches between the peptide and protein sections of the data. For example, protein intensities listed for study-variable/protein combinations that have no peptide intensities at all, and vice-versa. The MSstats stage of the pipeline works fine, and the output matches the peptide table. I've seen this only in datasets for fractionated samples. I haven't seen it in non-fractionated datasets.
Command used and terminal output
No response
Relevant files
I have an example .mzTab I can share - unfortunately it is too big to upload in a .zip archive here, but I can supply it!
System information
No response