Nesvilab / FragPipe

A cross-platform proteomics data analysis suite
http://fragpipe.nesvilab.org

Mismatched data in MSFragger results files #502

treponeme closed this issue 1 year ago

treponeme commented 2 years ago

Hello

I have been analyzing MS data with FragPipe version 16.0 / MSFragger version 3.3 / Philosopher version 4.0.0 (build 1626989421) and have noticed some mismatched data in the results files, as listed below:

  1. The number of unique peptides reported for several proteins in the protein.tsv files differs from the number of unique peptides listed in the corresponding psm.tsv files. For example, TPANIC_0486_TPANIC_RS02360_hypothetical protein_WP_012460559_05072013 is reported to have 4 unique peptides in the sample 256 protein.tsv file, but only 3 unique peptides are listed in the sample 256 psm.tsv file. Another example is TPANIC_RS00150_TPANIC_0030_chaperonin, which has 11 unique peptides in the sample 256 protein.tsv file but only 8 in the sample 256 psm.tsv file. I have also seen the reverse, where fewer peptides are reported in the protein.tsv file than in the corresponding psm.tsv file.

  2. Protein probabilities for each detected protein reported in the combined results file (combined_protein.tsv) are identical to the protein probabilities reported for all corresponding detected proteins in the three individual sample files (samples 256, 257, 258; protein.tsv files), so only one protein probability value exists for each protein in all four files.
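The first check above can be reproduced independently of the pipeline. The following is a minimal sketch, not the way Philosopher itself counts peptides: it assumes psm.tsv has `Peptide`, `Protein` and `Is Unique` columns and that protein.tsv has `Protein` and `Unique Peptides` columns (column names vary between versions, so adjust them to your files). The function names and the demo rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def unique_peptides_from_psm(psm_file):
    """Count distinct peptide sequences per protein among PSMs flagged as unique."""
    peptides = defaultdict(set)
    for row in csv.DictReader(psm_file, delimiter="\t"):
        # "Is Unique", "Protein" and "Peptide" are assumed column names.
        if row["Is Unique"].strip().lower() == "true":
            peptides[row["Protein"]].add(row["Peptide"])
    return {protein: len(seqs) for protein, seqs in peptides.items()}


def find_mismatches(protein_file, psm_counts):
    """Return {protein: (count in protein.tsv, count derived from psm.tsv)}."""
    mismatches = {}
    for row in csv.DictReader(protein_file, delimiter="\t"):
        reported = int(row["Unique Peptides"])  # assumed column name
        derived = psm_counts.get(row["Protein"], 0)
        if reported != derived:
            mismatches[row["Protein"]] = (reported, derived)
    return mismatches


# Tiny made-up tables so the sketch runs without any files.
psm_demo = io.StringIO(
    "Peptide\tProtein\tIs Unique\n"
    "AAAK\tprotA\ttrue\n"
    "AAAK\tprotA\ttrue\n"
    "CCCK\tprotA\ttrue\n"
    "DDDK\tprotB\tfalse\n"
)
protein_demo = io.StringIO(
    "Protein\tUnique Peptides\n"
    "protA\t3\n"
    "protB\t0\n"
)
demo_mismatches = find_mismatches(protein_demo, unique_peptides_from_psm(psm_demo))
print(demo_mismatches)
```

With real output, the two in-memory tables would be replaced by `open("psm.tsv", newline="")` and `open("protein.tsv", newline="")` for the same sample. A mismatch reported here only says the two tables disagree under these assumptions; it does not say which table is correct.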

I would greatly appreciate it if you could look into these data mismatches / discrepancies - I have attached the zipped log file for the three samples (samples 256, 257, and 258) and the combined data.

log_2021-07-30_07-16-48.zip

Thank you for your help with this matter

fcyu commented 2 years ago

Felipe @prvst , can you take a look when you have time?

Thanks,

Fengchao

treponeme commented 2 years ago

Hello

I have noticed one other issue with the data in this conversation (same zipped log file as above: https://github.com/Nesvilab/FragPipe/files/7419927/log_2021-07-30_07-16-48.zip):

Many proteins that were detected at high probabilities (>0.95) in the protein.tsv files of each of the individual samples (samples 256, 257, 258) are listed with values equal to zero for "Total Peptides", "Unique Peptides", "Razor peptides", "Total Spectral Count", "Unique Spectral Count" and "Razor Spectral Count". For example, in the protein.tsv file for sample 256, the protein "TPANIC_0453_TPANIC_RS02210_membrane" is assigned a protein probability of 1, yet zero total peptides, zero unique peptides, zero razor peptides, zero total spectral counts, zero unique spectral counts, and zero razor spectral counts are reported. Can these proteins still be considered high-confidence detections?
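To list every protein affected by this, a small filter over protein.tsv may help. This is a rough sketch under assumptions: the count columns are matched case-insensitively against the names quoted above, the probability column is assumed to be called `Protein Probability`, and the function name and demo rows are invented for illustration.

```python
import csv
import io

# Count columns as quoted in this thread, lower-cased for matching.
COUNT_COLUMNS = {
    "total peptides", "unique peptides", "razor peptides",
    "total spectral count", "unique spectral count", "razor spectral count",
}


def zero_evidence_proteins(protein_file, min_probability=0.95):
    """Return proteins above the probability cutoff whose count columns are all zero."""
    flagged = []
    for row in csv.DictReader(protein_file, delimiter="\t"):
        # Collect whichever of the count columns are present in this file.
        counts = [
            int(value)
            for name, value in row.items()
            if name and name.strip().lower() in COUNT_COLUMNS
        ]
        probability = float(row["Protein Probability"])  # assumed column name
        if probability > min_probability and counts and sum(counts) == 0:
            flagged.append(row["Protein"])
    return flagged


# Made-up rows: only protA is both high-probability and without evidence.
protein_demo = io.StringIO(
    "Protein\tProtein Probability\tTotal Peptides\tUnique Peptides\tTotal Spectral Count\n"
    "protA\t1.0\t0\t0\t0\n"
    "protB\t0.99\t4\t3\t9\n"
    "protC\t0.50\t0\t0\t0\n"
)
flagged_demo = zero_evidence_proteins(protein_demo)
print(flagged_demo)
```

Running the same filter on each sample's protein.tsv gives a concrete list of proteins to attach to a report like this one.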

Thank you again for your help with these matters.

prvst commented 2 years ago

Hi @treponeme, I'll take a look at your log files. This is a label-free, multi-experiment analysis, is that correct?

fcyu commented 2 years ago

Hi @treponeme ,

Can you re-run your analysis with the latest FragPipe (17.0) and the latest Philosopher (4.1.0)? You can skip MSFragger and Percolator by unchecking them.

If the issue is still there, can you send us your fasta file?

Thanks,

Fengchao

treponeme commented 2 years ago

Hi Felipe

Yes, this was a label-free experiment consisting of three separate samples.

Thanks



treponeme commented 2 years ago

Hello Fengchao

The analyses have been re-run using the latest software, but the issues still persist.

I have attached the fasta file database used for searching here.

Thanks.

Tp_Cust_Rab_Crap.zip

fcyu commented 2 years ago

Thanks for your testing and fasta file.

Hi Felipe @prvst ,

Can you take a look at the fasta file? It contains both UniProt and non-UniProt proteins. Maybe that's the cause.

>TPANIC_0052_YP_008090927_HemK family methyltransferase_05072013
MYCVSRECNLVQELCTIRQARMYARALFQDAPCLRGQNTPLLDADLILSKLLAKPRAWILAHQQDEIASVAHEFKRLVHLRCRGRALAYLTREKEFFGLRFRVTRATLIPKPDTELLVESVLAHVASQMMKPRSVSVHKDTSALPVLKIFEACTGCGCIAIALMHMLRARGTPPLYVIASDICMRALAVARYNARRLLDVSANSRVRFVHADVRAPIPFFSPSEGTDVVQERGVCVPYDVICANPPYVPSAQARALLQDGRGEPLGALDGGADGLDLVRAFAHHSAAALKEGGCVFCEVGSNHAQRAARIFQAAGFATVKISKDLSGKERLISGILRSQSRAVTAPSG

>TPANIC_0591_YP_008091429_bifunctional Hpr kinase phosphatase_05072013
MLKLDLKERDSLDLRCIAGHHGLANPITISDLNRPGLVLSGFFDLFAYRRIQLFGRGEHAYLLALLEQGRYGAIEKMFTFDLPCCIFSHGITPPEKFLHLAEPSSCPILVTRLTSSELSLRLMRVLSNIFAPTIALHGVLVEVYGVGILISGDSGVGKSETALELIERGHRLVADDLVEISCVNGNSLIGRGVHKSIGHHMEIRGLGIINITQLYGVGSIRERKEIQMVVQLEEWNSSKAYDRLGTQELNTTILDVSVPLIEIPVRPGRNIPIILETAAMNERLKRMGYFSAKEFNQSVLKLMEQNAAHAPYYRPDDTY

>TPANIC_0773_YP_008091601_S1 family peptidase Do_05072013
MRNKVRVLAVVAALAAACAVGFFLGRWFDFSARSSVLEAADSLSVSSSEAASFSTVVAEGDPYTVDERQNIAVYRSANEAVVNITTEMVGVNWFLEPVPLEGGSGSGAIIDARGYVLTNTHVIEGASKIYLSLHDGSQYKATVVGVDRENDLAVLKFVSPPGARLTVIRFGSSRNLDVGQKVLAIGNPFGLARTLTVGVVSALARPIQNKGSIIRNMIQTDAAINPGNSGGPLLDTQGRMIGINTVIYSTSGSSSGVGFAVPVDTAKRIVSELIRYGRVRRGKIDAELVQVNASIAHYAQLTVGKGLLVSQVKRGSPAAQAGLRGGTTAVRYGLGRRAAVIYLGGDVITAIDNQPVANLSDYYSVLEDKKPDDEVRVTVLRGRRQHVVAVRLTERSDE

>TPANIC_0841_YP_008091666_S1 family peptidase Do_05072013
MPSADTIARRVAGDSGNAGGRTLLPVGVSRESVQLLERLQNANRQVTAEVLPSVVTLDVVETRKVRVRDPFGGFPWFFFRGPEGPGAGPGGGSGNKGEAEEREYKTEGLGSGVIVKKTGKTHYVLTNYHVAGKANEIEIKLHDGRIVKGKLVGGDQRKDIALVSFEDADPNIRVAVLGDSDAVRVGDIVFAVGSPLGYTSTVTQGIISALGRFGGPGNNINDFIQTDAAINQGNSGGPMVNIYGEVIGINAWIASSSGGSQGIGFSIPINNVKSDIESFIQYGQVKYGWLGVQLVATDADTVASLGIAKGTKGVLAAEIFLGSPAHKGGLKPGDYCVKLNGKEVKDVNQFVRDVGALRIGQTAVFDLIRGGVPMTLSVRITERDEKIVNDYSKLWPGFIPLPLTEAVRKRLDLKASVRGVLVSNAQSKSPAALMGLKSADIVVAVNDQRVSSVREFYAVLARQTREVWFDVLRDGQTLSTVRFRF
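A quick way to see how a database splits between the two header styles is sketched below. It is only an illustration: the regular expression assumes UniProt-style headers look like `>sp|ACCESSION|ENTRY_NAME ...` or `>tr|ACCESSION|ENTRY_NAME ...`, it does not account for decoy or contaminant prefixes, and the function name, the `sp|P00000|...` entry and the sequence lines in the demo are made up.

```python
import io
import re
from collections import Counter

# Assumed shape of a UniProt-style header: >sp|ACCESSION|ENTRY_NAME description
UNIPROT_HEADER = re.compile(r"^>(?:sp|tr)\|[^|\s]+\|\S+")


def classify_headers(fasta_file):
    """Count FASTA headers that look UniProt-style versus everything else."""
    counts = Counter()
    for line in fasta_file:
        if line.startswith(">"):
            kind = "uniprot" if UNIPROT_HEADER.match(line) else "other"
            counts[kind] += 1
    return counts


# One custom header from this thread and one made-up UniProt-style header;
# the sequence lines are placeholders, not real sequences.
fasta_demo = io.StringIO(
    ">TPANIC_0052_YP_008090927_HemK family methyltransferase_05072013\n"
    "MKTAYIAK\n"
    ">sp|P00000|EXAMPLE_HUMAN Made-up entry\n"
    "MSTNPKPQ\n"
)
header_counts = classify_headers(fasta_demo)
print(header_counts)
```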

Thanks,

Fengchao

prvst commented 2 years ago

The database seems OK. @treponeme can you share your output tables?

treponeme commented 2 years ago

Hi

Here are the output tables. I will send the psm files separately.

Thanks


Simon Houston Ph.D. Research Associate Department of Biochemistry and Microbiology University of Victoria British Columbia



treponeme commented 2 years ago

Hi

Here are the output tables.

Thanks

Output_tables samples 256 257 258 and Combined.zip

treponeme commented 2 years ago

Hi Felipe @prvst

Any luck with the output files?

Thanks.

prvst commented 2 years ago

Hi @treponeme, for TPANIC_0486_TPANIC_RS02360_hypothetical protein_WP_012460559_05072013 I see 4 PSMs and 4 peptides, and the numbers match the sample 256 protein table. I thought that it was resolved. Did I miss something?

treponeme commented 2 years ago

Hi Felipe @prvst

Unfortunately, the three issues described in my first two comments above have not been resolved, namely:

  1. In many cases, the number of unique peptides reported for a protein in the protein.tsv files differs from the number of unique peptides listed in the corresponding psm.tsv files. This problem still exists for many proteins.
  2. Protein probabilities for each detected protein in the combined results file and in the three individual sample files (samples 256, 257, 258; protein.tsv files; see attached files in my first comment above) are all identical, so only one protein probability value exists for each protein in all four files. I would expect the probabilities for the same protein to differ at least somewhat between the three samples.
  3. Many proteins that were detected at high probabilities (>0.95) in the protein.tsv files (see attached files in my first comment above) of each of the individual samples (samples 256, 257, 258) are listed with values equal to zero for "Total Peptides", "Unique Peptides", "Razor peptides", "Total Spectral Count", "Unique Spectral Count" and "Razor Spectral Count". My question is, how can these proteins be considered high probability if no peptides or spectra were found/reported?
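The second point can be quantified with a short script that compares the probability column across the per-sample tables. This is a minimal sketch, assuming each protein.tsv has `Protein` and `Protein Probability` columns; the function names and the demo values are hypothetical.

```python
import csv
import io


def read_probabilities(protein_file):
    """Map protein name to its reported probability for one protein.tsv."""
    return {
        row["Protein"]: float(row["Protein Probability"])  # assumed column names
        for row in csv.DictReader(protein_file, delimiter="\t")
    }


def identical_across_tables(tables):
    """Return proteins present in at least two tables with the same value in all of them."""
    seen = {}
    for probabilities in tables.values():
        for protein, probability in probabilities.items():
            seen.setdefault(protein, []).append(probability)
    return sorted(
        protein
        for protein, values in seen.items()
        if len(values) > 1 and len(set(values)) == 1
    )


# Made-up values: protA is identical in both tables, protB is not.
demo_tables = {
    "256": read_probabilities(io.StringIO(
        "Protein\tProtein Probability\nprotA\t1.0\nprotB\t0.97\n")),
    "257": read_probabilities(io.StringIO(
        "Protein\tProtein Probability\nprotA\t1.0\nprotB\t0.91\n")),
}
identical_demo = identical_across_tables(demo_tables)
print(identical_demo)
```

If every shared protein comes back as identical across samples 256, 257 and 258, that supports the observation that a single probability is being reported per protein rather than one per sample.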

Thank you.

prvst commented 2 years ago

Are you using the latest releases?

treponeme commented 2 years ago

To confirm, what are the latest releases? I believe I may have used the version before the latest.

prvst commented 2 years ago

Philosopher 4.1.1 and FragPipe 17.1

treponeme commented 2 years ago

No. I used Philosopher 4.1 and FragPipe 17. Do these new versions eliminate all three problems I listed above today?

prvst commented 2 years ago

Hard to say, since your issue seems to be particular to your case, though some changes might affect the final reporting. Nevertheless, we need to debug the latest version, since that's the current codebase. I suggest you run the pipeline again, making sure all files are in place, no temporary files are present, and your workspace is clean. If the problem persists, you can send me some of your data, and I'll debug the processing myself on Monday.