Nesvilab / FragPipe

A cross-platform proteomics data analysis suite
http://fragpipe.nesvilab.org

Some proteins appearing in protein file but not peptide file #646

Closed ndtivendale closed 1 year ago

ndtivendale commented 2 years ago

Describe the bug

In my output files, the protein.tsv file contains some proteins that are not present in the peptide.tsv file or the [filename].tsv file. Can you explain to me what is going on?

Also, can you explain the difference between the peptide.tsv file and the [filename].tsv file? I understand they have different headers, but what is the difference apart from that?



prvst commented 2 years ago

Hi @ndtivendale, to help you I first need to see what you're doing, so please share your files, including the logs and the outputs. Note that [filename].tsv is not part of the Philosopher output.

ndtivendale commented 2 years ago

OK. Here are the output files for one rep. milla00490592b.xlsx protein_t-24_2.xlsx psm_t-24_2.xlsx

milla00490592b was the original file name. What is the difference between this output file and the psm output file?

And why do some proteins appear in the protein file but not the psm or milla00490592b file?

Here is the log. log_2022-04-13_11-37-40.txt

anesvi commented 2 years ago

You are using an old version of the tools.

Version info:
FragPipe version 16.0
MSFragger version 3.3
Philosopher version 4.0.0 (build 1626989421)

Please upgrade to the latest FragPipe 17.1 and the latest Philosopher. There have been many fixes since the versions you used, so we cannot go back to look at your files. If you still see an issue with the latest versions, we will be able to investigate.

Best,
Alexey


ndtivendale commented 2 years ago

OK, but when I tried to do that recently, I did not get any Arabidopsis proteins. Only things like human keratin and porcine trypsin. There is a problem with the new version (not sure which tool is causing the problem), which I reported in a separate issue thread.

prvst commented 2 years ago

@ndtivendale Are you using a database from TAIR ?

ndtivendale commented 2 years ago

Yep. Can't upload it here because it's in fasta format but here is converted to a df in r. tair10_fasta_dataframe.csv

prvst commented 2 years ago

@ndtivendale You had an issue because the version you have does not know how to read a TAIR database header. You can grab a pre-release version which contains the function you need

https://www.dropbox.com/work/Public/Philosopher/Release%20Candidate

ndtivendale commented 2 years ago

@prvst Thanks, I will do that.

That's solved one issue. The other issues remain though.

  1. Why are some proteins appearing in the protein file but not the peptide file?
  2. What is the difference between the peptide.tsv, psm.tsv and the [filename].tsv file? They are different lengths, I can see that. And for individual protein IDs there are fewer peptides in the peptide file than the psm.tsv and [filename].tsv file, so I assume it's some sort of filtering step, but I would like to know what filters are being applied.

ndtivendale commented 2 years ago

@prvst is there another way you can share that file? I can't seem to get it from dropbox. Something to do with needing a Dropbox Business Account.

prvst commented 2 years ago

@ndtivendale [filename].tsv files are not created by Philosopher, so I can't tell you exactly why they are different. Regarding your first question, I'm afraid I'm going to need some examples. If you can generate new outputs with the version in the link, then I'll be able to look at that for you. The Dropbox link should be public; I reviewed the permissions. Please try again.

https://www.dropbox.com/sh/0mr4zbprhaxk453/AADdLawYWnQ_-tekDkdnLscWa?dl=0

dpolasky commented 2 years ago

@ndtivendale the [filename].tsv files are generated by MSFragger before any FDR filtering is done, so they will contain more (typically many more) entries than the Philosopher outputs since they have no FDR applied. They are typically not needed for most analyses - you can set the MSFragger output to just "pepXML" rather than "pepXML_tsv" if you don't want them.

ndtivendale commented 2 years ago

@dpolasky OK. But what is the difference between psm and peptide outputs?

prvst commented 2 years ago

The PSM table contains the list of all (FDR-approved) PSMs from the experiment. The peptide table is the list of (FDR-approved) peptides. To make this list, we collapse all PSMs to the peptide sequence.
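As a rough illustration of what "collapsing PSMs to the peptide sequence" means, here is a minimal Python sketch. It is not Philosopher's code: the column names "Peptide" and "Protein", the peptide sequences and the protein IDs (PROT_A, PROT_B) are all made up for the example.

```python
from collections import defaultdict


def collapse_psms_to_peptides(psm_rows):
    """Group PSM rows by peptide sequence.

    psm_rows: iterable of dicts with "Peptide" and "Protein" keys
    (hypothetical column names, chosen for this illustration).
    Returns a dict: peptide -> {"spectral_count": int, "proteins": set}.
    """
    peptides = defaultdict(lambda: {"spectral_count": 0, "proteins": set()})
    for row in psm_rows:
        entry = peptides[row["Peptide"]]
        entry["spectral_count"] += 1           # one more PSM for this sequence
        entry["proteins"].add(row["Protein"])  # remember which protein it maps to
    return dict(peptides)


# Made-up PSMs: two spectra match the same peptide, one matches another.
psms = [
    {"Spectrum": "run1.00001", "Peptide": "AAGLDEK", "Protein": "PROT_A"},
    {"Spectrum": "run1.00002", "Peptide": "AAGLDEK", "Protein": "PROT_A"},
    {"Spectrum": "run1.00003", "Peptide": "LLVNPER", "Protein": "PROT_B"},
]
peptide_table = collapse_psms_to_peptides(psms)
print(len(psms), "PSMs collapse to", len(peptide_table), "peptides")
```

This is why the peptide table is shorter than the PSM table: every spectrum gets its own PSM row, but repeated identifications of the same sequence end up as a single peptide row.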

ndtivendale commented 2 years ago

@prvst OK, I'll try that. In the meantime, here are three examples of proteins that are in the protein file but not the psm file for the replicate that I posted at the beginning of the thread: AT1G12010.1, AT1G47500.1 and AT2G29470.1. There are 175 more examples of such proteins in this replicate.

ndtivendale commented 2 years ago

@prvst, thank you.

ndtivendale commented 2 years ago

@dpolasky, so there is no filtering applied to the [filename] file at all?

dpolasky commented 2 years ago

@ndtivendale there's almost no filtering - MSFragger will only report spectra with at least minimum_peaks ions in the spectrum and min_matched_fragments ions matched to a peptide, but nothing other than that (nothing related to score or FDR control).
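To make those two conditions concrete, here is a toy Python sketch. Only the parameter names minimum_peaks and min_matched_fragments come from the comment above; the function name and the threshold values are assumptions for illustration and do not reflect MSFragger's internals or your settings.

```python
def would_be_reported(n_peaks, n_matched, minimum_peaks, min_matched_fragments):
    """Toy version of the two reporting conditions described above.

    n_peaks: number of ions in the spectrum.
    n_matched: number of fragment ions matched to the candidate peptide.
    No score or FDR criterion is involved at this stage.
    """
    return n_peaks >= minimum_peaks and n_matched >= min_matched_fragments


# Example thresholds, assumed for this sketch only.
MINIMUM_PEAKS = 15
MIN_MATCHED_FRAGMENTS = 4

good = would_be_reported(120, 9, MINIMUM_PEAKS, MIN_MATCHED_FRAGMENTS)
too_few_matches = would_be_reported(120, 2, MINIMUM_PEAKS, MIN_MATCHED_FRAGMENTS)
sparse_spectrum = would_be_reported(8, 6, MINIMUM_PEAKS, MIN_MATCHED_FRAGMENTS)
print(good, too_few_matches, sparse_spectrum)
```

Anything that clears both checks is written to the [filename].tsv file regardless of how well it scores, which is why that file is so much longer than the Philosopher outputs.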

ndtivendale commented 2 years ago

@dpolasky Thanks. How does the FDR filtering work? This may be getting a little off topic, but I want to understand.

dpolasky commented 2 years ago

@ndtivendale Without going too much into the details, the 'raw' output from MSFragger gets modeled in PeptideProphet (and has protein inference done in ProteinProphet) before the actual FDR filtering is done by Philosopher. From the log files, it looks like you're using the "sequential" filtering approach, which means that a first-pass FDR filter is applied at 1% at the PSM, ion, peptide, and protein levels, and then a second pass removes any PSMs, etc. from proteins that did not pass the protein FDR filter. You can see that happening in the log: the output of the filter command shows the numbers passing the filter after the first and second passes, and the proportion of decoy PSMs after the second pass typically drops to well below the FDR that was set. This is done within each experiment group, so synchronizing across many groups can be tricky, as different proteins may pass FDR in different sub-groups.
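The two-pass idea can be sketched in Python as follows. This is a simplified illustration, not Philosopher's implementation: it uses a plain decoy/target ratio as the FDR estimate, skips the ion and peptide levels, and all function names, scores and protein IDs are invented for the example.

```python
def score_threshold_at_fdr(items, max_fdr=0.01):
    """Return the lowest score cutoff whose decoy/target ratio is <= max_fdr.

    items: list of (score, is_decoy) tuples; a higher score is better.
    Returns None if no cutoff satisfies the limit.
    """
    threshold = None
    targets = decoys = 0
    for score, is_decoy in sorted(items, key=lambda x: x[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            threshold = score
    return threshold


def sequential_filter(psms, protein_scores, max_fdr=0.01):
    """Apply a first-pass cutoff, then drop PSMs from proteins that failed."""
    # First pass: independent cutoffs at the PSM level and the protein level.
    psm_cut = score_threshold_at_fdr(
        [(p["score"], p["is_decoy"]) for p in psms], max_fdr)
    prot_cut = score_threshold_at_fdr(list(protein_scores.values()), max_fdr)
    first_pass = [p for p in psms
                  if psm_cut is not None and p["score"] >= psm_cut]
    passing_proteins = {name for name, (score, _) in protein_scores.items()
                        if prot_cut is not None and score >= prot_cut}
    # Second pass: keep only PSMs whose protein passed the protein filter.
    second_pass = [p for p in first_pass if p["protein"] in passing_proteins]
    return first_pass, second_pass


# Made-up PSMs and protein-level scores.
psms = [
    {"score": 10, "protein": "PROT_A", "is_decoy": False},
    {"score": 9, "protein": "PROT_A", "is_decoy": False},
    {"score": 8, "protein": "PROT_B", "is_decoy": False},
    {"score": 7, "protein": "DECOY_X", "is_decoy": True},
    {"score": 6, "protein": "PROT_C", "is_decoy": False},
]
protein_scores = {
    "PROT_A": (0.99, False),
    "DECOY_X": (0.60, True),
    "PROT_B": (0.40, False),
}
first_pass, second_pass = sequential_filter(psms, protein_scores)
print(len(first_pass), "PSMs after the first pass,",
      len(second_pass), "after the second")
```

In this toy data the PSM for PROT_B survives the PSM-level cutoff but is removed in the second pass because PROT_B itself does not pass the protein-level cutoff.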

ndtivendale commented 2 years ago

OK. That answers one question and I thank you all for that. But what about the main issue I raised. Why are some proteins present in the protein file but not the psm file? I could understand the other way around. I could understand a peptide being mapped to a particular protein in the psm file but then filtered out in the protein file, but if it's present in the protein file, surely it should be present in the psm file as well, right?

prvst commented 2 years ago

Hi @ndtivendale. As we mentioned above, and in the other issue you opened, you're reporting a problem with a quite old version of philosopher and fragpipe. Please update your tools to the latest version, run them again, and report back if you still see discrepancies.

ndtivendale commented 2 years ago

OK, so I have generated some data from the new version. I've attached the psm and protein files for one sample. There are proteins that appear in the protein file, but not in the psm file. For example AT1G01820.1. psm_000_1.xlsx protein_000_1.xlsx log_2022-05-18_20-05-04.txt

prvst commented 2 years ago

The protein is actually in the file, but it's classified as an alternative protein, not as a main identification. I'll try to reproduce your case here.

ndtivendale commented 2 years ago

@prvst Where? Sorry, I am confused.

prvst commented 2 years ago

AT1G01820.1 can be found in the Mapped Proteins column in the psm table, and it's in the Protein column in the protein table.
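A quick way to check this across a whole file is to compare the protein table against both columns of the psm table. The sketch below uses only the Python standard library. The column names "Protein" and "Mapped Proteins" follow the ones mentioned in this thread; the comma separator inside "Mapped Proteins", the helper names and the protein IDs are assumptions for illustration, so check them against your own files.

```python
import csv
import io


def protein_ids_in_psm(psm_rows, sep=","):
    """Collect IDs from both the Protein and Mapped Proteins columns."""
    ids = set()
    for row in psm_rows:
        ids.add(row["Protein"].strip())
        for alt in (row.get("Mapped Proteins") or "").split(sep):
            if alt.strip():
                ids.add(alt.strip())
    return ids


def missing_from_psm(protein_rows, psm_rows):
    """Return protein-table IDs that appear in neither psm column."""
    seen = protein_ids_in_psm(psm_rows)
    return sorted(row["Protein"].strip() for row in protein_rows
                  if row["Protein"].strip() not in seen)


# Tiny in-memory stand-ins for psm.tsv and protein.tsv (made-up content).
psm_tsv = (
    "Peptide\tProtein\tMapped Proteins\n"
    "AAGLDEK\tPROT_A\tPROT_B, PROT_C\n"
    "LLVNPER\tPROT_D\t\n"
)
protein_tsv = "Protein\nPROT_A\nPROT_B\nPROT_E\n"

psm_rows = list(csv.DictReader(io.StringIO(psm_tsv), delimiter="\t"))
protein_rows = list(csv.DictReader(io.StringIO(protein_tsv), delimiter="\t"))
missing = missing_from_psm(protein_rows, psm_rows)
print(missing)
```

For real outputs, the same readers can be built from open("psm.tsv", newline="") and open("protein.tsv", newline="") instead of the in-memory strings. Here PROT_B is found only as a mapped (alternative) protein, while PROT_E is absent from both columns and is the kind of case worth reporting.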

ndtivendale commented 2 years ago

@prvst OK. So there is evidence that AT1G01820.1 is in the sample but the peptide supporting it is better mapped to another protein? I'm confused. Is there a tutorial I can look at?

prvst commented 2 years ago

Exactly, your target protein is sharing peptides with another protein. The tools that perform the validation and the inference determined that they should consider the other protein the "main" identification based on their criteria of validation. If you want to read more about how these steps happen, and what the tools are doing, I suggest this really nice review from Alexey.

ndtivendale commented 2 years ago

@prvst OK, but I still have a problem. It's a smaller problem now, but there are still proteins that appear in the protein file but not in the psm table, even in the Mapped Proteins column. For example, in the files I shared earlier, AT4G08140.1 is listed in the protein file but not in the psm file.

prvst commented 2 years ago

@ndtivendale Can you generate a new set of results using the releases from Friday? Let's check whether you still see this happening with the latest release; if so, I can take a look for you.

ndtivendale commented 2 years ago

OK, I did that. There are nowhere near as many examples now, but there are still some. There are two examples now in the attached files. The examples are AT4G05590.2 and AT4G05590.1. protein_t-24_2.tsv.xlsx psm_t-24_2.tsv.xlsx

prvst commented 2 years ago

I can take a look for you, but I'll need your files. Can you send them to me?

ndtivendale commented 2 years ago

@prvst you mean the mzML files?

prvst commented 2 years ago

Send me the interact files, and the database

ndtivendale commented 2 years ago

@prvst Which ones are the interact files?

prvst commented 2 years ago

Check the folder for files with interact in the name, and also the combined prot.xml, and the database

ndtivendale commented 2 years ago

[EDITED - Felipe]

Got it. Thank you.

prvst commented 2 years ago

Could you explain to me how you assembled this database? The format doesn't seem to match the official TAIR format, and the description of certain entries is extremely long. The numbers also don't match the Arabidopsis genome, so I'm curious.

ndtivendale commented 2 years ago

I just downloaded it from TAIR. That's it.

jjGG commented 2 years ago

Hello everyone,

I want to share here that we also have/had issues with the TAIR database in FP17. We narrowed the cause down to the "|" (pipe) characters in the description lines. In our case, all proteins from TAIR disappeared and only the contaminants were identified.

Furthermore, we have reason to believe that if a protein has NO description at all, only an accession number, it also "disappears" from the results when the FragPipe analysis is triggered from the command line, but NOT when it is run from the FP GUI.

We solved it by modifying the database.
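The exact change is not spelled out here, so the following Python sketch shows just one possible way to modify the headers: it keeps the accession, replaces the pipe characters in the description with spaces, and adds a placeholder description when there is none. The function names, the placeholder text and the accession AT1G00000.1 are hypothetical, and the first header is a shortened TAIR10-style line used only as an example. Whether this is enough to fix a given FragPipe run is an assumption to verify.

```python
def sanitize_header(header):
    """Keep the accession and remove '|' characters from the description."""
    body = header[1:] if header.startswith(">") else header
    accession, _, description = body.partition(" ")
    # Replace pipes with spaces, then squeeze repeated whitespace.
    description = " ".join(description.replace("|", " ").split())
    if not description:
        description = "no description"  # placeholder text, chosen arbitrarily
    return f">{accession} {description}"


def sanitize_fasta(lines):
    """Yield FASTA lines with header lines rewritten and sequences untouched."""
    for line in lines:
        line = line.rstrip("\n")
        yield sanitize_header(line) if line.startswith(">") else line


fasta = [
    ">AT1G24851.1 | Symbols: | unknown protein | chr1:8778280-8779056 FORWARD LENGTH=258\n",
    "MAAAKLLV\n",       # made-up sequence
    ">AT1G00000.1\n",   # hypothetical entry with no description
    "MKKLPEER\n",       # made-up sequence
]
cleaned = list(sanitize_fasta(fasta))
for line in cleaned:
    print(line)
```

Keep the original database file and run the search against a copy, so the results can be compared before and after the change.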

cheers - jonas

prvst commented 2 years ago

I asked about the database because I got a snapshot of the current proteome from TAIR, and the format and numbers differed from what you showed me. What you have is very similar to a set of sequences based on genomics data.

ndtivendale commented 2 years ago

Here's where I got it from: https://www.arabidopsis.org/download/index-auto.jsp?dir=%2Fdownload_files%2FProteins%2FTAIR10_protein_lists

The first file on that page.

prvst commented 2 years ago

Indeed, that seems to be a set of sequences predicted from the TAIR10 release. For some reason, they include orthology information in the FASTA headers, which might be handy in certain situations but is highly unusual. Why don't you try the Araport11 release instead?

https://www.arabidopsis.org/download_files/Proteins/Araport11_protein_lists/Araport11_pep_20220103.gz

EDIT: An example of what I mentioned above:

>AT1G24851.1 | Symbols: | unknown protein; BEST Arabidopsis thaliana protein match is: unknown protein (TAIR:AT1G25025.1); Has 6812 Blast hits to 2172 proteins in 306 species: Archae - 0; Bacteria - 350; Metazoa - 1330; Fungi - 775; Plants - 255; Viruses - 55; Other Eukaryotes - 4047 (source: NCBI BLink). | chr1:8778280-8779056 FORWARD LENGTH=258

ndtivendale commented 2 years ago

I have downloaded the Araport11 release and am currently trying that.

ndtivendale commented 2 years ago

I still have the same problem with a limited number of proteins. In these files, AT1G61150.10 and ATMG01275-2.1 are in the protein but not the psm file (the latter is also a notation I've not encountered before, but that's a separate issue). The log for generating these files is attached. log_2022-06-22_05-39-56.txt protein_000_1.xlsx psm_000_1.xlsx

prvst commented 2 years ago

@ndtivendale I can't reproduce your problem with the files I have, so if you want me to take a closer look, I'll need you to send me all your files

ndtivendale commented 2 years ago

@prvst But you can see the problem, right? That AT1G61150.10 is in the protein file but not the psm file for that sample?

prvst commented 2 years ago

Yes, but I need to reproduce the occurrence to see whether it's an artifact of misused parameters or a bug.

ndtivendale commented 2 years ago

OK. How can I share them with you? They're quite large.

prvst commented 2 years ago

Send me your email, and I'll send you a file request via Dropbox

ndtivendale commented 2 years ago

OK, it's nathan.tivendale@uwa.edu.au