Closed ndtivendale closed 1 year ago
Hi @ndtivendale, to help you I first need to see what you're doing, so please share your files, including the logs and the outputs. Note that [filename].tsv is not part of the Philosopher output.
OK. Here are the output files for one rep. milla00490592b.xlsx protein_t-24_2.xlsx psm_t-24_2.xlsx
milla00490592b was the original file name. What is the difference between this output file and the psm output file?
And why do some proteins appear in the protein file but not the psm or milla00490592b file?
Here is the log. log_2022-04-13_11-37-40.txt
You are using an old version of the tools
Version info: FragPipe version 16.0 MSFragger version 3.3 Philosopher version 4.0.0 (build 1626989421)
Please upgrade to the latest FragPipe 17.1 and the latest Philosopher. There have been many fixes since the versions you used, so we cannot go back to look at your files. If you still see an issue with the latest versions, we will be able to investigate.
Best, Alexey
From: Nathan @.> Sent: Monday, April 18, 2022 9:12:50 AM To: Nesvilab/FragPipe @.> Cc: Subscribed @.***> Subject: Re: [Nesvilab/FragPipe] Some proteins appearing in protein file but not peptide file (Issue #646)
OK, but when I tried to do that recently, I did not get any Arabidopsis proteins. Only things like human keratin and porcine trypsin. There is a problem with the new version (not sure which tool is causing the problem), which I reported in a separate issue thread.
@ndtivendale Are you using a database from TAIR ?
Yep. I can't upload it here because it's in FASTA format, but here it is converted to a data frame in R. tair10_fasta_dataframe.csv
@ndtivendale You had an issue because the version you have does not know how to read a TAIR database header. You can grab a pre-release version which contains the function you need
https://www.dropbox.com/work/Public/Philosopher/Release%20Candidate
@prvst Thanks, I will do that.
That's solved one issue. The other issues remain though.
@prvst is there another way you can share that file? I can't seem to get it from dropbox. Something to do with needing a Dropbox Business Account.
@ndtivendale [filename].tsv files are not created by Philosopher, so I can't tell you exactly why they are different. Regarding your first question, I'm afraid I'm going to need some examples. If you can generate new outputs with the version in the link, then I'll be able to look at that for you. The Dropbox link should be public. I reviewed the permissions. Please try again.
https://www.dropbox.com/sh/0mr4zbprhaxk453/AADdLawYWnQ_-tekDkdnLscWa?dl=0
@ndtivendale the [filename].tsv files are generated by MSFragger before any FDR filtering is done, so they will contain more (typically many more) entries than the Philosopher outputs since they have no FDR applied. They are typically not needed for most analyses - you can set the MSFragger output to just "pepXML" rather than "pepXML_tsv" if you don't want them.
@dpolasky OK. But what is the difference between psm and peptide outputs?
The PSM table contains the list of all (FDR-approved) PSMs from the experiment. The peptide table is the list of (FDR-approved) peptides. To make this list, we collapse all PSMs to the peptide sequence.
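To make the "collapse" step concrete, here is a minimal sketch of grouping PSM rows by peptide sequence. It is not Philosopher's actual code: the row keys (`Spectrum`, `Peptide`, `Protein`), the `Spectral Count` field, and the toy identifiers are assumptions chosen for illustration only.

```python
from collections import defaultdict


def collapse_psms_to_peptides(psms):
    """Group PSM rows by peptide sequence (illustrative only).

    Each PSM is a dict with the hypothetical keys 'Spectrum', 'Peptide'
    and 'Protein'. One row is returned per distinct peptide sequence.
    """
    grouped = defaultdict(list)
    for psm in psms:
        grouped[psm["Peptide"]].append(psm)

    peptides = []
    for sequence, rows in grouped.items():
        peptides.append({
            "Peptide": sequence,
            "Spectral Count": len(rows),      # number of PSMs behind this peptide
            "Protein": rows[0]["Protein"],    # assumption: take the first PSM's protein
        })
    return peptides


# Toy rows, not taken from any real output file.
psms = [
    {"Spectrum": "run1.00001", "Peptide": "AAAK", "Protein": "PROT_A"},
    {"Spectrum": "run1.00002", "Peptide": "AAAK", "Protein": "PROT_A"},
    {"Spectrum": "run1.00003", "Peptide": "CCCR", "Protein": "PROT_B"},
]
peptides = collapse_psms_to_peptides(psms)
```

Three PSMs collapse to two peptide rows here, which is why the peptide table is normally shorter than the PSM table.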
@prvst OK, I'll try that. In the meantime, here are three examples of proteins that are in the protein file but not the psm file for the replicate that I posted at the beginning of the thread. There are 175 more examples of such proteins in this replicate. AT1G12010.1 AT1G47500.1 AT2G29470.1
@prvst, thank you.
@dpolasky, so there is no filtering applied to the [filename] file at all?
@ndtivendale there's almost no filtering - MSFragger will only report spectra with at least `minimum_peaks` ions in the spectrum and `min_matched_fragments` ions matched to a peptide, but nothing other than that (nothing related to score or FDR control).
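As a rough illustration of the two checks described above (not MSFragger's implementation), the reporting condition can be sketched as a single predicate. The function name and the threshold values used in the example are assumptions; only the two parameter names come from this thread.

```python
def passes_reporting_checks(n_peaks, n_matched_fragments,
                            minimum_peaks, min_matched_fragments):
    """Return True if a spectrum/peptide match would be reported.

    Hypothetical helper: it mirrors only the two conditions described in
    the thread and applies no score or FDR filtering.
    """
    return (n_peaks >= minimum_peaks
            and n_matched_fragments >= min_matched_fragments)


# Illustrative thresholds only; check your own parameter file for real values.
reported = passes_reporting_checks(n_peaks=40, n_matched_fragments=6,
                                   minimum_peaks=15, min_matched_fragments=4)
skipped = passes_reporting_checks(n_peaks=10, n_matched_fragments=6,
                                  minimum_peaks=15, min_matched_fragments=4)
```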
@dpolasky Thanks. How does the FDR filtering work? This may be getting a little off topic, but I want to understand.
@ndtivendale Without going too much into the details, the 'raw' output from MSFragger gets modeled in PeptideProphet (and has protein inference done in ProteinProphet) before the actual FDR filtering is done by Philosopher. From the log files, it looks like you're using the "sequential" filtering approach, which means that a first-pass FDR filter is applied at 1% at the PSM, ion, peptide, and protein levels, and then a second pass removes any PSMs, ions, and peptides belonging to proteins that did not pass the protein FDR filter. You can see that happening in the log: the output of the filter command shows the numbers passing the filter after the first and second passes, and the number of decoy PSMs after the second pass typically drops to well below the FDR that was actually set. This is done within each experiment group, so synchronizing across many groups can be tricky, as different proteins may pass FDR in different sub-groups.
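The two-pass idea can be sketched in a few lines. This is a heavily simplified toy, not Philosopher's algorithm: it skips the PeptideProphet/ProteinProphet modeling and the ion and peptide levels, estimates FDR as decoys divided by targets, and scores each protein by its best PSM. All of those choices, and every name in the code, are assumptions made for illustration.

```python
def accept_at_fdr(items, max_fdr):
    """Return the longest score-ranked prefix with decoys/targets <= max_fdr.

    items: list of (score, is_decoy, key) tuples.
    """
    ranked = sorted(items, key=lambda item: item[0], reverse=True)
    targets = decoys = 0
    best_cut = 0
    for index, (_, is_decoy, _) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best_cut = index
    return ranked[:best_cut]


def sequential_filter(psms, max_fdr=0.01):
    """psms: list of dicts with the hypothetical keys 'score', 'is_decoy', 'protein'."""
    # First pass, PSM level.
    psm_items = [(p["score"], p["is_decoy"], i) for i, p in enumerate(psms)]
    passing_psm_ids = {key for _, _, key in accept_at_fdr(psm_items, max_fdr)}

    # First pass, protein level (assumption: protein score = best PSM score).
    best = {}
    for p in psms:
        current = best.get(p["protein"])
        if current is None or p["score"] > current[0]:
            best[p["protein"]] = (p["score"], p["is_decoy"], p["protein"])
    passing_proteins = {
        key for _, _, key in accept_at_fdr(list(best.values()), max_fdr)
    }

    # Second pass: drop PSMs whose protein failed the protein-level filter.
    return [
        p for i, p in enumerate(psms)
        if i in passing_psm_ids and p["protein"] in passing_proteins
    ]


# Toy data: 100 strong PSMs for P1, one decoy PSM, one weak PSM for P2.
psms = [{"score": 200 - i, "is_decoy": False, "protein": "P1"} for i in range(100)]
psms.append({"score": 50, "is_decoy": True, "protein": "rev_P9"})
psms.append({"score": 40, "is_decoy": False, "protein": "P2"})
kept = sequential_filter(psms)
```

In this toy set all 102 PSMs pass the PSM-level cut, but only P1 passes at the protein level, so the second pass removes both the decoy PSM and the P2 PSM. That is the same effect described above, where the decoy count drops after the second pass.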
OK. That answers one question and I thank you all for that. But what about the main issue I raised. Why are some proteins present in the protein file but not the psm file? I could understand the other way around. I could understand a peptide being mapped to a particular protein in the psm file but then filtered out in the protein file, but if it's present in the protein file, surely it should be present in the psm file as well, right?
Hi @ndtivendale. As we mentioned above, and in the other issue you opened, you're reporting a problem with a quite old version of philosopher and fragpipe. Please update your tools to the latest version, run them again, and report back if you still see discrepancies.
OK, so I have generated some data from the new version. I've attached the psm and protein files for one sample. There are proteins that appear in the protein file, but not in the psm file. For example AT1G01820.1. psm_000_1.xlsx protein_000_1.xlsx log_2022-05-18_20-05-04.txt
The protein is actually in the file, but it's classified as an alternative protein, not as a main identification. I'll try to reproduce your case here.
@prvst Where? Sorry, I am confused.
AT1G01820.1 can be found in the Mapped Proteins column in the psm table, and it's in the Protein column in the protein table.
@prvst OK. So there is evidence that AT1G01820.1 is in the sample but the peptide supporting it is better mapped to another protein? I'm confused. Is there a tutorial I can look at?
Exactly, your target protein is sharing peptides with another protein. The tools that perform the validation and the inference determined that they should consider the other protein the "main" identification based on their criteria of validation. If you want to read more about how these steps happen, and what the tools are doing, I suggest this really nice review from Alexey.
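For anyone wanting to check this on their own tables, here is a small sketch that lists proteins from the protein table that appear in neither the Protein nor the Mapped Proteins column of the psm table. The column names follow this thread; the comma as separator inside Mapped Proteins, the function name, and the toy rows are assumptions, so adjust them to your files.

```python
def proteins_without_psm_support(protein_rows, psm_rows, separator=","):
    """Return protein IDs that no PSM row mentions in either column.

    protein_rows / psm_rows: lists of dicts keyed by column name.
    Assumption: 'Mapped Proteins' holds a separator-delimited list of IDs.
    """
    supported = set()
    for row in psm_rows:
        supported.add(row["Protein"])
        mapped = row.get("Mapped Proteins") or ""
        for name in mapped.split(separator):
            if name.strip():
                supported.add(name.strip())
    return [row["Protein"] for row in protein_rows
            if row["Protein"] not in supported]


# Toy rows with made-up identifiers.
psm_rows = [
    {"Protein": "PROT_A", "Mapped Proteins": "PROT_B, PROT_C"},
    {"Protein": "PROT_A", "Mapped Proteins": ""},
]
protein_rows = [{"Protein": "PROT_A"}, {"Protein": "PROT_B"}, {"Protein": "PROT_D"}]
missing = proteins_without_psm_support(protein_rows, psm_rows)
```

Real tables could be loaded into such lists of dicts with `csv.DictReader(handle, delimiter="\t")`. Only the proteins left in `missing` are the genuinely unexplained ones.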
@prvst OK, but I still have a problem. It's a smaller problem now, but there are still proteins that appear in the protein file but not in the psm table, even in the Mapped Proteins column. For example, in the files I shared earlier, AT4G08140.1 is listed in the protein file but not in the psm file.
@ndtivendale Can you generate a new set of results using the releases from Friday? Let's check whether you still see this happening with the latest release; if so, I can take a look for you.
OK, I did that. There are nowhere near as many examples now, but there are still some. There are two examples now in the attached files. The examples are AT4G05590.2 and AT4G05590.1. protein_t-24_2.tsv.xlsx psm_t-24_2.tsv.xlsx
I can take a look for you, but I'll need your files. Can you send them to me?
@prvst you mean the mzML files?
Send me the interact files, and the database
@prvst Which ones are the interact files?
Check the folder for files with interact in the name, and also the combined prot.xml, and the database
[EDITED - Felipe]
Got it. Thank you.
Could you explain to me how you assembled this database? The format doesn't seem to match the official TAIR format, and the description of certain entries is extremely long. The number of entries also doesn't match the Arabidopsis genome, so I'm curious.
I just downloaded it from TAIR. That's it.
Hello everyone,
I want to share that we also had issues with the TAIR database in FP17. We narrowed the cause down to the "|" (pipe) characters in the description lines. In our case, all proteins from TAIR disappeared and only the contaminants were identified.
Furthermore, we have reason to believe that a protein with NO description at all (only an accession number) also "disappeared" from the results in our command-line-triggered FragPipe analysis, but NOT in the FP GUI.
We solved it by modifying the database.
cheers - jonas
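The exact modification is not given in this thread. Purely as an illustration of the kind of header clean-up described (removing pipes from the description and giving description-less entries a placeholder), a sketch might look like the following. The function name, the choice of ";" as replacement, the placeholder text, and the toy headers are all assumptions, and whether the result is accepted by a given Philosopher version would need to be tested.

```python
def sanitize_header(header, placeholder="no description"):
    """Rewrite one FASTA header line (starting with '>').

    Hypothetical clean-up: replaces '|' in the description with ';' and
    adds a placeholder when an entry has an accession but no description.
    """
    body = header[1:].strip()
    accession, _, description = body.partition(" ")
    description = description.replace("|", ";")
    description = " ".join(description.split())   # normalize whitespace
    description = description.strip("; ")         # drop leading/trailing separators
    if not description:
        description = placeholder
    return f">{accession} {description}"


# Toy headers, loosely shaped like the TAIR10 style discussed in this thread.
with_pipes = sanitize_header(
    ">AT1G00001.1 | Symbols: | unknown protein | chr1:1-100 FORWARD LENGTH=33"
)
bare = sanitize_header(">AT1G00002.1")
```

Only lines starting with ">" should be passed through such a function; sequence lines are left unchanged.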
I asked about the database because I got a snapshot of the current proteome from TAIR, and the format and number differed from what you showed to me. What you have is very similar to a set of sequences based on genomics data.
Here's where I got it from: https://www.arabidopsis.org/download/index-auto.jsp?dir=%2Fdownload_files%2FProteins%2FTAIR10_protein_lists
The first file on that page.
Indeed, that seems to be a set of sequences predicted from the TAIR10 release. For some reason, they include orthology information in the FASTA headers, which might be handy in certain situations, but highly unusual. Why don't you try the Araport11 release instead?
EDIT: An example of what I mentioned above:
>AT1G24851.1 | Symbols: | unknown protein; BEST Arabidopsis thaliana protein match is: unknown protein (TAIR:AT1G25025.1); Has 6812 Blast hits to 2172 proteins in 306 species: Archae - 0; Bacteria - 350; Metazoa - 1330; Fungi - 775; Plants - 255; Viruses - 55; Other Eukaryotes - 4047 (source: NCBI BLink). | chr1:8778280-8779056 FORWARD LENGTH=258
I have downloaded the Araport11 release and am currently trying that.
I still have the same problem with a limited number of proteins. In these files, AT1G61150.10 and ATMG01275-2.1 are in the protein but not the psm file (the latter is also a notation I've not encountered before, but that's a separate issue). The log for generating these files is attached. log_2022-06-22_05-39-56.txt protein_000_1.xlsx psm_000_1.xlsx
@ndtivendale I can't reproduce your problem with the files I have, so if you want me to take a closer look, I'll need you to send me all your files
@prvst But you can see the problem, right? That AT1G61150.10 is in the protein file but not the psm file for that sample?
Yes, but I need to reproduce the occurrence to see whether it's an artifact of misusing the parameters or a bug.
OK. How can I share them with you? They're quite large.
Send me your email, and I'll send you a file request via Dropbox
OK, it's nathan.tivendale@uwa.edu.au
Describe the bug: In my output files, the protein.tsv file contains some proteins that are not present in the peptide.tsv file or the [filename].tsv file. Can you explain to me what is going on?
Also, can you explain the difference between the peptide.tsv file and the [filename].tsv file? I understand they have different headers, but what is the difference apart from that?