Nesvilab / philosopher

PeptideProphet, PTMProphet, ProteinProphet, iProphet, Abacus, and FDR filtering
https://philosopher.nesvilab.org
GNU General Public License v3.0

Philosopher Filter module panics #63

Closed JB91451 closed 5 years ago

JB91451 commented 5 years ago

Dear All,

Today I ran philosopher v1.4.4 with a custom-formatted database (each header line consists only of a unique identifier made of letters, numbers, and underscores, prefixed with "Reverse_" where applicable) and Comet. Unfortunately, it always crashes during the filtering step with the error message below. In this example I set the FDR to 10% to ensure enough data points in this meaningless test search. The same happens in the real search with strict FDR filtering:

############
C:\TPP\data\test>cd C:\TPP\data\test

C:\TPP\data\test>C:\TPP\bin\philosopher_144\philosopher filter --ion 0.100 --pep 0.100 --pepxml interact-181114_qe1_pm1_ths_spp_bsubt_intracell_lyscf3d.pep.xml --picked --prot 0.100 --protxml interact.prot.xml --psm 0.100 --tag Reverse
INFO[10:28:31] Executing Filter v1.4.4
INFO[10:28:31] Processing peptide identification files
INFO[10:28:31] 1+ Charge profile decoy=0 target=0
INFO[10:28:31] 2+ Charge profile decoy=1129 target=1610
INFO[10:28:31] 3+ Charge profile decoy=372 target=1081
INFO[10:28:31] 4+ Charge profile decoy=229 target=574
INFO[10:28:31] 5+ Charge profile decoy=244 target=304
INFO[10:28:31] 6+ Charge profile decoy=86 target=116
INFO[10:28:31] Database search results ions=4484 peptides=4037 psms=5745
INFO[10:28:31] Converged to 10.00 % FDR with 1639 PSMs decoy=164 threshold=0.8569 total=1803
INFO[10:28:31] Converged to 9.96 % FDR with 562 Peptides decoy=56 threshold=0.9863 total=618
INFO[10:28:31] Converged to 10.00 % FDR with 858 Ions decoy=85 threshold=0.9506 total=943
INFO[10:28:31] Protein inference results decoy=1749 target=1640
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/prvst/philosopher/lib/fil.ProtXMLFilter(0x0, 0x0, 0xc00001e1d8, 0x8, 0xc001960000, 0xcf1, 0xd99, 0xc00266b9c0, 0x12, 0x3fb999999999999a, ...)
    /home/prvst/go/src/github.com/prvst/philosopher/lib/fil/fil.go:1079 +0x1c4f
github.com/prvst/philosopher/lib/fil.processProteinIdentifications(0x0, 0x0, 0xc00001e1d8, 0x8, 0xc001960000, 0xcf1, 0xd99, 0xc00266b9c0, 0x12, 0x3fb999999999999a, ...)
    /home/prvst/go/src/github.com/prvst/philosopher/lib/fil/fil.go:609 +0x320
github.com/prvst/philosopher/lib/fil.Run(0xc00001a840, 0x24, 0xc00001eb30, 0x10, 0xc0000206e0, 0x46, 0xc000018a20, 0x1f, 0xc000018a40, 0x16, ...)
    /home/prvst/go/src/github.com/prvst/philosopher/lib/fil/fil.go:67 +0x1030
github.com/prvst/philosopher/cmd.glob..func5(0x8717ea0, 0xc0000d62d0, 0x0, 0xf)
    /home/prvst/go/src/github.com/prvst/philosopher/cmd/filter.go:51 +0xaa8
github.com/spf13/cobra.(*Command).execute(0x8717ea0, 0xc0000d61e0, 0xf, 0xf, 0x8717ea0, 0xc0000d61e0)
    /home/prvst/go/src/github.com/spf13/cobra/command.go:766 +0x2b5
github.com/spf13/cobra.(*Command).ExecuteC(0x8716940, 0x405826, 0xc00005a058, 0x0)
    /home/prvst/go/src/github.com/spf13/cobra/command.go:850 +0x303
github.com/spf13/cobra.(*Command).Execute(...)
    /home/prvst/go/src/github.com/spf13/cobra/command.go:800
github.com/prvst/philosopher/cmd.Execute()
    /home/prvst/go/src/github.com/prvst/philosopher/cmd/root.go:32 +0x35
main.main()
    /home/prvst/go/src/github.com/prvst/philosopher/main.go:24 +0x75
##############

I guess this happens because of the custom header format of the database, but changing the headers to UniProt format would require a lot of changes in the downstream processes. Simply putting "sp|" in front of the identifier and creating the decoys within philosopher did not solve the problem. Is there anything else I could try? I would really like to use the filter module.

Best, Juergen

prvst commented 5 years ago

@JB91451

It doesn't look to me like the error is coming from the protein headers, but I would need to do some testing to verify that. Can you send me your pepxml, protxml, and database files?

JB91451 commented 5 years ago

Dear Felipe, I sent you an email yesterday, via your "@umich" address, with a download link to the corresponding files. Did you receive it?

prvst commented 5 years ago

Yes, I'll get back to you when I have updates.

JB91451 commented 5 years ago

Dear Felipe, I had another, closer look at the error I wrote to you about last week. Contrary to my conclusion back then, it seems to be a problem with the protein mapping and not with the decoy recognition itself. When I run the filter and report commands with my custom database in the most recent philosopher version (1.4.6, through FragPipe this time), the protein column in the PSM table is not identical to the mapped proteins column. I guess that only the latter is considered during protein inference for the ions, peptides, and proteins tables, because in these tables some entries originating from forward proteins (according to the PSM table and the peptide sequence) receive the "Reverse_" prefix, which seems to me to occur randomly in the mapped proteins column. Additionally, for several proteins there is no mapped protein, and in cases where a forward protein was mapped, it was always the one following the actual protein in the database (e.g. lines 555 or 730).

I have attached the psm.tsv file so you can have a look.
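
For anyone who wants to check their own psm.tsv for the symptoms described above, a minimal Python sketch follows. It is not part of philosopher. The column names "Protein" and "Mapped Proteins", the comma separator inside the mapped proteins column, and the helper name flag_psm_rows are assumptions made for illustration. A flagged row only reproduces the observation reported here; it is not necessarily an error.

```python
import csv
import io


def flag_psm_rows(tsv_text, decoy_tag="Reverse_",
                  protein_col="Protein", mapped_col="Mapped Proteins"):
    """Return (line_number, protein, reasons) for rows showing the reported symptoms.

    Column names and the comma separator are assumptions, not a documented format.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    flagged = []
    # Line 1 of the file is the header, so the first data row is line 2.
    for line_number, row in enumerate(reader, start=2):
        protein = (row.get(protein_col) or "").strip()
        mapped = [m.strip() for m in (row.get(mapped_col) or "").split(",") if m.strip()]
        if protein.startswith(decoy_tag):
            continue  # only forward (target) assignments are of interest here
        reasons = []
        if not mapped:
            reasons.append("no mapped protein")
        if any(m.startswith(decoy_tag) for m in mapped):
            reasons.append("decoy in mapped proteins")
        if reasons:
            flagged.append((line_number, protein, reasons))
    return flagged


if __name__ == "__main__":
    sample = (
        "Peptide\tProtein\tMapped Proteins\n"
        "AAK\tSeq_0001\tReverse_Seq_0002\n"
        "CCK\tSeq_0003\tSeq_0004\n"
    )
    for entry in flag_psm_rows(sample):
        print(entry)
```

The line numbers returned refer to lines of the TSV file, which makes it easy to compare against rows such as 555 or 730 mentioned above.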

Best, Juergen

psm.zip

Golden-proteogenomics commented 5 years ago

Yes, I also found that the database used for the search does not include contaminant proteins. What is the correct way to build a database for the search?

JB91451 commented 5 years ago

Dear Felipe,

I finally managed to change all the downstream processing to allow UniProt-format headers. After applying the refresh parser from the TPP to some old pep.xml files, using an accordingly formatted database, the filtering seems to work well now. In addition to the "sp" prefix, the trick was to duplicate the old header and separate the two copies with the pipe symbol, e.g. ">Seq_0001" becomes ">sp|Seq_0001|Seq_0001" (just in case anyone else has a similar problem). In my opinion, it would be great if you could consider adding an option for such a conversion to the database module in a future release. I believe many people who work with custom databases would appreciate this.
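
The header rewrite described above can be sketched in a few lines of Python. This is an illustration only, not a philosopher feature: the function names to_uniprot_style and convert_fasta are hypothetical, and the sketch assumes every header consists of nothing but the identifier. How decoy entries such as "Reverse_..." should be laid out is not stated in this thread, so identifiers are copied unchanged and decoy headers may need separate handling.

```python
def to_uniprot_style(header):
    """Rewrite '>Seq_0001' as '>sp|Seq_0001|Seq_0001'.

    Assumes the header holds only the identifier (no description text).
    """
    if not header.startswith(">"):
        raise ValueError("not a FASTA header line: %r" % header)
    ident = header[1:].strip()
    if not ident:
        raise ValueError("empty FASTA header")
    if ident.startswith("sp|"):
        return ">" + ident  # already converted, leave untouched
    return ">sp|{0}|{0}".format(ident)


def convert_fasta(lines):
    """Yield FASTA lines with header lines rewritten and sequence lines unchanged."""
    for line in lines:
        line = line.rstrip("\r\n")
        yield to_uniprot_style(line) if line.startswith(">") else line


if __name__ == "__main__":
    example = [">Seq_0001\n", "MKVLAAGIVG\n", ">Seq_0002\n", "MTTQAPTFTQ\n"]
    print("\n".join(convert_fasta(example)))
```

Because already converted headers are passed through unchanged, running the conversion twice on the same file gives the same result as running it once.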

I also have one (hopefully) last question: Is it the intended behaviour of the filter and report that the "protein.tsv" contains information from the unfiltered prot.xml? For example, the columns with the protein group number, the Top Peptide Probability, Stripped Peptides, Total Peptide Ions, Unique Peptide Ions are directly taken from the prot.xml. This results in some cases in which e.g. the number of Stripped Peptides is larger than the number in Total Spectral Counts.

Best wishes, Juergen

anesvi commented 5 years ago

Felipe, Juergen

Yes, I agree it can be confusing. Protein group number and Top Peptide Probability are OK: the protein group number is useful for linking to the prot.xml information, and the top peptide probability is used for FDR filtering, so it is useful as well.

However, Stripped Peptides, Total Peptide Ions, and Unique Peptide Ions should be calculated from the filtered data, not taken from prot.xml.

I would suggest removing Stripped Peptides, Total Peptide Ions, and Unique Peptide Ions or, better, replacing them with Total Peptide Count, Unique Peptide Count, and Razor Peptide Count, which correspond to the existing spectral count columns:

Total Peptide Count

Unique Peptide Count

Razor Peptide Count

Total Spectral Count

Unique Spectral Count

Razor Spectral Count
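
To illustrate what computing these six columns from the filtered data could look like, here is a small Python sketch. It is not philosopher's implementation. The PSM record layout and the counting rules are assumptions made for the example: a peptide counts as "total" for every protein it maps to, as "unique" when it maps to exactly one protein, and as "razor" for the single protein chosen as its razor protein.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    """Hypothetical filtered PSM record used only for this sketch."""
    peptide: str      # stripped peptide sequence
    proteins: tuple   # every protein the peptide maps to
    razor: str        # protein chosen as the razor protein (assumed to be in `proteins`)


def protein_counts(psms):
    """Compute peptide and spectral counts per protein from filtered PSMs only."""
    peptides = defaultdict(lambda: {"Total": set(), "Unique": set(), "Razor": set()})
    spectra = defaultdict(lambda: {"Total": 0, "Unique": 0, "Razor": 0})
    for psm in psms:
        is_unique = len(psm.proteins) == 1
        for protein in psm.proteins:
            peptides[protein]["Total"].add(psm.peptide)
            spectra[protein]["Total"] += 1
            if is_unique:
                peptides[protein]["Unique"].add(psm.peptide)
                spectra[protein]["Unique"] += 1
        peptides[psm.razor]["Razor"].add(psm.peptide)
        spectra[psm.razor]["Razor"] += 1
    table = {}
    for protein in peptides:
        row = {}
        for kind in ("Total", "Unique", "Razor"):
            row[kind + " Peptide Count"] = len(peptides[protein][kind])
            row[kind + " Spectral Count"] = spectra[protein][kind]
        table[protein] = row
    return table


if __name__ == "__main__":
    filtered = [
        PSM("AAK", ("P1",), "P1"),
        PSM("AAK", ("P1",), "P1"),
        PSM("CCK", ("P1", "P2"), "P1"),
    ]
    for protein, row in protein_counts(filtered).items():
        print(protein, row)
```

Under these rules each peptide count is derived from the same PSMs as the matching spectral count, so a peptide count can never exceed its spectral count, which avoids the inconsistency raised in the question above.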

Best, Alexey


Golden-proteogenomics commented 5 years ago

Yes, this is a good idea for users. But it would be even better if you could give some examples of open-search results, especially how to find and confirm peptide modifications, so that users can process them easily.


prvst commented 5 years ago

@anesvi

I'll add this request to the to-do list for the next release.