compomics / compomics-utilities

Open source Java library for computational proteomics
http://compomics.github.io/projects/compomics-utilities.html

FASTA load error in PeptideMapping #30

Closed: vnaum closed this issue 5 years ago

vnaum commented 5 years ago

This FASTA file fails to load, even though each of the two sequences loads fine on its own:

>prf||1012271B
EAVVTQESALTTSPGGTVILTCRSSTGAVTTSNYANWVQEKPDHLFTGLIGGTSNRAPGVPVRFSGSLIG
DKAALTITGAQTEDDAMYFCALWYSTHFVFGGGTKVTVLG

>prf||1012271A
EAVVTQESALTTSPGGTVILTCRSSTGAVTTSNYANWVQEKPDHLFTGLIGGTSNRAPGVPVRFSGSLIG
DKAALTITGAQTEDDAMYFCALWYSTHFIFGSGTKVTVLG

This is how I run it:

java -cp /usr/local/utilities-4.11.17/utilities-4.11.17.jar com.compomics.util.experiment.identification.protein_inference.executable.PeptideMapping -p filtered_proteins.fasta peptides_132273.mzML.csv clstr_132273.mzML.csv

Tested with 4.12.6 as well, same result. The peptide list can be arbitrary; apparently the tool never gets as far as loading it.

vnaum commented 5 years ago

Another example that triggers the issue:

>pir||JC7976
MPFAAVDIQDDCGSPDVPQANPKRSKEEEEDRGDKNDHVKKRKKAKKDYQPNYFLSIPITNKKITTGIKV
LQNSILQQDKRLTKAMVGDGSFHITLLVMQLLNEDEVNIGTDALLELKPFVEEILEGKHLALPFQGIGTF
QGQVGFVKLADGDHVSALLEIAETAKRTFREKGILAGESRTFKPHLTFMKLSKAPMLRKKGVRKIEPGLY
EQFIDHRFGEELLYQIDLCSMLKKKQSNGYYHCESSIVIGEKDRREPEDAELVRLSKRLVENAVLKAAQQ
YLEETQNKKQPGEGNSTKAEEGDRNGDGSDNNRK

>pir||JC7975
MGGKWSKSSVVGWPTVRERMRRAEPAADGVGAASRDLEKHGAITSSNTAATTNAACAWLEQEEEEVGFPV
TPQVPLRPMTYKAAVDLSHFLKEKGGLEGLIHSQRRQDILDLWIYHTQGYFPDWQNYTPGPGVRYPLTFG
WCYKLVPVEPDKVEEANKGENTSLLHPVSLHGMDDPEREVEWRFDSRLAFHHVARELHPEYFKNC

hbarsnes commented 5 years ago

I'd recommend taking a look at our Database Help wiki (https://github.com/compomics/searchgui/wiki/DatabaseHelp), especially the Non Standard FASTA section: https://github.com/compomics/searchgui/wiki/DatabaseHelp#non-standard-fasta.

I assume that we are not able to extract the accession number due to the non-standard format, and thus end up with two proteins with the same accession number. In that case, a simple search-and-replace to obtain the suggested generic format should solve the problem.
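To make the suspected failure mode concrete, here is a minimal sketch of how two distinct headers could collapse to the same accession. It is an illustration only: `NaiveAccessionDemo` and its rule "the accession is the second `|`-separated field" are assumptions made up for this example, not the parsing code used in compomics-utilities.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: NOT the parser used by compomics-utilities.
class NaiveAccessionDemo {

    // Assumed naive rule: the accession is the second '|'-separated field.
    // If the header has no '|' at all, the whole header is used instead.
    static String naiveAccession(String headerLine) {
        String header = headerLine.startsWith(">") ? headerLine.substring(1) : headerLine;
        String[] fields = header.split("\\|");
        return fields.length > 1 ? fields[1] : header;
    }

    // Groups header lines by the accession that the naive rule extracts.
    static Map<String, List<String>> groupByAccession(List<String> headers) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String h : headers) {
            groups.computeIfAbsent(naiveAccession(h), k -> new ArrayList<>()).add(h);
        }
        return groups;
    }

    public static void main(String[] args) {
        // The two headers from the report above.
        List<String> headers = Arrays.asList(">prf||1012271B", ">prf||1012271A");
        // Any group with more than one header is an accession collision.
        groupByAccession(headers).forEach((acc, hs) ->
                System.out.println("accession=\"" + acc + "\" -> " + hs));
    }
}
```

Under this assumed rule, the empty field between the two pipes in `prf||` is taken as the accession for both entries, so they collide when loaded together, while either entry loaded alone would not. That would be consistent with the behaviour reported above.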

hbarsnes commented 5 years ago

I think he means to replace pir|| with generic|. But if we simply remove pir|| or prf|| does it still fail?

Both should work. In the latter case, the whole header is used as the accession number, so the result should be the same.
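The search-and-replace discussed above can be sketched in a few lines of Java. This is a rough sketch, not part of compomics-utilities: the class `FastaHeaderRewrite` and its methods are hypothetical names, and it simply swaps a leading `pir||` or `prf||` for `generic|` as suggested in this thread. The exact layout of the generic header should be checked against the Database Help wiki.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of compomics-utilities.
class FastaHeaderRewrite {

    // Replaces a leading ">pir||" or ">prf||" with ">generic|".
    // Sequence lines and other headers are returned unchanged.
    static String rewriteHeader(String line) {
        return line.replaceFirst("^>(pir|prf)\\|\\|", ">generic|");
    }

    // Applies the rewrite to every line of a FASTA file held in memory.
    static List<String> rewriteAll(List<String> lines) {
        List<String> out = new ArrayList<>(lines.size());
        for (String line : lines) {
            out.add(rewriteHeader(line));
        }
        return out;
    }

    public static void main(String[] args) {
        // A shortened stand-in for the FASTA content from the report.
        List<String> fasta = Arrays.asList(
                ">pir||JC7976",
                "MPFAAVDIQDDCGSPDVPQANPKRSKEEEEDRGDKNDHVKKRKKAKKDYQPNYFLSIPITNKKITTGIKV",
                ">pir||JC7975",
                "MGGKWSKSSVVGWPTVRERMRRAEPAADGVGAASRDLEKHGAITSSNTAATTNAACAWLEQEEEEVGFPV");
        rewriteAll(fasta).forEach(System.out::println);
    }
}
```

For a real file, the lines could be read with `java.nio.file.Files.readAllLines` and written back with `Files.write`. A one-off replacement in a text editor achieves the same thing.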

hbarsnes commented 5 years ago

Issue assumed resolved. If this is not the case, please let us know and we'll reopen the issue.