TAMU-CPT / CPT-ToolshedSource


Protein comparative tool #9

Open curtisim0 opened 3 weeks ago

curtisim0 commented 3 weeks ago

From Jason:

Is the tool very specific for only the NCBI DB format?

Problem: is organism reliably identified in the headers of proteins?

Retool for UniProt protein headers? These seem to always contain organism info in the header.

Headers from the CPT Galaxy databases (retrieved from the NCBI ftp repo??) work with the tool: they contain explicit organism names as the last field of the header, in [square brackets].

It is possible to download only phage protein datasets from NCBI:

Will try this out on usegalaxy.eu

This is a good tool for phages with little to no DNA identity, so it is still useful.

jasonjgill commented 3 weeks ago

Proteins downloaded from the ncbi-asn1/protein_fasta repo have organism names in [square brackets] at the end of the header. These appear to be unique for phage names e.g., [Mycobacterium phage Aminay], [Serratia phage Muldoon]. This could be a good bet for parsing out as an organism ID. FASTA headers from the protein_fasta repo have this form:

protein_accession protein name [organism name] MXXXXXXXXX

AIX32998.1 hypothetical protein Syn7803US50_14 [Synechococcus phage ACG-2014f]
MAEQNWERRILQSFGANFRLDVSNPQKTVGGEDVYNFYSVTDEEKVCLMGQQQDGLWRLYNDDKVEIVGG
AKVVEDGVCVTIVGKNGDVVINADNNGRVRIRGQNINLQADEDVNITAGRNVNIKSGSGRTLLAGNTLEK
DALKGNLLDPEKQWAWRVFEGTGLPAGMFPQLMSPFSGITDLAGSIVGGVGFGDAISGAVSSAVSGAVSG

WPJ71242.1 RNA polymerase sigma factor for late transcription [Escherichia phage vB-Eco-KMB39]
MSETKPKYNYVNNKELLQAIIDWKTELANNKDPNKVVRQNDTIGLAIMLIAEGLSKRFNFSGYTQSWKQE
MIADGIEASIKGLHNFDETKYKNPHAYITQACFNAFVQRIKKERKEVAKKYSYFVHNVYDSRDDDMVALV
DETFIQDIYDKMTHYEESTYRTPGAEKKSVVDDSPSLDFLYEAND
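To make the header convention concrete, here is a minimal Python sketch of pulling the accession and the trailing [organism name] out of a header of the form shown above. The function names `organism_from_header` and `accession_from_header` are hypothetical and are not taken from the existing tool; the sketch assumes the organism is always the last bracketed field, which is what the examples above suggest but has not been verified across the whole database.

```python
import re

# Assumption: the organism is the LAST [bracketed] field of the header.
# Earlier bracketed text in the protein name is ignored.
_ORGANISM_RE = re.compile(r"\[([^\[\]]+)\]\s*$")


def organism_from_header(header):
    """Return the text inside the final [square brackets], or None."""
    match = _ORGANISM_RE.search(header.strip())
    return match.group(1) if match else None


def accession_from_header(header):
    """Return the first whitespace-delimited token, without a leading '>'."""
    tokens = header.strip().lstrip(">").split(None, 1)
    return tokens[0] if tokens else None


header = "AIX32998.1 hypothetical protein Syn7803US50_14 [Synechococcus phage ACG-2014f]"
print(accession_from_header(header), "|", organism_from_header(header))
```

Headers with no bracketed field return `None` for the organism, so the caller can decide whether to skip those proteins or report them.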

jasonjgill commented 3 weeks ago

Also, the ability to read BLAST XML output doesn't work; I think Anthony wrote it to only recognize XML2, which only our own Galaxy was set to make. If the tool could parse data out of XML output, that might make it more portable. Here is 1 hit; the [organism] would show up in the field. The tool needs to count each protein hit only once (i.e., if your phage protein query hit a subject protein in 3 HSPs, that would count as only 1 hit).

/XML

1 Query_1 99974bdd-2c83-421a-b7a3-dab7d44153ab (mRNA) 270 residues [Milagro:3193-4021 + strand] [peptide] name=Milagro.orf00003-00001-00001 270 1 gnl|BL_ORD_ID|1190084 UNY41722.1 capsid scaffolding protein [Burkholderia phage Milagro] 1190084 270 1 551.206 1419 0 1 270 1 270 0 0 270 270 0 270 MATNKTKFFRVAVEGATVDGREIKREWLTQMAKNYNRELYGARLNIEHLKGWAPLSATNPFGAYGDVIALKASEIEDGPLKGKMGLYAQLDPTDELVALSKKRQKVFTSIEVNPDFADIGEAYLVGLAATDDPASLGTEALQFAARRSNNLFSAACETSIEFEGEPESTSLLSIVKGMFARNRSTDDQRDADVRHAVEEIAGFASQQGRDVAALRVDLTAAQQDAAAAKKRADEAVAAVEALTAKLSATDNGAPRRQPSTGSTGELVTDC MATNKTKFFRVAVEGATVDGREIKREWLTQMAKNYNRELYGARLNIEHLKGWAPLSATNPFGAYGDVIALKASEIEDGPLKGKMGLYAQLDPTDELVALSKKRQKVFTSIEVNPDFADIGEAYLVGLAATDDPASLGTEALQFAARRSNNLFSAACETSIEFEGEPESTSLLSIVKGMFARNRSTDDQRDADVRHAVEEIAGFASQQGRDVAALRVDLTAAQQDAAAAKKRADEAVAAVEALTAKLSATDNGAPRRQPSTGSTGELVTDC MATNKTKFFRVAVEGATVDGREIKREWLTQMAKNYNRELYGARLNIEHLKGWAPLSATNPFGAYGDVIALKASEIEDGPLKGKMGLYAQLDPTDELVALSKKRQKVFTSIEVNPDFADIGEAYLVGLAATDDPASLGTEALQFAARRSNNLFSAACETSIEFEGEPESTSLLSIVKGMFARNRSTDDQRDADVRHAVEEIAGFASQQGRDVAALRVDLTAAQQDAAAAKKRADEAVAAVEALTAKLSATDNGAPRRQPSTGSTGELVTDC
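As a rough illustration of the counting rule (one count per query/subject pair, however many HSPs the hit has), here is a Python sketch that tallies hits per organism from standard BLAST XML. It assumes the usual element names of that format (`Iteration`, `Iteration_query-def`, `Hit`, `Hit_id`, `Hit_def`) and that the organism is the last [bracketed] field of the hit definition. The function name `count_hits_per_organism` is hypothetical; this is not how the current tool is implemented.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: the organism is the last [bracketed] field of the hit definition.
ORGANISM_RE = re.compile(r"\[([^\[\]]+)\]\s*$")


def count_hits_per_organism(xml_text):
    """Count subject proteins hit per organism, one count per query/subject pair."""
    root = ET.fromstring(xml_text)
    seen = set()          # (query, subject) pairs already counted
    counts = Counter()
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def", default="")
        for hit in iteration.iter("Hit"):
            hit_id = hit.findtext("Hit_id", default="")
            hit_def = hit.findtext("Hit_def", default="")
            key = (query, hit_id)
            if key in seen:
                continue  # same subject already counted for this query
            seen.add(key)
            # HSPs under the hit are deliberately not iterated, so a hit
            # with 3 HSPs still contributes a single count.
            match = ORGANISM_RE.search(hit_def)
            if match:
                counts[match.group(1)] += 1
    return counts
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`. Hit definitions that merge several identical proteins would need extra handling, since only the last bracketed organism is read here.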
jasonjgill commented 2 days ago

Galaxy17-[Top_BlastP_Hits].txt

jasonjgill commented 2 days ago

Galaxy16-[Galaxy2-[BLASTp_all_phages_comparison].tabular].txt

jasonjgill commented 2 days ago

example blastp output which would be input for the tool:

Galaxy6-[blastp_Peptide_sequences_from_Apollo_vs_protein_BLAST_database_from_data_3].txt