Open curtisim0 opened 5 months ago
Proteins downloaded from the ncbi-asn1/protein_fasta repo have organism names in [square brackets] at the end of the header. These appear to be unique for phage names, e.g., [Mycobacterium phage Aminay], [Serratia phage Muldoon], so the bracketed field could be a good bet for parsing out as an organism ID. FASTA headers from the protein_fasta repo have this form:
protein_accession protein name [organism name]
MXXXXXXXXX
AIX32998.1 hypothetical protein Syn7803US50_14 [Synechococcus phage ACG-2014f]
MAEQNWERRILQSFGANFRLDVSNPQKTVGGEDVYNFYSVTDEEKVCLMGQQQDGLWRLYNDDKVEIVGG
AKVVEDGVCVTIVGKNGDVVINADNNGRVRIRGQNINLQADEDVNITAGRNVNIKSGSGRTLLAGNTLEK
DALKGNLLDPEKQWAWRVFEGTGLPAGMFPQLMSPFSGITDLAGSIVGGVGFGDAISGAVSSAVSGAVSG
WPJ71242.1 RNA polymerase sigma factor for late transcription [Escherichia phage vB-Eco-KMB39]
MSETKPKYNYVNNKELLQAIIDWKTELANNKDPNKVVRQNDTIGLAIMLIAEGLSKRFNFSGYTQSWKQE
MIADGIEASIKGLHNFDETKYKNPHAYITQACFNAFVQRIKKERKEVAKKYSYFVHNVYDSRDDDMVALV
DETFIQDIYDKMTHYEESTYRTPGAEKKSVVDDSPSLDFLYEAND
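A minimal sketch of the header parsing suggested above, assuming the organism is always the last [square-bracketed] field of the header line. The helper name `organism_from_header` is hypothetical and not part of any existing tool.

```python
import re

# Matches the final [bracketed] field at the end of a header line.
_ORGANISM_RE = re.compile(r"\[([^\[\]]+)\]\s*$")


def organism_from_header(header):
    """Return the trailing [organism name] of a FASTA header, or None.

    Assumption: the organism is the last bracketed field on the line.
    """
    match = _ORGANISM_RE.search(header.strip())
    return match.group(1) if match else None


print(organism_from_header(
    "AIX32998.1 hypothetical protein Syn7803US50_14 [Synechococcus phage ACG-2014f]"
))
```

Headers with nested brackets (an organism name that itself contains brackets) would return None here and would need extra handling.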
Also, reading BLASTXML output doesn't work. I think Anthony wrote it to recognize only XML2, which only our own Galaxy was set up to produce. If the tool could parse data out of XML output, that might make it more portable. Here is 1 hit; the [organism] would show up in the
/XML
example blastp output which would be input for the tool:
Galaxy6-[blastp_Peptide_sequences_from_Apollo_vs_protein_BLAST_database_from_data_3].txt
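On the XML portability point, a rough sketch of pulling bracketed organism names out of BLAST XML. It assumes the standard BLAST+ XML layout (`-outfmt 5`), where each hit's description sits in a `Hit_def` element. The XML2 variant uses different element names and is not handled. `organisms_per_query` is a hypothetical name, not a function from the existing tool.

```python
import re
import xml.etree.ElementTree as ET

_TRAILING_BRACKET = re.compile(r"\[([^\[\]]+)\]\s*$")


def organisms_per_query(xml_text):
    """Map each query definition to the set of organisms named in its hits."""
    root = ET.fromstring(xml_text)
    result = {}
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def", default="")
        names = result.setdefault(query, set())
        for hit in iteration.iter("Hit"):
            hit_def = hit.findtext("Hit_def", default="")
            # Assumption: merged identical-protein entries are joined
            # with " >", each ending in its own [organism] field.
            for entry in hit_def.split(" >"):
                match = _TRAILING_BRACKET.search(entry)
                if match:
                    names.add(match.group(1))
    return result
```

Hits whose description has no trailing bracketed field are skipped rather than guessed at.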
@jasonjgill I have made a new tool with the following output from your above dataset:
❯ python protein_blast_grouping.py test-data/blast-input.txt --hits 20
# Top 20 Hits
# Name                        Unique Query Matches  Unique Subject Hits
Burkholderia phage Milagro    47                    48
Burkholderia phage Momento    41                    45
Burkholderia phage Musica     39                    42
Burkholderia phage Menos      39                    40
Burkholderia phage KL3        38                    39
Burkholderia phage PhiBP82.2  35                    35
Burkholderia phage PhiBP82.3  34                    34
Burkholderia phage phiE202    34                    34
Burkholderia phage phiE094    34                    34
Burkholderia phage phiX216    33                    33
Burkholderia phage phiE52237  33                    33
Burkholderia phage AP3        33                    33
Burkholderia phage Carl1      33                    34
Burkholderia phage Mana       33                    34
Burkholderia phage vB_HM387   32                    32
Burkholderia phage BEK        31                    31
Burkholderia phage KS5        31                    32
Ralstonia phage RsoM1USA      28                    28
Ralstonia phage RSA1          28                    29
Burkholderia phage PK23       26                    27
I reduced the complexity of the existing tool to more or less just the "group the hits by name" requirement.
When you confirm this looks "correct", I will move on and finish the wrapping.
Q: What is it doing, and how? "Unique Query Matches" tells you how many of your query proteins had at least one match in each organism. "Unique Subject Hits" tells you how many unique proteins from each organism were matched by any of your queries.
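The two counts described above can be sketched as set sizes per organism. This is an illustration only, not the code in protein_blast_grouping.py. The input layout of (query_id, subject_id, organism) tuples and the name `group_hits` are assumptions, and ranking by the first count is inferred from the listing.

```python
from collections import defaultdict


def group_hits(hits):
    """Group BLAST hits by organism name.

    hits: iterable of (query_id, subject_id, organism) tuples (assumed layout).
    Returns (organism, unique_query_matches, unique_subject_hits) rows,
    ranked by unique query matches, highest first.
    """
    queries = defaultdict(set)   # organism -> query proteins with a hit
    subjects = defaultdict(set)  # organism -> subject proteins that were hit
    for query_id, subject_id, organism in hits:
        queries[organism].add(query_id)
        subjects[organism].add(subject_id)
    rows = [(org, len(queries[org]), len(subjects[org])) for org in queries]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows
```

Because both counts come from sets, a query that hits the same organism many times is counted once, and two queries hitting the same subject protein add only one subject hit.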
Is the input for this just the protein FASTA file? I can gin up a test organism with a known output to validate it
No, it is the file above (Galaxy6)
Yeah, let me know
Hey Curtis, that output looks correct based on the inputs I used. Can you wrap it? Also, was working with XML (XML1) ever an option? If not, that option should be removed from the wrapper.
It is wrapped 👍 (part of this work)
And I will clean it up
From Jason:
Is the tool very specific for only the NCBI DB format?
Problem: is the organism reliably identified in the protein headers?
Retool for UniProt protein headers? These seem to always contain organism info in the header.
Headers from the CPT Galaxy databases (retrieved from the NCBI ftp repo??) work with the tool. They contain explicit organism names as the last field in the header, in [square brackets].
It is possible to download only phage protein datasets from NCBI:
Will try this out on usegalaxy.eu
This is still a useful tool for phages with little to no DNA identity.
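On the UniProt question in the notes above: UniProt FASTA headers carry the organism in an `OS=` field followed by other two-letter `XX=` fields (such as `OX=`), so a retooled parser could look roughly like this. The function name `organism_from_uniprot_header` is hypothetical, and this is a sketch rather than a tested replacement for the NCBI bracket parsing.

```python
import re

# Capture text after "OS=" up to the next two-letter "XX=" field or line end.
_UNIPROT_OS_RE = re.compile(r"\bOS=(.+?)(?=\s+[A-Z]{2}=|\s*$)")


def organism_from_uniprot_header(header):
    """Return the OS= organism field of a UniProt FASTA header, or None."""
    match = _UNIPROT_OS_RE.search(header)
    return match.group(1) if match else None
```

A tool supporting both sources could try the bracketed NCBI field first and fall back to `OS=` when no brackets are found.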