Joseph7e / Assign-Taxonomy-with-BLAST

Assign taxonomy with blast, can be used for qiime
21 stars 8 forks source link

issue with grabbing taxonomy for blast hits using blast input file #6

Open maottino opened 1 year ago

maottino commented 1 year ago

Hello! I'm extremely new to bioinformatics/python/github, so I apologize if this is a super easy fix! Thank you in advance for any help/input.

I am trying assign taxonomy to trnL (plant chloroplast) OTUs with blast results already run and output in the required format (outfmt '6 qseqid qlen sseqid pident length qstart qend sstart send evalue bitscore staxids'). I'll provide a file example of what my blast output looks like below in case that is the issue, but it's a tab delimited file with 10 top hits per OTU.

The code I use when running is: python3 taxonomy_assignment_BLAST.py TLotus.fa ./ncbi_taxonomy/expanded_ncbi_taxonomy.tsv --blast_database IGNORE --blast_file TLotus_taxhits.txt

where TLotus.fa is my file with my OTU sequences (not using since I already have a blast file) expanded_ncbi_taxonomy.tsv is taxonomy file, built as instructed (preview of what file looks like attached) TLotus_taxhits.txt is my blast_output with custom formatting

The command runs without error, but the returned taxonomy is the default for when there is no match in the taxonomy file. For example, this is what every OTU looks like this for every OTU:

BLAST LINE : Otu4 94 NC_047481.1 100 94 1 94 52771 52864 1.52E-39 174 94 354624

BLAST LINE : Otu4 94 MN308055.1 100 94 1 94 52771 52864 1.52E-39 174 94 354624

BLAST LINE : Otu4 94 MK105463.1 100 94 1 94 52769 52862 1.52E-39 174 94 3512

ASSIGNING TAXONOMY FOR Otu4 total hits passing initial filters = 3 NC_047481.1 100 --> CAPTURED after percent sway filter MN308055.1 100 --> CAPTURED after percent sway filter MK105463.1 100 --> CAPTURED after percent sway filter

Providing consensus taxonomy up to level 14 : tmp6

X100_1 superkingdom;subkingdom;sub_subkingdom;kingdom;tmp1;tmp2;phylum;class;family;genus;species;tmp3;tmp4;tmp5;tmp6 X100_2 superkingdom;subkingdom;sub_subkingdom;kingdom;tmp1;tmp2;phylum;class;family;genus;species;tmp3;tmp4;tmp5;tmp6 X100_3 superkingdom;subkingdom;sub_subkingdom;kingdom;tmp1;tmp2;phylum;class;family;genus;species;tmp3;tmp4;tmp5;tmp6

Taxonomy Assignment for Otu4 = superkingdom:subkingdom:sub_subkingdom:kingdom:tmp1:tmp2:phylum:class:family:genus:species:tmp3:tmp4:tmp5:tmp6

It looks like it's reading my blast file correctly, as it's pulling the accession number and percent ID correctly. When I search the taxonomy file for the taxids in the blast file, they are present with normal taxonomy information.

Is this some sort of basic formatting error? I've tried looking over the python code but have little to no knowledge of python coding and cannot find the issue myself!

ncbi_expanded_taxonomy_EXAMPLE.txt

TLotus_taxhits_EXAMPLE.txt

kStarr24 commented 1 year ago

Hi @Joseph7e, I am the exact same issue. I tried both giving the script blast results in the required format and allowing the script to perform the blastn command. Both ways I encountered the generic output with no taxonomy information (exactly as shown above). Not sure what's going on here but it seems it must be something related to the expanded_ncbi_taxonomy.tsv file? I created the expanded_ncbi_taxonomy.tsv following the instructions in the README file.

I would be very grateful for any advice on how to fix this.

kStarr24 commented 1 year ago

Hi @maottino, I've been digging into this, and after lots of trouble shooting I have the program working well! There are a couple things to check that might fix your issue.

As of today, the readme file for this program states that if you bring in your own blast file it should have the following format '6 qseqid qlen sseqid pident length qstart qend sstart send evalue bitscore'. However, the taxonomy_assignment_BLAST.py uses the last column of the blast table as the taxonomy id to look up in the taxonomy database. Ensure the the blast output's final column is staxids. This is the blast output format I used: "6 qseqid qlen sseqid pident length qstart qend sstart send evalue staxids". However, your post indicates that you do have staxids as the final column. I'm wondering if there was a change in the readme file between now and when you were attempting to use it?

Here is what is more likely your problem. If you look at the readme file in the usage section you can find that there is a flag called --ncbi_nt. When you include it it gets set to 'True'. You have to include this flag in the function call if you are using the ncbi database because it sends the function into an if statement that assigns the correct indexing for the NCBI taxonomy levels.

You can check to see if this is your problem by looking in the log_file.txt output. Near the beginning there is a section to check to make sure the taxonomy indexing is correct. If it is incorrect you will see something like this:

Taxonomy Categories

superkingdom:subkingdom:sub_subkingdom:kingdom:tmp1:tmp2:phylum:class:family:genus:species:tmp3:tmp4>#The below levels should match the target, i.e. species_level = species Species_level = 14 = tmp6 Family_level = 10 = species Phylum_level = 7 = class

And if it is correct you will see this:

Taxonomy Categories

superkingdom:kingdom:phylum:subphylum:superclass:class:subclass:superorder:order:superfamily:family:>#The below levels should match the target, i.e. species_level = species Species_level = 13 = species Family_level = 10 = family Phylum_level = 2 = phylum

Here is the function call that worked for me:

python3 Assign-Taxonomy-with-BLAST-master/taxonomy_assignment_BLAST.py ASVstest.fa /scratch/ncbi_taxonomy/expanded_ncbi_taxonomy.tsv --ncbi_nt --blast_file=blast_table.txt