GDKO / AvP

Automatic evaluation of HGTs
GNU General Public License v3.0
18 stars 2 forks source link

something wrong when using diamond with nr database #11

Closed Kenlly369 closed 7 months ago

Kenlly369 commented 10 months ago

Thanks for your wonderful work on AvP. When I used diamond with Uniprot database, it went smoothly. But when I used diamond with nr database, it seems something goes wrong. Here are some of my codes: nohup diamond blastp -q ../Hrob.protein.fa -d /tools/diamond/nr_28July2023.dmnd --evalue 1e-5 --max-target-seqs 500 --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids --out nr_similarity.out -p 20 > Hrob.log 2>&1 & (the diamond result seems ok)

Then: nohup ../aux_scripts/calculate_ai.py -i Hrob_nr_similarity.out -x sample.groups.yaml & (The log file showed: [!] Skipped 469494 hits) (When I used Uniprot database, no hits were skipped)

Then I tried to use avp prepare: nohup ../avp prepare -a Hrob_nr_similarity.out_ai.out -o Hrob_prepare -f Hrob.protein.fa -b Hrob_nr_similarity.out -x sample.groups.yaml -c sample.config.yaml & The log file showed: [+] Setting up [!] Selected 1351 HGT candidates [+] Parsing hits file and grouping similar queries [!] Formed 980 groups [+] Extracting hits from DB [+] Writing fasta files [!] Skipped 109465 hits and 0 taxids. [+] Aligning fasta files ........ [!] Finished with 1351 HGT candidates in 980 groups (When I used Uniprot database, no hits were skipped)

Then I “cd Hrob_prepare/tmp” : -rw-rw-r-- 1 1416512 Sep 7 17:04 extract_id.txt -rw-rw-r-- 1 0 Sep 7 17:04 setblast.fa -rw-rw-r-- 1 162 Sep 7 17:04 setblast.log -rw-rw-r-- 1 0 Sep 7 17:04 setblast.perf -rw-rw-r-- 1 0 Sep 7 17:04 taxonomy_nexus.txt Then I “less setblast.log” : BLAST Database error: No alias or index file found for protein database [/tools/diamond/nr_28July2023.dmnd] in search path [/home/zhen/AvP/Hrob_nr_input.files::]

Then I tried: nohup ../avp detect -i ./Hrob_prepare/mafftgroups/ -o Hrob_detect -g ./Hrob_prepare/groups.tsv -t ./Hrob_prepare/tmp/taxonomy_nexus.txt -c sample.config.yaml & The file fasttree_general_results.txt showd: Proteins analyzed : 1351 Proteins with no Ingroup : 0 Proteins with only Ingroup : 0

Unknown Topology : 1351 No HGT support : 0 Complex topology : 0 Strong HGT support : 0 (seems really weird)

Here’s part of my config file:

DB path

blast_db_path: /tools/diamond/nr_28July2023.dmnd # blast: change to the local blast_db path fasta_path: /tools/diamond/nr.gz # diamond: change to the local fasta path for sp, ur90, or custom database mode: blast # use blast for blast database, use sp for swissprot database, ur90 for uniref90 or custom database data_type: AA # data type DNA, AA

I found from your published paper that you also tested with the nr database. May I ask how did you do that? Is it necessary to use blastp instead of diamond when using the nr database, as the “Documentation: github wiki” says?

I also extracted a small part of data from nr.gz to test with blastp but not diamond, the code is as follows: makeblastdb -in nr.test.fa -dbtype prot -parse_seqids -out nr.test blastp -query Hrob.proteins.fa -db nr.test -outfmt '6 std staxids' -seg no -evalue 1e-5 -out Hrob.similarity.out But there was a problem with the blastp results: column 13 staxids are all 0. So I would like to ask if it is necessary to "update_blastdb.pl --decompress nr" when building nr database, and should not download nr fasta file and manually makeblastbd.

Thanks!

Best,

kenlly

bshrestha0 commented 7 months ago

Hi Kenlly, Curious if you were able to resolve the issue by yourself? I am running into same problem too.

GDKO commented 7 months ago

Hi both,

The recommended route is to use blast with nr (see wiki). Although I am fairly certain there is a way to preformat nr and use it with diamond, at this moment this is very low priority. Feel free to try and update here if you manage to do so.

Cheers, Georgios

bshrestha0 commented 7 months ago

Hi Georgios,

Thanks for your response. Blastp run against nr database took forever to finish, especially when I have ~40,000 query sequences. I was able to use diamond blastp against nr database (much faster) but needed some additional steps to run without issues.

To run diamond blastp against nr database, I downloaded nr protein sequences https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz , and create diamond database using- diamond makedb --in nr.gz --taxonmap prot.accession2taxid.FULL.gz --db nr_diamond.dmnd --taxonnodes taxdump/nodes.dmp --taxonnames taxdump/names.dmp I combined the taxonmap downloaded from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz, and taxdump from your wiki to make this step work. Then ran diamond blastp using- diamond blastp -q infile.fa -d nr_diamond.dmnd --evalue 1e-5 --max-target-seqs 500 --threads 32 --out dblastp.out --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids

To set up the config.yaml file during the avp prepare step, I downloaded nr database using- update_blastdb.pl --decompress --blastdb_version 5 nr ./diamond prepdb -d nr, as described in diamond wiki. In the config.yaml, I set up blast_db_path: /path/to/database/Diamond/nr and mode: blast. With these changes, the avp prepare and detect steps ran successfully without any errors. I also compared the result against a separate run using uniref90 database. The results are comparable but as each database has its own taxonomical representation, there was discrepancy in number of HGT predicted as well.

This may not an efficient way to run diamond blast against nr database though.

Best regards, Bikash

GDKO commented 7 months ago

Thanks @bshrestha0, I have added a mention in the wiki!