billzt / MiFish

This is the command-line version of the MiFish pipeline. It can also be used with any other eDNA meta-barcoding primers.
https://mitofish.aori.u-tokyo.ac.jp/mifish/
GNU General Public License v3.0

How to Format DB for use with MitoFish #6

Closed: cement-head closed this issue 9 months ago

cement-head commented 9 months ago

I downloaded the entire database from the site: http://mitofish.aori.u-tokyo.ac.jp/species/detail/download/?filename=download%2F/complete_partial_mitogenomes.zip

Then I used this command:

$ makeblastdb -in mito-all.fa -dbtype nucl

Building a new DB, current time: 09/28/2023 16:06:29
New DB name:   /home/cbfgws6/MiFish/mifishdb/mito-all.fa
New DB title:  mito-all.fa
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 825365 sequences in 24.0024 seconds.

I then attempted to run the pipeline and got this error:

$ mifish -d /home/cbfgws6/MiFish/WTRBA_21-22-23/ seq /home/cbfgws6/MiFish/mifishdb/ -t 124 -o WTRBA_ALL
Error: /home/cbfgws6/MiFish/mifishdb/ does not seem to be a valid database for NCBI BLAST+

What am I doing wrong?

billzt commented 9 months ago

@cement-head

Hello.

There are two issues.

  1. The second parameter should be the name of the database, not the name of its directory. In this example, it should be /home/cbfgws6/MiFish/mifishdb/mito-all.fa, not /home/cbfgws6/MiFish/mifishdb/. The README was confusing on this point, so I have updated it.

  2. The database mito-all.fa is a collection of all fish mitochondrial sequences. However, the MiFish pipeline requires a database of amplicon sequences, so mito-all.fa is not suitable here.
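
As a quick sanity check for the first point, the sketch below tests whether a path names a BLAST database rather than a directory. It assumes the usual makeblastdb layout for nucleotide databases (an index file called <name>.nin, or an alias file <name>.nal for multi-volume databases). The function name looks_like_nucl_db is made up for illustration and is not part of MiFish or BLAST+.

```bash
#!/usr/bin/env bash

# Return 0 if PREFIX looks like the name of a nucleotide BLAST database.
# Assumption: makeblastdb wrote PREFIX.nin (single volume) or PREFIX.nal
# (alias file for a multi-volume database) next to the FASTA file.
looks_like_nucl_db() {
    local prefix=$1
    [ -f "$prefix.nin" ] || [ -f "$prefix.nal" ]
}

# Paths taken from the report above: the directory fails, the name may pass.
for db in /home/cbfgws6/MiFish/mifishdb/ /home/cbfgws6/MiFish/mifishdb/mito-all.fa; do
    if looks_like_nucl_db "$db"; then
        echo "looks like a database name: $db"
    else
        echo "no BLAST index files found for: $db" >&2
    fi
done
```

If BLAST+ is installed, running blastdbcmd -db <name> -info is a more authoritative check, since it asks BLAST itself to open the database.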

Here is how to resolve this:

  1. If your data is from MiFish amplicon sequencing, you can simply use ./test/mifishdbv3.83.fa from this repository (note that this file is outdated).
  2. If your data is from other eDNA primers, or you want the newest MiFish amplicon data, you can follow CRABS to build a reference database from mito-all.fa (steps 1 to 6, using MitoFish as the original source), then use the following awk command to convert it to FASTA format:
$ awk '{print ">gb|" $1 "|" $9 "\n" $10}' output.tsv >your.db.fa
$ makeblastdb -dbtype nucl -in your.db.fa
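
To make the conversion concrete, here is a minimal sketch that runs the same awk program on a single made-up row. The row is hypothetical (the accession XX000001 and its sequence are invented), and the 10-column layout is only what the awk command implies: column 1 is the sequence ID, column 9 the species name, column 10 the sequence. The wrapper name tsv_to_fasta is likewise just for illustration.

```bash
#!/usr/bin/env bash

# Wrap the awk command from the comment above so it can be reused.
# Assumption: column 1 = sequence ID, column 9 = species, column 10 = sequence.
tsv_to_fasta() {
    awk '{print ">gb|" $1 "|" $9 "\n" $10}' "$1"
}

# One hypothetical tab-separated row in that 10-column layout.
example=$(mktemp)
printf 'XX000001\t12345\tEukaryota\tChordata\tActinopteri\tCypriniformes\tCyprinidae\tDanio\tDanio_rerio\tACGTACGT\n' > "$example"

tsv_to_fasta "$example"
# prints:
# >gb|XX000001|Danio_rerio
# ACGTACGT

rm -f "$example"
```

Note that awk splits fields on any whitespace by default, so the command relies on no column containing spaces (species names written with underscores, for example). Passing -F'\t' to awk is a stricter variant if your table does contain spaces.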

cement-head commented 9 months ago

Okay, so just to clarify:

(1) Install CRABS
(2) Download mitofish DB using CRABS (Step 1.4)
(3) Download NCBI Taxonomy database (Step 1.5)
(4) Use db_import to import the mitofish db into CRABS (Step 2)
(5) To extract the amplicon sequences, should I use Step 4.1 or 4.2, or both?
(6) Assign TAXA (Step 5)
(7) Dereplicate the database.

Then, use the commands above to change the <.tsv> file to a <.fa> file, and make the database using makeblastdb.

Have I got that right?

billzt commented 9 months ago

Yes, overall that's right except for:

(4) is not necessary; it is meant for in-house generated or curated data.

(5) Both Step 4.1 and 4.2 are recommended.

cement-head commented 9 months ago

In case anyone else needs an updated MitoFish DB, here's one made October 1st, 2023. mitofish-db-October2023.tar.gz

cement-head commented 9 months ago

Well, that didn't work:

BLAST Database error: Error: Not a valid version 4 database.
Traceback (most recent call last):
  File "/home/cbfgws6/miniconda3/envs/MiFish/bin/mifish", line 33, in <module>
    sys.exit(load_entry_point('mifish', 'console_scripts', 'mifish')())
  File "/home/cbfgws6/MiFish/mifish/cmd/mifish.py", line 76, in main
    pipeline.runMiFish(data_dir=args.seq_dir, data_dir_other_groups=data_dir_other_groups, \
  File "/home/cbfgws6/MiFish/mifish/core/pipeline.py", line 247, in runMiFish
    for blast_record in NCBIXML.parse(handle):
  File "/home/cbfgws6/miniconda3/envs/MiFish/lib/python3.9/site-packages/Bio/Blast/NCBIXML.py", line 799, in parse
    raise ValueError("Your XML file was empty")
ValueError: Your XML file was empty

cement-head commented 9 months ago

Never mind: there are WAY too many versions of blastn on my machine, which caused a version conflict.
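
For anyone hitting the same message: "Not a valid version 4 database" is commonly seen when an older blastn tries to read a database built by a newer makeblastdb. A rough way to spot such a mix is to list every copy of each tool on PATH along with its reported version. The helper name list_on_path below is made up for this sketch; the only BLAST+ feature it relies on is the -version flag.

```bash
#!/usr/bin/env bash

# Print every executable called NAME found on PATH, in lookup order.
list_on_path() {
    local name=$1 dir dirs
    IFS=: read -r -a dirs <<< "$PATH"
    for dir in "${dirs[@]}"; do
        dir=${dir:-.}   # an empty PATH entry means the current directory
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
}

# Show each copy of the two tools with the first line of its version output.
for tool in makeblastdb blastn; do
    list_on_path "$tool" | while read -r exe; do
        printf '%s\t%s\n' "$exe" "$("$exe" -version 2>/dev/null | head -n 1)"
    done
done
```

If the first makeblastdb and the first blastn in the list report different versions, building and searching the database from within the same environment (for example the same conda environment) is the simplest way to rule out a mismatch.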

zhangjl-work commented 4 months ago

> In case anyone else needs an updated MitoFish DB, here's one made October 1st, 2023. mitofish-db-October2023.tar.gz

Where does the data included in the MitoFish database come from? A detailed description would be appreciated.