GWang2022 / iGDP

An integrated Genome Decontamination Pipeline for wild ciliated microeukaryotes
GNU General Public License v3.0

mmseqs DB #1

Open stephen-14 opened 8 months ago

stephen-14 commented 8 months ago

Hello everyone, I'm trying to run the tool. I created the database with MMseqs2 from an NR.fasta file that I had downloaded earlier, but the homology search failed with the following output:

    $ iGDP_homology_search.pl -i assemly.fa.gz -o homology_search -d mmseqDB
    Input mmseqDB does not exist
    Input mmseqDB does not exist
    mv: cannot stat 'output*': No such file or directory
    mv: cannot stat 'tmp': No such file or directory
    awk: fatal: cannot open file `homology_search.split.1000bpbin.mmseqs.out' for reading: No such file or directory

After creating and indexing the database with mmseqs, I ended up with many files, and I am confused about which one is the correct one to use. Does anyone have experience with this, or know how to solve it? Thank you.

GWang2022 commented 8 months ago

Please first check that you have created the MMseqs2 database from NR.fasta, using a command like one of the following:

    mmseqs createdb NR.fasta mmseqDB

or

    mmseqs databases NR mmseqDB tmpDir

Then try passing the absolute path of mmseqDB to the -d parameter.

Hope it works. Thanks for opening the issue.
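
On the question of which file to pass: MMseqs2 treats all of the files that share the database name (mmseqDB, mmseqDB.dbtype, mmseqDB.index, mmseqDB_h, and so on) as one database, and the bare name given to createdb is what goes after -d. The following is a minimal shell sketch of a pre-flight check, not a definitive test. The function name check_mmseqs_db is hypothetical (it is not part of iGDP or MMseqs2), and the three files it looks for are an assumption about the layout createdb usually writes.

```shell
# Hypothetical helper (not part of iGDP or MMseqs2): resolve an MMseqs2
# database prefix to an absolute path and confirm the core files exist.
check_mmseqs_db() {
    local db="$1" dir base abs f
    # Resolve the directory part to an absolute path.
    dir=$(CDPATH= cd -- "$(dirname -- "$db")" 2>/dev/null && pwd) || {
        echo "directory of '$db' not found" >&2
        return 1
    }
    base=$(basename -- "$db")
    abs="$dir/$base"
    # Assumption: createdb wrote <prefix>, <prefix>.dbtype and <prefix>.index.
    # The prefix itself (not DB_h or DB.dbtype) is the value for -d.
    for f in "$abs" "$abs.dbtype" "$abs.index"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f" >&2
            return 1
        fi
    done
    # Print the absolute prefix so it can be handed to -d.
    printf '%s\n' "$abs"
}
```

For example, `db=$(check_mmseqs_db mmseqDB) && iGDP_homology_search.pl -i C.fasta -o homolog_search -d "$db"` would only start the search once the prefix resolves to existing files.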

stephen-14 commented 8 months ago

Hello, thanks for your suggestion. However, I ran into another problem during the homology search:

    iGDP_homology_search.pl -i C.fasta -o homolog_search -d /home/biolab_0/DB/mmseqDB
    names.dmp, nodes.dmp, merged.dmp from NCBI taxdump could not be found!
    mkdir: cannot create directory ‘mmseqs/’: File exists
    awk: fatal: cannot open file `homolog_search.split.1000bpbin.mmseqs.out' for reading: No such file or directory

It got stuck there.

  1. I used the database directly after creating it with MMseqs2, without indexing it. Does that work?
  2. There are many files, such as DB_h, DB.dbtype and so on, and I am not sure which one to give to the homology search. Which file is the correct one to use?

Thank you, and sorry for my silly questions!

GWang2022 commented 8 months ago

I think you did not correctly create a taxonomy database for your mmseqDB. Please see the MMseqs2 wiki at https://github.com/soedinglab/mmseqs2/wiki#downloading-databases for more details. Below I have copied some information from that page that may address your issue.

[Create a seqTaxDB from an existing BLAST database] It is easy to create a seqTaxDB from a pre-existing local BLAST database if BLAST+ is installed. The following example creates an MMseqs2 database from NCBI's nt database, but it also works with any of the other BLAST databases, including the nr protein database.

First, manually download the NCBI taxonomy database dump:

    wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
    mkdir taxonomy && tar -xxvf taxdump.tar.gz -C taxonomy

BLAST+'s blastdbcmd can be used to extract both the FASTA as well as the taxonomy mapping files:

    blastdbcmd -db nt -entry all > nt.fna
    blastdbcmd -db nt -entry all -outfmt "%a %T" > nt.fna.taxidmapping

Finally, the createdb and createtaxdb modules use the information to create a complete MMseqs2 database:

    mmseqs createdb nt.fna nt.fnaDB
    mmseqs createtaxdb nt.fnaDB tmp --ncbi-tax-dump taxonomy/ --tax-mapping-file nt.fna.taxidmapping
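
Before running createtaxdb, it may help to confirm that both of its inputs look sane, since the error reported earlier names names.dmp, nodes.dmp and merged.dmp. The sketch below is only an illustration: check_taxdump_dir and check_taxid_mapping are hypothetical helper names (not MMseqs2 commands), and the mapping check assumes the two-column "accession taxid" layout that the blastdbcmd -outfmt "%a %T" call above produces.

```shell
# Hypothetical pre-flight checks; the function names are illustrative only.

# Confirm the directory given to --ncbi-tax-dump holds the three files
# named in the "could not be found" error message.
check_taxdump_dir() {
    local dir="$1" f status=0
    for f in names.dmp nodes.dmp merged.dmp; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}

# Confirm every line of the mapping file has exactly two whitespace-separated
# fields and that the second one is a numeric taxon ID. An empty file fails.
check_taxid_mapping() {
    awk 'NF != 2 || $2 !~ /^[0-9]+$/ {
             printf "bad line %d: %s\n", NR, $0 > "/dev/stderr"
             bad = 1
         }
         END { exit (bad || NR == 0) }' "$1"
}
```

With the file names from the steps above, `check_taxdump_dir taxonomy && check_taxid_mapping nt.fna.taxidmapping` would return success only when both inputs pass.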

fanch1122 commented 8 months ago

I don't know how you solved this problem, but I ran into a similar one, and I still don't understand which build steps are correct. Because of network interruptions, my online creation of the NR database was interrupted at around 70% of the download. The strategy I adopted was to download nr.gz locally and then upload it to the server. However, I am still unsure how to use the taxonomy-related files and the NRDB_mapping files correctly. I'd love to know what a complete MMseqs2 NR database should look like.
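
Given the interrupted transfers mentioned here, one cheap check before building anything from nr.gz is to test the archive for truncation. This is a minimal sketch under that assumption. check_gzip_intact is a hypothetical helper name, and it relies only on the standard gzip -t integrity test.

```shell
# Hypothetical helper: succeeds only if the gzip archive is complete.
check_gzip_intact() {
    local file="$1"
    if [ ! -s "$file" ]; then
        echo "not found or empty: $file" >&2
        return 1
    fi
    # gzip -t reads the whole archive and verifies it without writing
    # any output, so a truncated download or upload fails here.
    if ! gzip -t -- "$file" 2>/dev/null; then
        echo "corrupt or truncated: $file" >&2
        return 1
    fi
    echo "ok: $file"
}
```

Running `check_gzip_intact nr.gz` on the server after the upload would show whether the copy arrived whole. It says nothing about the taxonomy files, which still have to be supplied separately.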

stephen-14 commented 8 months ago

Hi, I tried the [Create a seqTaxDB from an existing BLAST database] steps using the commands above. At first they did not run successfully, because I had not added -taxid_map prot.access2id when I created the BLAST database. I created it again, but unfortunately my RAM (256 GB) was not enough to hold the whole database. The NR database consists of many files, such as NR.index and so on (sorry, I can't remember them all), so I think you should create a dedicated folder for it. If your RAM is large enough, I highly recommend the command "mmseqs databases NR mmseqDB tmpDir", which is much easier. If you want to build your own database, you can follow the commands above, but remember to add -taxid_map to makeblastdb. Good luck!
