apcamargo / ictv-mmseqs2-protein-database

Database virus_tax_db needs header information #7

Open ShailNair opened 6 months ago

Hi,

I followed the provided instructions and created the MMseqs2 database using VMR_MSL38_v3 from ICTV. The whole process completed without any errors. However, when I run the taxonomy assignment command, I get the following error:

$ mmseqs easy-taxonomy final.vcontigs.fixed.faa virus_tax_db ictv tmp \
> -e 1e-5 -s 6 --blacklist "" --tax-lineage 1 --threads 30

MMseqs Version:                         13.45111
ORF filter                              0
ORF filter e-value                      100
ORF filter sensitivity                  2
LCA mode                                3
Majority threshold                      0.5
Vote mode                               1
LCA ranks
................
Database virus_tax_db needs header information
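
One thing that may be worth ruling out is whether the header files of the database are present on disk. The sketch below is only a guess at a possible cause, not a confirmed diagnosis. It assumes that MMseqs2 keeps sequence headers in sibling files named after the database prefix plus an `_h` suffix; the helper name `missing_header_files` is hypothetical.

```python
from pathlib import Path

# Assumption: header data lives next to the database as <prefix>_h,
# <prefix>_h.index and <prefix>_h.dbtype.
HEADER_SUFFIXES = ["_h", "_h.index", "_h.dbtype"]


def missing_header_files(db_prefix):
    """Return the names of the expected header files that do not exist."""
    missing = []
    for suffix in HEADER_SUFFIXES:
        path = Path(str(db_prefix) + suffix)
        if not path.is_file():
            missing.append(path.name)
    return missing
```

Calling `missing_header_files("virus_tax_db")` from the directory that holds the database returns an empty list when all three files are found. A non-empty list would suggest the database was moved or copied without its header files.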

My mapping file looks like this:

$ head -n10 nr.virus.accession2taxid.ictv
102L_A  965
103L_A  965
104L_A  965
107L_A  965
108L_A  965
109L_A  965
110L_A  965
111L_A  965
112L_A  965
113L_A  965

The taxdump directory contains:

$ cd ictv-taxdump
$ ls -l -a | grep "^-" | awk '{print $9, $5}'
delnodes.dmp 0
merged.dmp 0
names.dmp 743021
nodes.dmp 942866

Note that delnodes.dmp and merged.dmp are empty. The contents of names.dmp and nodes.dmp are:

$ head -n10 names.dmp
1       |       root    |               |       scientific name |
2       |       Hoswirudivirus MRV1     |               |       scientific name |
3       |       Shomudavirus limadaptatum       |               |       scientific name |
4       |       Moovirus moo    |               |       scientific name |
5       |       Sclerotimonavirus betaclarireediae      |               |       scientific name |
6       |       Potato virus H  |               |       scientific name |
7       |       Rhopapillomavirus 1     |               |       scientific name |
8       |       Monomorium pharaonis virus 1    |               |       scientific name |
9       |       Aquamavirus A   |               |       scientific name |
10      |       Orthorubulavirus hominis        |               |       scientific name |

$ head -n10 nodes.dmp
1       |       1       |       no rank |               |       8       |       0       |       1       |       0       |       0       |       0       |       0       |       0       |               |
2       |       10641   |       species |       XX      |       0       |       1       |       11      |       1       |       0       |       1       |       1       |       0       |               |
3       |       3162    |       species |       XX      |       0       |       1       |       11      |       1       |       0       |       1       |       1       |       0       |               |
4       |       591     |       species |       XX      |       0       |       1       |       11      |       1       |       0       |       1       |       1       |       0       |               |
5       |       13564   |       species |       XX      |       0       |       1       |       11      |       1       |       0       |       1       |       1       |       0       |               |
6       |       2366    |       species |       XX      |       0       |       1       |       11      |       1       |       0       |       1       |       1       |       0       |               |
7       |       11378   |       species |       XX      |       0       |       1       |       11      |       1       |       0       |       1       |       1       |       0       |               |
8       |       12606   |       species |       XX      |       0       |       1       |       11      |       1       |       0       |       1       |       1       |       0       |               |
9       |       7806    |       species |       XX      |       0       |       1       |       11      |       1       |       0       |       1       |       1       |       0       |               |
10      |       7615    |       species |       XX      |       0       |       1       |       11      |       1       |       0       |       1       |       1       |       0       |               |
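
A consistency check between the mapping file and nodes.dmp can be sketched as follows. It assumes that the mapping file has two whitespace-separated columns (accession, taxid) and that nodes.dmp uses `|` as the field separator, as in the excerpts above. The function names are hypothetical, and this checks only the input files, not the MMseqs2 database itself.

```python
from pathlib import Path


def read_nodes(path):
    """Map each taxid in nodes.dmp to a (parent_taxid, rank) tuple."""
    nodes = {}
    for line in Path(path).read_text().splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Skip blank or wrapped lines that do not start with two numeric IDs.
        if len(fields) < 3 or not fields[0].isdigit() or not fields[1].isdigit():
            continue
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def read_mapping(path):
    """Map each accession to its taxid (two whitespace-separated columns)."""
    mapping = {}
    for line in Path(path).read_text().splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            mapping[parts[0]] = int(parts[1])
    return mapping


def unresolved_taxids(mapping, nodes):
    """Return the sorted taxids used in the mapping but absent from nodes."""
    return sorted({taxid for taxid in mapping.values() if taxid not in nodes})
```

For the files shown here, `unresolved_taxids(read_mapping("nr.virus.accession2taxid.ictv"), read_nodes("ictv-taxdump/nodes.dmp"))` lists any taxid that the mapping references but the taxdump does not define.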

Also, can the following information be extracted to a TSV/CSV file, listing each protein_id from nr.virus.faa.gz together with its corresponding ICTV accession?

Protein_id | Realm | Subrealm | Kingdom | Subkingdom | Phylum | Subphylum | Class | Subclass | Order | Suborder | Family | Subfamily | Genus | Subgenus | Species
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
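
A table with these columns could be built from the mapping file and the taxdump alone, roughly as sketched below. This is a minimal sketch rather than a feature of the repository. It assumes that the rank strings in nodes.dmp match the column names above case-insensitively and that names.dmp marks the preferred name as `scientific name`. All function names are hypothetical.

```python
import csv

RANKS = ["realm", "subrealm", "kingdom", "subkingdom", "phylum", "subphylum",
         "class", "subclass", "order", "suborder", "family", "subfamily",
         "genus", "subgenus", "species"]


def _fields(line):
    """Split a taxdump line on '|' and strip padding from each field."""
    return [f.strip() for f in line.split("|")]


def load_taxdump(nodes_path, names_path):
    """Read parent, rank and scientific name for every taxid."""
    parent, rank, name = {}, {}, {}
    with open(nodes_path) as fh:
        for line in fh:
            f = _fields(line)
            if len(f) >= 3 and f[0].isdigit() and f[1].isdigit():
                parent[int(f[0])] = int(f[1])
                rank[int(f[0])] = f[2]
    with open(names_path) as fh:
        for line in fh:
            f = _fields(line)
            if len(f) >= 4 and f[0].isdigit() and f[3] == "scientific name":
                name[int(f[0])] = f[1]
    return parent, rank, name


def lineage(taxid, parent, rank, name):
    """Walk from taxid up to the root, filling in the known ranks."""
    row = dict.fromkeys(RANKS, "")
    seen = set()  # guards against the root pointing to itself
    while taxid in parent and taxid not in seen:
        seen.add(taxid)
        r = rank[taxid].lower()
        if r in row:
            row[r] = name.get(taxid, "")
        taxid = parent[taxid]
    return row


def write_lineage_tsv(mapping_path, nodes_path, names_path, out_path):
    """Write one row per accession: protein ID followed by its lineage."""
    parent, rank, name = load_taxdump(nodes_path, names_path)
    with open(mapping_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerow(["Protein_id"] + [r.capitalize() for r in RANKS])
        for line in src:
            parts = line.split()
            if len(parts) != 2 or not parts[1].isdigit():
                continue
            row = lineage(int(parts[1]), parent, rank, name)
            writer.writerow([parts[0]] + [row[r] for r in RANKS])
```

Ranks that are missing from a lineage are left as empty cells, and accessions whose taxid is not in nodes.dmp get a row with only the protein ID filled in.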

Thank you