DiltheyLab / MetaMaps

Long-read metagenomic analysis

Custom Database creation #74

Open sum732 opened 11 months ago

sum732 commented 11 months ago

Hello,

I am having trouble creating a custom database and need some help resolving the issue. I found https://github.com/DiltheyLab/MetaMaps/issues/5 helpful and followed it to try to create a database. A snippet of the error is shown below:

Several taxon IDs specified in TxFile_NoPlasmid.txt are invalid, i.e. they don't appear in the taxonomy specified in ../Complete_Genomes_RefSeq_Bac_Vir_Fug_Pla_Pro/taxonomy/: kraken:taxid|1922674|NC_032554.1

However, this ID (and the others) do exist in the taxonomy files:

names.dmp

grep -w "1922674" ../Complete_Genomes_RefSeq_Bac_Vir_Fug_Pla_Pro/taxonomy/names.dmp
1922674 |       Beihai sipunculid worm virus 2  |               |       scientific name |

nodes.dmp

grep -w "1922674" ../Complete_Genomes_RefSeq_Bac_Vir_Fug_Pla_Pro/taxonomy/nodes.dmp 
1922674 |       1922348 |       species |       BS      |       9       |       1       |       1       |       1       |       0       |       1       |       1  |0       |               | 

Command used

perl /Path/MetaMaps-master/combineAndAnnotateReferences.pl --inputFileList TxFile_NoPlasmid.txt --outputFile MetaMapdb_Combined.fa --taxonomyInDirectory ../Complete_Genomes_RefSeq_Bac_Vir_Fug_Pla_Pro/taxonomy/ --taxonomyOutDirectory /Path/Kraken2_DB/MetaMapdb 

Here is a peek into the input file list:

head -n +3 TxFile_NoPlasmid.txt 
kraken:taxid|161|NZ_CP016056.1  ../Complete_Genomes_RefSeq_Bac_Vir_Fug_Pla_Pro/library/bacteria/library.fna
kraken:taxid|2702|NZ_LT629773.1 ../Complete_Genomes_RefSeq_Bac_Vir_Fug_Pla_Pro/library/bacteria/library.fna
kraken:taxid|573|NZ_AP024579.1  ../Complete_Genomes_RefSeq_Bac_Vir_Fug_Pla_Pro/library/bacteria/library.fna

I am not sure what is wrong. Any help will be much appreciated. Best, SM
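
The error message reports the whole string kraken:taxid|1922674|NC_032554.1 as the invalid taxon ID, whereas the grep commands above search for the bare number 1922674. One possible reading, which is an assumption and not confirmed anywhere in this thread, is that the first column of the input file list is compared as-is against the taxonomy. The Python sketch below only illustrates that difference; the function names are hypothetical and are not part of MetaMaps.

```python
import re

# Pattern for sequence IDs of the form kraken:taxid|<taxid>|<accession>
KRAKEN_ID = re.compile(r"^kraken:taxid\|(\d+)\|(\S+)$")

def split_kraken_id(seq_id):
    """Return (taxid, accession) for a kraken-style ID, or None if it does not match."""
    m = KRAKEN_ID.match(seq_id)
    if m is None:
        return None
    return m.group(1), m.group(2)

def taxids_in_names_dmp(lines):
    """Collect the taxon IDs (first '|'-separated field) from names.dmp-style lines."""
    ids = set()
    for line in lines:
        first = line.split("|", 1)[0].strip()
        if first:
            ids.add(first)
    return ids

# Values copied from the post above
names = ["1922674 |\tBeihai sipunculid worm virus 2\t|\t\t|\tscientific name\t|"]
known = taxids_in_names_dmp(names)
full_id = "kraken:taxid|1922674|NC_032554.1"
taxid, accession = split_kraken_id(full_id)

print(full_id in known)  # False: the full string is not a taxon ID in names.dmp
print(taxid in known)    # True: the bare numeric ID is present
```

If that reading is right, a list whose first column holds only the numeric ID might pass the check, but this would need to be confirmed against the script's own documentation.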

mhajimorad commented 11 months ago

Hi SM,

Is the file library.fna a FASTA-formatted file that contains all of your reference sequences, with the headers adhering to the format kraken:taxid|<taxid>|<accession number> ?

If so, I believe you can build the database by running the following command:

perl buildDB.pl --DB /path/to/desired/database/location --FASTAs /path/to/your/library.fna --taxonomy /path/to/taxonomy/folder
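
Before building, it may help to confirm that every header really follows the kraken:taxid|<taxid>|<accession number> format. The following is a minimal Python sketch of such a check; the function name and the example records are made up for illustration and do not come from MetaMaps or from the poster's files.

```python
import re

# A conforming header starts with >kraken:taxid|<digits>|<accession>
HEADER = re.compile(r"^>kraken:taxid\|\d+\|\S+")

def nonconforming_headers(fasta_lines):
    """Return the header lines that do not follow the kraken:taxid format."""
    return [
        line.rstrip("\n")
        for line in fasta_lines
        if line.startswith(">") and not HEADER.match(line)
    ]

# Made-up example records (the second accession is a placeholder)
example = [
    ">kraken:taxid|161|NZ_CP016056.1 example description\n",
    "ACGT\n",
    ">NZ_EXAMPLE.1 header without a taxid prefix\n",
    "GGCC\n",
]
print(nonconforming_headers(example))
```

For a real file, the open file handle can be passed directly, for example `with open("library.fna") as fh: bad = nonconforming_headers(fh)`. An empty result would suggest the headers are consistent.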

sum732 commented 11 months ago

Hello @mhajimorad,

Thank you for the quick response.

Yes, library.fna is a FASTA-formatted file that contains all reference sequences. However, there is a separate FASTA-formatted library.fna for each group (bacteria, viruses, plasmids, etc.). This is reflected in the paths listed in the input file TxFile_NoPlasmid.txt, as shown in the example above.

When I checked, only one group, the plasmid reference FASTA, does not adhere to the kraken:taxid|<taxid>|<accession number> format. Its headers show only accession numbers.

What changes are needed to make this work? Many thanks, SM

mhajimorad commented 11 months ago

Hi @sum732 ,

Please note my comments below pertain to attempting the perl buildDB.pl ... approach I had referenced in my previous message. I do not have experience using the perl combineAndAnnotateReferences.pl ... approach referenced in your original message.

To try to make the perl buildDB.pl ... approach work, I suggest the following:

1. You mention you have multiple FASTA files. Assuming the headers of all the sequences they contain adhere to the kraken:taxid|<taxid>|<accession number> format, use the Unix cat command to concatenate these separate files into a single FASTA file. This single FASTA file is then passed to the --FASTAs option of the perl buildDB.pl ... command (see my previous message).

2. For your plasmids FASTA sequences, you will need to eventually cat the file to the single file produced as part of step 1. BUT, I believe (I could be wrong) you will need to first update the respective headers so that they adhere to the kraken:taxid|<taxid>|<accession number> format. You can probably write a script (in Python or another scripting language) to automate this process, assuming there is a file that associates accession numbers with NCBI taxid numbers.

Depending on your priority for including plasmids, you may want to skip step 2 for now (especially if writing the script to update the plasmids' headers is going to take a bit of time) and see if you can get the approach to work by simply using the single FASTA file that includes your other things (bacteria, viruses, etc.). If it works, then we know the workflow is good, and it is worthwhile to subsequently spend some time to resolve the plasmids issue.
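
The header update described in step 2 could be sketched in Python roughly as follows. This assumes a mapping from accession numbers to NCBI taxids is already available as a dictionary; the function name, the placeholder accessions, and the taxid 12345 are all hypothetical and only serve to show the idea.

```python
def rewrite_headers(fasta_lines, accession_to_taxid):
    """Rewrite plain-accession headers to kraken:taxid|<taxid>|<accession>.

    Returns (new_lines, missing), where missing lists the accessions that
    had no entry in the mapping; their headers are left unchanged.
    """
    out, missing = [], []
    for line in fasta_lines:
        # Only touch headers that are not already in the kraken format
        if line.startswith(">") and not line.startswith(">kraken:taxid|"):
            header = line[1:].rstrip("\n")
            # The accession is assumed to be the first whitespace-free token
            accession, _, rest = header.partition(" ")
            taxid = accession_to_taxid.get(accession)
            if taxid is None:
                missing.append(accession)
                out.append(line)
                continue
            new_header = f">kraken:taxid|{taxid}|{accession}"
            if rest:
                new_header += " " + rest
            out.append(new_header + "\n")
        else:
            out.append(line)
    return out, missing

# Made-up plasmid records and mapping (placeholders, not real accessions)
mapping = {"PLASMID_ACC.1": "12345"}
records = [
    ">PLASMID_ACC.1 made-up plasmid\n",
    "ACGT\n",
    ">UNKNOWN_ACC.1\n",
    "TTGA\n",
]
new_records, missing = rewrite_headers(records, mapping)
print("".join(new_records), end="")
print(missing)
```

Accessions reported in `missing` would still need a taxid before the plasmid file is concatenated with the others as described in step 1.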

sum732 commented 11 months ago

Hi @mhajimorad, thank you for the suggestions. I will try using buildDB.pl directly. According to its help text, it can take multi-FASTA files, so I am trying that approach.

AlexanderDilthey commented 9 months ago

Hi @sum732, did things work out for you in the end?

sum732 commented 8 months ago

Hi @AlexanderDilthey, thanks for checking. Unfortunately, no. I tried focusing on just bacteria, but it still won't create the file as needed.