Open sum732 opened 11 months ago
Hi SM,
Is the file library.fna a FASTA-formatted file that contains all of your reference sequences, with the headers adhering to the format kraken:taxid|<taxid>|<accession number>?
If so, I believe you can build the database by running the command
perl buildDB.pl --DB /path/to/desired/database/location --FASTAs /path/to/your/library.fna --taxonomy /path/to/taxonomy/folder
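Before running the build, it may help to confirm that every header really follows the expected format. The following is a minimal sketch, not part of MetaMaps: the function name find_bad_headers and the accession ACC_0001.1 are hypothetical, and the regular expression only encodes the kraken:taxid|<taxid>|<accession number> shape described above.

```python
import re

# Assumed header shape: >kraken:taxid|<taxid>|<accession number>
# (an optional description after whitespace is tolerated by this check).
HEADER_RE = re.compile(r"^>kraken:taxid\|(\d+)\|(\S+)")


def find_bad_headers(lines):
    """Return the FASTA header lines that do not match the expected format."""
    bad = []
    for line in lines:
        if line.startswith(">") and not HEADER_RE.match(line):
            bad.append(line.rstrip("\n"))
    return bad


# Toy input: the second record uses a made-up, accession-only header.
example = [
    ">kraken:taxid|1922674|NC_032554.1\n",
    "ACGT\n",
    ">ACC_0001.1 hypothetical plasmid record\n",
    "TTGA\n",
]
bad_headers = find_bad_headers(example)
print(bad_headers)
```

For a real file, the same function can be given an open file handle, for example find_bad_headers(open("/path/to/your/library.fna")).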
Hello @mhajimorad,
Thank you for the quick response.
Yes, library.fna is a FASTA-formatted file that contains all reference sequences. However, there is a separate library.fna FASTA file for each of bacteria, viruses, plasmids, etc. This is reflected in the input TxFile_NoPlasmid.txt via paths, as shown above in the example.
When I checked, only one group, the plasmids reference FASTA, does not adhere to the kraken:taxid|<taxid>|<accession number> format. It only shows accession numbers.
What changes are needed to make this work? Many thanks, SM
Hi @sum732,
Please note that my comments below pertain to the perl buildDB.pl ... approach I referenced in my previous message. I do not have experience with the perl combineAndAnnotateReferences.pl ... approach referenced in your original message.
To try to make the perl buildDB.pl ... approach work, I suggest the following:
1. You mention you have multiple FASTA files. Assuming the headers of all the sequences they contain adhere to the kraken:taxid|<taxid>|<accession number> format, use the Unix cat command to concatenate these separate files into a single FASTA file. Pass this single FASTA file to the --FASTAs option of the perl buildDB.pl ... command (see my previous message).
2. For your plasmids FASTA sequences, you will eventually need to cat the file onto the single file produced in step 1. BUT, I believe (I could be wrong) you will first need to update the headers so that they adhere to the kraken:taxid|<taxid>|<accession number> format. You can probably write a script (in Python or another scripting language) to automate this process, assuming there is a file that associates accession numbers with NCBI taxid numbers.
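A script of the kind suggested in step 2 could look roughly like the sketch below. It rests on several assumptions: the mapping file is taken to be tab-separated with the accession in the first column and the taxid in the second (a hypothetical layout), the accession is taken to be the first whitespace-delimited token of each header, and the names load_accession_to_taxid and rewrite_headers as well as the accession ACC_0001.1 and taxid 12345 are made up for illustration. Whether buildDB.pl accepts a description after the accession has not been verified here.

```python
def load_accession_to_taxid(lines):
    """Parse 'accession<TAB>taxid' lines into a dict (hypothetical file layout)."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        accession, taxid = line.split("\t")[:2]
        mapping[accession] = taxid
    return mapping


def rewrite_headers(fasta_lines, mapping):
    """Yield FASTA lines, rewriting headers to kraken:taxid|<taxid>|<accession>."""
    for line in fasta_lines:
        # Sequence lines and already-formatted headers pass through unchanged.
        if not line.startswith(">") or line.startswith(">kraken:taxid|"):
            yield line
            continue
        parts = line[1:].rstrip("\n").split(None, 1)
        if not parts:
            raise ValueError("empty FASTA header")
        accession = parts[0]
        description = " " + parts[1] if len(parts) > 1 else ""
        if accession not in mapping:
            raise KeyError(f"no taxid known for accession {accession}")
        yield f">kraken:taxid|{mapping[accession]}|{accession}{description}\n"


# Toy data only: a made-up accession and taxid.
mapping = load_accession_to_taxid(["ACC_0001.1\t12345\n"])
fasta = [">ACC_0001.1 example plasmid\n", "ACGT\n"]
rewritten = list(rewrite_headers(fasta, mapping))
print("".join(rewritten), end="")
```

Once the plasmid headers are rewritten, the output can be written to a new file and concatenated with the other FASTA files as in step 1. Raising an error on an unmapped accession is a deliberate choice in this sketch, so that sequences without a taxid are noticed rather than silently passed through.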
Depending on your priority for including plasmids, you may want to skip step 2 for now (especially if writing the script to update the plasmids' headers is going to take a bit of time) and see if you can get the approach to work by simply using the single FASTA file that includes your other groups (bacteria, viruses, etc.). If it works, then we know the workflow is good, and it is worthwhile to subsequently spend some time resolving the plasmids issue.
Hi @mhajimorad, thank you for the suggestions. I will try using buildDB.pl directly.
As per the help, it can take multi-FASTA files, so I am trying that approach.
Hi @sum732, did things work out for you in the end?
Hi @AlexanderDilthey, thanks for checking. Unfortunately, no. I tried to focus on just bacteria, but it still won't create the file as needed.
Hello,
I am having issues creating a custom database and need some help resolving them. I found
https://github.com/DiltheyLab/MetaMaps/issues/5
to be helpful and, following it, tried to create a database. Snippet of the error shown:
Several taxon IDs specified in TxFile_NoPlasmid.txt are invalid, i.e. they don't appear in the taxonomy specified in ../Complete_Genomes_RefSeq_Bac_Vir_Fug_Pla_Pro/taxonomy/: kraken:taxid|1922674|NC_032554.1
However, this (and other) IDs do exist in names.dmp and nodes.dmp.
Command used
Here is a peek into the input file list
I am not sure what is wrong. Any help will be much appreciated. Best, SM
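One way to double-check that the taxon IDs really are present in the taxonomy is sketched below. It assumes the standard NCBI dump layout in which nodes.dmp fields are separated by tab, pipe, tab and the first field is the tax_id; the function names load_taxids and missing_taxids are hypothetical, the accession ACC_0001.1 and taxid 999999999 are made up, and the toy nodes lines carry invented parent and rank values. Note that the error above reports the full string kraken:taxid|1922674|NC_032554.1 as the taxon ID, so the sketch extracts only the numeric part before looking it up.

```python
import re


def load_taxids(nodes_lines):
    """Collect the first column (tax_id) from nodes.dmp-style lines."""
    taxids = set()
    for line in nodes_lines:
        first = line.split("|", 1)[0].strip()
        if first:
            taxids.add(first)
    return taxids


def missing_taxids(headers, known_taxids):
    """Return headers whose numeric taxid is absent from known_taxids."""
    missing = []
    for header in headers:
        match = re.search(r"kraken:taxid\|(\d+)\|", header)
        if match is None or match.group(1) not in known_taxids:
            missing.append(header)
    return missing


# Toy nodes.dmp lines; parent and rank values here are invented.
nodes = [
    "1\t|\t1\t|\tno rank\t|\n",
    "1922674\t|\t1\t|\tspecies\t|\n",
]
headers = [
    "kraken:taxid|1922674|NC_032554.1",
    "kraken:taxid|999999999|ACC_0001.1",  # made-up taxid and accession
]
known = load_taxids(nodes)
not_found = missing_taxids(headers, known)
print(not_found)
```

With real data, load_taxids can be given an open handle on the nodes.dmp file in the taxonomy folder, and headers can be the header lines of the FASTA files listed in the input file.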