antonisdim / haystac

Code repository for the HAYSTAC pipeline
MIT License

haystac using full nt database #14

Closed slennart closed 1 year ago

slennart commented 2 years ago

I'm interested in trying haystac and comparing its assignments with those I get using kraken2. So far I have been using the full nt database, so I'd like to build a comparable haystac database. Is this feasible, and could you kindly guide me on how to do this?

antonisdim commented 1 year ago

Hello,

I hope you are well, and I am really sorry for having missed this until now!

The underlying approach of haystac database is to include one genome/assembly per taxon/species. If the user provides their own NCBI query, the largest genome is chosen; for taxa in the Representative RefSeq database, the first accession per species (from the NCBI genome reports) is chosen. This default behaviour is overridden if the user specifies either an accession number or a sequence fasta file for a given taxon (or taxa).

For these reasons, haystac database does not currently support building a database from the full NCBI nt database automatically. That does not mean it is impossible to adapt the full nt for use with haystac database. The best approach I can think of right now is to create one fasta file for each taxon included in the nt database, containing all the fasta records that are assigned to that taxon. Then, by running haystac database --sequences-file <nt-db-file.tsv>, a haystac-compatible database will be built in which each taxon contains all the sequences available for it in the nt database.
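The per-taxon split described above could be sketched roughly as follows. This is only an illustration under several assumptions, not part of haystac: the function names, the `acc_to_taxon` dictionary (which you would have to build yourself, for example from the NCBI accession-to-taxid and taxonomy name files), and the `_nt` suffix used as a user-defined accession are all hypothetical. The three-column layout of the TSV (taxon name, accession, fasta path) is also an assumption here and should be checked against the haystac documentation for --sequences-file before use.

```python
import os
import re


def safe_name(taxon):
    """Replace spaces and special characters with underscores."""
    return re.sub(r"[^A-Za-z0-9]+", "_", taxon).strip("_")


def split_fasta_by_taxon(fasta_path, acc_to_taxon, out_dir):
    """Stream a FASTA file and append each record to a per-taxon file.

    acc_to_taxon is a hypothetical dict mapping the first token of each
    FASTA header (assumed to be the accession) to a taxon name.
    Records whose accession is not in the mapping are skipped.
    Files are opened in append mode, so start from an empty out_dir.
    """
    os.makedirs(out_dir, exist_ok=True)
    taxon_files = {}
    out = None
    try:
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    # A new record starts: close the previous output file.
                    if out is not None:
                        out.close()
                        out = None
                    fields = line[1:].split()
                    accession = fields[0] if fields else ""
                    taxon = acc_to_taxon.get(accession)
                    if taxon is not None:
                        name = safe_name(taxon)
                        path = os.path.join(out_dir, name + ".fasta")
                        taxon_files[name] = path
                        out = open(path, "a")
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return taxon_files


def write_sequences_tsv(taxon_files, tsv_path):
    """Write one row per taxon: name, made-up accession, fasta path.

    The column layout is an assumption; verify it against the haystac
    documentation before passing the file to --sequences-file.
    """
    with open(tsv_path, "w") as tsv:
        for name, path in sorted(taxon_files.items()):
            tsv.write(f"{name}\t{name}_nt\t{path}\n")
```

Reopening an output file for every record keeps the number of open file handles at one, which matters because the nt database covers a very large number of taxa, but it will be slow on the full nt file. Sorting the records by taxon beforehand would likely be a better choice at that scale.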

I recognise that this approach might be a bit laborious, as it requires you to create a custom fasta file for each taxon in the nt, so I can suggest a simpler approach if you would like to compare the performance of kraken2 and haystac. You could use haystac database --refseq-rep prokaryote_rep (or any other combination of flags) to construct a database with the species composition of the Representative RefSeq, and then use the fasta files under the database output directory, <haystac_db_output_directory>/bowtie/chunk*.fasta.gz, to create a custom kraken2 database (after concatenating all the fasta files under that path into a single one).
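The concatenation step could look something like the sketch below. The function name and the output file name are hypothetical, and the code assumes every chunk ends with a newline so that records from consecutive files do not run together. It writes an uncompressed fasta, on the assumption that this is the more convenient input for building a kraken2 library. Note that kraken2 also needs to be able to assign each sequence to a taxonomy ID, which this sketch does not address.

```python
import glob
import gzip
import os
import shutil


def concatenate_chunks(db_dir, out_fasta):
    """Decompress every bowtie/chunk*.fasta.gz under db_dir into one file.

    db_dir is the haystac database output directory; out_fasta is the
    path of the single uncompressed fasta file to create.
    Returns the list of chunk files that were concatenated.
    """
    pattern = os.path.join(db_dir, "bowtie", "chunk*.fasta.gz")
    chunks = sorted(glob.glob(pattern))
    if not chunks:
        raise FileNotFoundError(f"no files match {pattern}")
    with open(out_fasta, "wb") as out:
        for chunk in chunks:
            # Stream each gzip file so large chunks never sit in memory.
            with gzip.open(chunk, "rb") as handle:
                shutil.copyfileobj(handle, out)
    return chunks
```

For example, `concatenate_chunks("<haystac_db_output_directory>", "haystac_refseq_rep.fasta")` would produce one fasta file containing every sequence in the haystac database, where the output name is just a placeholder.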

I hope either of the above two suggestions works for you, and of course, if you have any more questions, feel free to ask them.

Thank you for your patience!
Antony