DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

concat downloaded sequences #165

Open oatesa opened 5 years ago

oatesa commented 5 years ago

Fairly new to this and working through the manual. Quick check: I'm interested in archaea, bacteria, viruses and fungi, so I use

centrifuge-download -o library -m -d "archaea,bacteria,viral,fungi" refseq > seqid2taxid.map

to download the reference sequences. The next step says to concatenate all downloaded sequences into a single file:

cat library/*/*.fna > input-sequences.fna

(1) Would I do this once for all of archaea, bacteria, viral and fungi together, or individually for each one?

If I also wanted to include vertebrate_mammalian using

centrifuge-download -o library -m -d "vertebrate_mammalian" -a "Chromosome" -t 9606 -c 'reference genome' >> seqid2taxid.map

(2) would this overwrite the contents of seqid2taxid.map or add to it?

Thanks in advance for any help

themouldinator commented 5 years ago

I'd suggest going for the whole shebang and making an NCBI index if you want all that! Using recentrifuge's rextract command you can then easily filter out the taxa you don't want to see turn up, based on tax ID. For example, -x 33630 -x 554915 -x 554296 -x 1401294 -x 193537 -x 3027 -x 33682 -x 207245 -x 38254 -x 2830 -x 2489521 -x 5752 -x 556282 -x 339960 -x 136087 -x 66288 -x 5719 -x 543769 -x 2763 -x 33634 -x 33090 -x 42452 -x 61964 takes out everything but fungi.

It's overkill for sure, but so easy to use.

feltzmc commented 5 years ago

I recommend doing the following steps to build a seqid2taxid.map file that will work with any refseq sequences you download:

wget https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
gunzip nucl_gb.accession2taxid.gz
cut -d $'\t' -f 2,3 nucl_gb.accession2taxid > seqid2taxid.map
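
For anyone who wants to see what that cut step produces before fetching the full NCBI file, here is a minimal sketch on a made-up sample. The accessions (ACC0001 and so on), the demo_map directory and the sample file name are all hypothetical, and the header row is an assumption about the layout of nucl_gb.accession2taxid. If the real file does start with a header, the plain cut above will copy it into the map as a first row, which the sketch avoids with tail. The last printf also illustrates question (2): >> appends to seqid2taxid.map, whereas > would replace its contents.

```shell
# Scratch directory (hypothetical name) so no real map file is touched.
mkdir -p demo_map

# Made-up sample in the assumed accession2taxid layout:
# accession <TAB> accession.version <TAB> taxid <TAB> gi
printf 'accession\taccession.version\ttaxid\tgi\n' > demo_map/sample.accession2taxid
printf 'ACC0001\tACC0001.1\t9606\t111\n' >> demo_map/sample.accession2taxid
printf 'ACC0002\tACC0002.2\t562\t222\n' >> demo_map/sample.accession2taxid

# Skip the header row, then keep columns 2 and 3 (accession.version, taxid).
# cut splits on tabs by default, so no -d option is needed here.
tail -n +2 demo_map/sample.accession2taxid | cut -f 2,3 > demo_map/seqid2taxid.map

# '>>' appends one more (made-up) entry; '>' here would have wiped the two rows above.
printf 'ACC0003.1\t10090\n' >> demo_map/seqid2taxid.map

cat demo_map/seqid2taxid.map
```

The result is a two-column, tab-separated file with one sequence ID and one taxonomy ID per row, which is the same shape as the seqid2taxid.map that centrifuge-download writes.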