abaizan / kodoja

Kodoja: identifying viruses from plant RNA sequencing data
MIT License

Provide minimal NCBI taxonomy names.dmp and nodes.dmp #12

Closed: peterjc closed this issue 6 years ago

peterjc commented 6 years ago

The test cases use just three viruses. Together with the parent nodes that make up their full lineages, only 10 taxonomy identifiers are needed:

$ python filter_taxonomy.py 137758 946046 12227
Filtering NCBI taxonomy files nodes.dmp and names.dmp
Will create nodes_.dmp and names_.dmp using just the given
3 entries and their parent nodes.
Loaded 1692822 entries from nodes.dmp
Expanded 3 given TaxID to a list of 10 including ancestors
Created nodes_.dmp
Created names_.dmp

With these changes, TravisCI will no longer download the full taxonomy; instead we provide this mini ten-entry taxonomy under version control.
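
For readers curious how this kind of filtering can work, here is a minimal sketch. It is not the actual filter_taxonomy.py, whose source is not shown in this thread. It assumes only the standard NCBI dump layout, where columns are separated by tab-pipe-tab and the first two columns of nodes.dmp are the TaxID and its parent TaxID. The function names and the toy TaxIDs (10, 20, 30, 40) are hypothetical and chosen for illustration.

```python
def load_parents(lines):
    """Map each TaxID to its parent TaxID from nodes.dmp-style lines."""
    parents = {}
    for line in lines:
        fields = line.split("\t|\t")
        parents[int(fields[0])] = int(fields[1])
    return parents


def expand_with_ancestors(taxids, parents):
    """Return the given TaxIDs plus every ancestor up to the root."""
    wanted = set()
    for taxid in taxids:
        # The NCBI root (TaxID 1) lists itself as its own parent, so the
        # walk stops once it reaches a node that was already collected.
        while taxid not in wanted:
            wanted.add(taxid)
            taxid = parents[taxid]
    return wanted


def filter_dmp(lines, wanted):
    """Keep only the lines whose first column is a wanted TaxID."""
    return [ln for ln in lines if int(ln.split("\t|\t", 1)[0]) in wanted]


# Toy stand-in for nodes.dmp (hypothetical TaxIDs, not real taxonomy data).
toy_nodes = [
    "1\t|\t1\t|\tno rank\t|\n",
    "10\t|\t1\t|\tfamily\t|\n",
    "20\t|\t10\t|\tspecies\t|\n",
    "30\t|\t10\t|\tspecies\t|\n",
    "40\t|\t1\t|\tspecies\t|\n",
]

toy_parents = load_parents(toy_nodes)
toy_wanted = expand_with_ancestors([20, 30], toy_parents)
toy_filtered = filter_dmp(toy_nodes, toy_wanted)
print(sorted(toy_wanted))  # [1, 10, 20, 30]
```

The same `filter_dmp` call can be applied to names.dmp lines, since that file also starts each line with the TaxID. Collecting ancestors into a set means shared lineage (here the toy family 10) is only walked once.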

peterjc commented 6 years ago

As a bonus, the TravisCI runs are now much faster. I presume that, on top of avoiding the download and unzip, the smaller taxonomy also speeds up Kraken and Kaiju.