dib-lab / charcoal

Remove contaminated contigs from genomes using k-mers and taxonomies.

allow gzipped lineage files #214

Open taylorreiter opened 2 years ago

taylorreiter commented 2 years ago

The GTDB lineage files are stored as gzip-compressed files. The rule make_contigs_search_taxonomy_wc fails with a gzipped lineage file:

Traceback (most recent call last):
  File "/home/tereiter/github/2022-dominating-set-differential-abundance-example/.snakemake/conda/df16191f60f78adeb9f40112bb67409b/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/tereiter/github/2022-dominating-set-differential-abundance-example/.snakemake/conda/df16191f60f78adeb9f40112bb67409b/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/tereiter/github/2022-dominating-set-differential-abundance-example/.snakemake/conda/df16191f60f78adeb9f40112bb67409b/lib/python3.9/site-packages/charcoal/contigs_search_taxonomy.py", line 151, in <module>
    returncode = cmdline(sys.argv[1:])
  File "/home/tereiter/github/2022-dominating-set-differential-abundance-example/.snakemake/conda/df16191f60f78adeb9f40112bb67409b/lib/python3.9/site-packages/charcoal/contigs_search_taxonomy.py", line 146, in cmdline
    return main(args)
  File "/home/tereiter/github/2022-dominating-set-differential-abundance-example/.snakemake/conda/df16191f60f78adeb9f40112bb67409b/lib/python3.9/site-packages/charcoal/contigs_search_taxonomy.py", line 27, in main
    tax_assign, _ = load_taxonomy_assignments(args.lineages_csv,
  File "/home/tereiter/github/2022-dominating-set-differential-abundance-example/.snakemake/conda/df16191f60f78adeb9f40112bb67409b/lib/python3.9/site-packages/sourmash/lca/command_index.py", line 39, in load_taxonomy_assignments
    first_row = next(iter(r))
  File "/home/tereiter/github/2022-dominating-set-differential-abundance-example/.snakemake/conda/df16191f60f78adeb9f40112bb67409b/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

It would be super convenient to allow gzipped lineage CSV files.
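
For what it's worth, the byte 0x8b at position 1 in the error matches the second byte of the gzip magic number (0x1f 0x8b), which is consistent with the compressed file being read as plain UTF-8 text. One possible workaround is sketched below: peek at the first two bytes and pick `gzip.open` or `open` accordingly. The helper names `open_text_maybe_gzipped` and `read_first_row` are made up for illustration and are not part of charcoal or sourmash; this is only a rough idea of the approach, not a proposed patch.

```python
import csv
import gzip

# First two bytes of every gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def open_text_maybe_gzipped(path, encoding="utf-8"):
    """Open `path` for text reading, decompressing if it looks gzipped.

    Hypothetical helper: detection is based on the magic bytes rather than
    the file extension, so a misnamed file is still handled.
    """
    with open(path, "rb") as fp:
        magic = fp.read(2)
    if magic == GZIP_MAGIC:
        # newline="" is what the csv module recommends for its input files.
        return gzip.open(path, "rt", encoding=encoding, newline="")
    return open(path, "rt", encoding=encoding, newline="")


def read_first_row(path):
    """Return the first CSV row of a plain or gzipped lineage file."""
    with open_text_maybe_gzipped(path) as fp:
        return next(csv.reader(fp))
```

With something like this, the same call would work for both `lineages.csv` and `lineages.csv.gz`, so the Snakemake rule would not need to care which one it was given.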