dib-lab / charcoal

Remove contaminated contigs from genomes using k-mers and taxonomies.

updating default scaled to 2000 caused new value error #217

Closed by taylorreiter 2 years ago

taylorreiter commented 2 years ago

#215 changed the default scaled value in charcoal/conf/system.conf to 2000.

@ccbaumler is using charcoal to run the workflow in this repo: https://github.com/taylorreiter/2022-dominating-set-differential-abundance-example, and charcoal produces the following error:

[Thu Jun 23 09:15:15 2022]
rule make_contigs_search_taxonomy_wc:
    input: outputs/query_genomes/GCF_008121495.1_genomic.fna.gz, outputs/query_genomes_charcoal/stage1/GCF_008121495.1_genomic.fna.gz.sig, outputs/query_genomes_charcoal/stage1/GCF_008121495.1_genomic.fna.gz.matches.csv, inputs/gtdb-rs207.taxonomy.csv, inputs/gtdb-rs207.genomic-reps.dna.k31.zip
    output: outputs/query_genomes_charcoal/stage1/GCF_008121495.1_genomic.fna.gz.contigs-tax.json
    jobid: 8
    wildcards: g=GCF_008121495.1_genomic.fna.gz

Activating conda environment: /home/baumlerc/2022-ddaV2/2022-dominating-set-differential-abundance-example/.snakemake/conda/8109b21a360616fa66475257ae47c1b0
examining spreadsheet headers...
** assuming column 'ident' is identifiers in spreadsheet
loaded 317542 tax assignments.
loaded 341 matches from 'outputs/query_genomes_charcoal/stage1/GCF_008121495.1_genomic.fna.gz.matches.csv'
Traceback (most recent call last):
  File "/home/baumlerc/2022-ddaV2/2022-dominating-set-differential-abundance-example/.snakemake/conda/8109b21a360616fa66475257ae47c1b0/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/baumlerc/2022-ddaV2/2022-dominating-set-differential-abundance-example/.snakemake/conda/8109b21a360616fa66475257ae47c1b0/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/baumlerc/2022-ddaV2/2022-dominating-set-differential-abundance-example/.snakemake/conda/8109b21a360616fa66475257ae47c1b0/lib/python3.9/site-packages/charcoal/contigs_search_taxonomy.py", line 159, in <module>
    returncode = cmdline(sys.argv[1:])
  File "/home/baumlerc/2022-ddaV2/2022-dominating-set-differential-abundance-example/.snakemake/conda/8109b21a360616fa66475257ae47c1b0/lib/python3.9/site-packages/charcoal/contigs_search_taxonomy.py", line 154, in cmdline
    return main(args)
  File "/home/baumlerc/2022-ddaV2/2022-dominating-set-differential-abundance-example/.snakemake/conda/8109b21a360616fa66475257ae47c1b0/lib/python3.9/site-packages/charcoal/contigs_search_taxonomy.py", line 62, in main
    if genome_sig.similarity(ss) == 1.0:
  File "/home/baumlerc/2022-ddaV2/2022-dominating-set-differential-abundance-example/.snakemake/conda/8109b21a360616fa66475257ae47c1b0/lib/python3.9/site-packages/sourmash/signature.py", line 136, in similarity
    return self.minhash.similarity(other.minhash,
  File "/home/baumlerc/2022-ddaV2/2022-dominating-set-differential-abundance-example/.snakemake/conda/8109b21a360616fa66475257ae47c1b0/lib/python3.9/site-packages/sourmash/minhash.py", line 692, in similarity
    return self._methodcall(lib.kmerminhash_similarity,
  File "/home/baumlerc/2022-ddaV2/2022-dominating-set-differential-abundance-example/.snakemake/conda/8109b21a360616fa66475257ae47c1b0/lib/python3.9/site-packages/sourmash/utils.py", line 25, in _methodcall
    return rustcall(func, self._get_objptr(), *args)
  File "/home/baumlerc/2022-ddaV2/2022-dominating-set-differential-abundance-example/.snakemake/conda/8109b21a360616fa66475257ae47c1b0/lib/python3.9/site-packages/sourmash/utils.py", line 78, in rustcall
    raise exc
ValueError: mismatch in scaled; comparison fail

So the change broke contigs_search_taxonomy.py, which runs in rule make_contigs_search_taxonomy_wc. Whoops. Hunting it down now.
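For anyone reading along, the ValueError comes from comparing two sketches that were built with different scaled values. Here is a toy illustration of that check. It is not sourmash's implementation: `ToySketch`, `kmer_hash`, the SHA-1-based hashing, and the tiny ksize/scaled values are all made up for the example.

```python
import hashlib

MAX_HASH = 2**64  # size of the toy 64-bit hash space


def kmer_hash(kmer: str) -> int:
    """Hypothetical hash: first 8 bytes of SHA-1 (sourmash uses a different hash)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


class ToySketch:
    """Toy scaled sketch: keeps only hashes below MAX_HASH // scaled."""

    def __init__(self, scaled: int):
        self.scaled = scaled
        self.hashes = set()

    def add_sequence(self, seq: str, ksize: int = 5) -> None:
        threshold = MAX_HASH // self.scaled
        for i in range(len(seq) - ksize + 1):
            h = kmer_hash(seq[i:i + ksize])
            if h < threshold:
                self.hashes.add(h)

    def similarity(self, other: "ToySketch") -> float:
        # Sketches with different scaled values kept different fractions
        # of the hash space, so a direct Jaccard comparison is refused.
        if self.scaled != other.scaled:
            raise ValueError("mismatch in scaled; comparison fail")
        union = self.hashes | other.hashes
        if not union:
            return 0.0
        return len(self.hashes & other.hashes) / len(union)


seq = "ACGTTGCAAGGCTTACGATCGATCGGCTA"
genome = ToySketch(scaled=2)  # stands in for the sketch built from the new config default
match = ToySketch(scaled=1)   # stands in for a database signature
genome.add_sequence(seq)
match.add_sequence(seq)

try:
    genome.similarity(match)
except ValueError as exc:
    print(f"comparison failed: {exc}")
```

The same sequence sketched at two different scaled values cannot be compared directly, which matches the failure in the traceback at `genome_sig.similarity(ss)`.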

taylorreiter commented 2 years ago

rule make_contigs_search_taxonomy_wc doesn't take a scaled parameter on the command line, but it is fed by outputs from rule prefetch_all_matches_wc, which does have gather_scaled = config['gather_scaled'] as a param.

Trying to figure out what could be causing the mismatch, I searched for 1000 in the repo to see if that scaled value is lingering anywhere.

Not really relevant, but this line of code might need to be changed to use *scaled instead of *1000:

https://github.com/dib-lab/charcoal/blob/ac44310e0c1c1339979153e008cb327648e590a2/charcoal/compare_taxonomy.py#L29
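To show why a hard-coded 1000 matters, here is a small sketch. It assumes the multiplication converts a hash count into an approximate base-pair count. I have not reproduced the linked line here, so treat that as an assumption. `estimated_bp` and the numbers are hypothetical.

```python
def estimated_bp(n_hashes: int, scaled: int) -> int:
    """Approximate sequence size covered by a scaled sketch.

    Each retained hash stands in for roughly `scaled` k-mers,
    so the estimate is simply n_hashes * scaled.
    """
    return n_hashes * scaled


n_hashes = 2500  # hypothetical number of hashes in a sketch

hardcoded = n_hashes * 1000                          # assumes scaled == 1000
from_config = estimated_bp(n_hashes, scaled=2000)    # uses the configured value

print(hardcoded, from_config)
```

With scaled set to 2000, the hard-coded version reports half the size that the configured value would give, so the two only agree while the default stays at 1000.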

taylorreiter commented 2 years ago

The scaled value seems to be inherited in all of the correct places, so I checked the sourmash docs, and the rs207 dbs are scaled 1000 🙄

https://github.com/sourmash-bio/sourmash/blob/6c7b3a82e4c9ef8b7aac8823ee8863012be89773/doc/databases.md#types-of-databases

The line of code that is causing this problem is here:

https://github.com/dib-lab/charcoal/blob/latest/charcoal/contigs_search_taxonomy.py#L59-#L66

I'll update the default scaled values back to 1000.
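For reference, the other general way to reconcile a scaled=1000 database with a scaled=2000 query is to downsample the finer sketch to the coarser value before comparing. That is not the fix chosen here. The toy sketch below only illustrates the idea: `downsample`, `jaccard`, and the evenly spaced fake hashes are invented for the example and are not charcoal or sourmash code.

```python
MAX_HASH = 2**64  # size of the toy 64-bit hash space


def downsample(hashes: set, from_scaled: int, to_scaled: int) -> set:
    """Keep only the hashes that a sketch built at `to_scaled` would retain."""
    if to_scaled < from_scaled:
        # The discarded hashes are gone, so a sketch cannot be made finer.
        raise ValueError("cannot downsample to a smaller scaled value")
    threshold = MAX_HASH // to_scaled
    return {h for h in hashes if h < threshold}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Fake, evenly spaced hashes standing in for every k-mer of one genome.
all_hashes = {i * (MAX_HASH // 10000) for i in range(10000)}

db_1000 = downsample(all_hashes, 1, 1000)     # like a database signature
query_2000 = downsample(all_hashes, 1, 2000)  # like a query at the new default

# Bring the database sketch to the query's scaled value, then compare.
db_2000 = downsample(db_1000, 1000, 2000)
print(jaccard(query_2000, db_2000))
```

Downsampling only works in one direction, from a smaller scaled value to a larger one, so a scaled=2000 query can never be brought down to match a scaled=1000 database. Setting the defaults back to 1000 avoids the question entirely.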

taylorreiter commented 2 years ago

Closed with #218.