Errors when doing the full configure & run with the 10 MAGs from Delmont et al.

After moving forward from the last issue I posted I wanted to try out the example using the sourmash GTDB database and the 10 MAGs from Delmont et al. (as this is exactly what I want to ultimately do for my own data). So to start from the beginning of that example first it asks to install the sourmash database for GTDB with this command:

> charcoal download-db

When I do that I end out with this output directory and contents:


(charcoal) $ du -sh db
1.4G    db
(charcoal) $ ls db/
gtdb-release89-k31.sbt.zip  gtdb-release89-lineages.csv_

Everything seems to work there. Following this I am able to download the example genomes and initiate a new project as indicated.

When I get to the "dry run" and try this command I get the following error:

(charcoal) $ python -m charcoal run newproject.conf -n
** ERROR: database db/gtdb-r95.nucleotide-k31-scaled1000.sbt.zip does not exist.
** exiting.
SystemExit in line 77 of /home/charcoal/charcoal/Snakefile:
-1
  File "/home/charcoal/charcoal/Snakefile", line 77, in <module>
Error in snakemake invocation: Command '['snakemake', '-s', '/home/charcoal/charcoal/Snakefile', '--use-conda', '-j', '1', '-n', '--configfile', '/home/charcoal/charcoal/conf/defaults.conf', '/home/charcoal/charcoal/conf/system.conf', 'newproject.conf']' returned non-zero exit status 1.

Based on that error I realized the db files being referenced in `charcoal/conf/system.conf' don't match up with what is actually installed into the db directory when downloading the charcoal db at the beginning. So I made an initial assumption that the github is just a bit behind the active development so I went ahead and tried changing the db names in system.conf to look like this (I kept everything else the same in that file):

###
### installation/system-wide configuration - sysadmins, modify this :)
###

# sourmash query databases for contamination (SBTs, LCAs, or signatures)
gather_db:
- db/gtdb-release89-k31.sbt.zip
# lineages CSV containing reference lineages in query database.
# Must correspond to signatures in query databases (e.g. gtdb.csv).
lineages_csv: db/gtdb-release89-lineages.csv
....

When I do that the 'dry run' works without an error so I then tried to run the actual analysis which landed me with a new error after appearing to successfully creating sourmash dbs for each of the 10 genomes:

examining spreadsheet headers...
** assuming column 'accession' is identifiers in spreadsheet
** assuming column 'filename' is superkingdom in spreadsheet
** assuming column 'superkingdom' is phylum in spreadsheet
whoa, too many assumptions. are the headers right?
expecting identifiers,superkingdom,phylum,class,order,family,genus,species,strain
[Tue Feb  2 15:07:19 2021]
Error in rule make_contigs_taxonomy_json:
    jobid: 8
    output: output.newproject/stage1/TARA_PON_MAG_00061.fa.contigs-tax.json
    conda-env: /home/charcoal/.snakemake/conda/7a68231f
    shell:

        python -m charcoal.contigs_search_taxonomy             --genome example-genomes/TARA_PON_MAG_00061.fa --lineages-csv db/gtdb-release89-lineages.csv             --genome-sig output.newproject/stage1/TARA_PON_MAG_00061.fa.sig             --matches-sig output.newproject/stage1/TARA_PON_MAG_00061.fa.matches.sig             --json-out output.newproject/stage1/TARA_PON_MAG_00061.fa.contigs-tax.json             --match-rank order

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Activating conda environment: /home/charcoal/.snakemake/conda/7a68231f
examining spreadsheet headers...
** assuming column 'accession' is identifiers in spreadsheet
** assuming column 'filename' is superkingdom in spreadsheet
** assuming column 'superkingdom' is phylum in spreadsheet
whoa, too many assumptions. are the headers right?
expecting identifiers,superkingdom,phylum,class,order,family,genus,species,strain
[Tue Feb  2 15:07:20 2021]
Error in rule make_contigs_taxonomy_json:
    jobid: 40
    output: output.newproject/stage1/TARA_PSW_MAG_00046.fa.contigs-tax.json
    conda-env: /home/charcoal/.snakemake/conda/7a68231f
    shell:

        python -m charcoal.contigs_search_taxonomy             --genome example-genomes/TARA_PSW_MAG_00046.fa --lineages-csv db/gtdb-release89-lineages.csv             --genome-sig output.newproject/stage1/TARA_PSW_MAG_00046.fa.sig             --matches-sig output.newproject/stage1/TARA_PSW_MAG_00046.fa.matches.sig             --json-out output.newproject/stage1/TARA_PSW_MAG_00046.fa.contigs-tax.json             --match-rank order

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

== This is sourmash version 4.0.0a3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: example-genomes/TARA_RED_MAG_0... (k=31, DNA)
loaded 1 databases.

3 matches:
similarity   match
----------   -----
 32.0%       GCA_002471575 s__UBA830 sp002471575
  0.5%       GCA_002709525 s__UBA3470 sp002709525
  0.1%       GCA_002937455 s__UBA830 sp002937455
saving all matched signatures to "output.newproject/stage1/TARA_RED_MAG_00102.fa.matches.sig"
[Tue Feb  2 15:07:22 2021]
Finished job 14.
14 of 45 steps (31%) done

== This is sourmash version 4.0.0a3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: example-genomes/TARA_RED_MAG_0... (k=31, DNA)
loaded 1 databases.

0 matches:
similarity   match
----------   -----
saving all matched signatures to "output.newproject/stage1/TARA_RED_MAG_00119.fa.matches.sig"
[Tue Feb  2 15:07:24 2021]
Finished job 6.
15 of 45 steps (33%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/charcoal/.snakemake/log/2021-02-02T150709.848519.snakemake.log
Error in snakemake invocation: Command '['snakemake', '-s', '/home/charcoal/charcoal/Snakefile', '--use-conda', '-j', '1', '-j', '4', '--conf

Reading into the error message it looked like an issue with the gtdb lineages file (gtdb-release89-lineages.csv) and based on the "assumptions" error output I removed the filename column and just kept the accession tried to run again which got me this error:

[Tue Feb  2 15:29:56 2021]
Error in rule download_matching_genomes_one_by_one:
    jobid: 0
    output: genbank_genomes/GCF_000020585_genomic.fna.gz

RuleException:
HTTPError in line 425 of /home/charcoal/charcoal/Snakefile:
HTTP Error 404: Not Found
  File "/home/charcoal/charcoal/Snakefile", line 425, in __rule_download_matching_genomes_one_by_one
  File "/home/anaconda3/envs/charcoal/lib/python3.9/urllib/request.py", line 214, in urlopen
  File "/home/anaconda3/envs/charcoal/lib/python3.9/urllib/request.py", line 523, in open
  File "/home/anaconda3/envs/charcoal/lib/python3.9/urllib/request.py", line 632, in http_response
  File "/home/anaconda3/envs/charcoal/lib/python3.9/urllib/request.py", line 561, in error
  File "/home/anaconda3/envs/charcoal/lib/python3.9/urllib/request.py", line 494, in _call_chain
  File "/home/anaconda3/envs/charcoal/lib/python3.9/urllib/request.py", line 641, in http_error_default
  File "/home/anaconda3/envs/charcoal/lib/python3.9/concurrent/futures/thread.py", line 52, in run
Exiting because a job execution failed. Look above for error message
** read 3 provided lineages
** config file checks PASSED!
** from here on out, it's all snakemake...
Job counts:
    count   jobs
    1   download_matching_genomes_one_by_one
    1
downloading genome for acc GCA_003215175/Acidimicrobiales bacterium AG-410-I20 from NCBI...
[Tue Feb  2 15:29:56 2021]
Finished job 121.
120 of 162 steps (74%) done
...wrote 463005 bytes to genbank_genomes/GCA_003215175_genomic.fna.gz
[Tue Feb  2 15:29:57 2021]
Finished job 158.
121 of 162 steps (75%) done
retrieved for GCF_000172635 - Alteromonas macleodii ATCC 27126
[Tue Feb  2 15:29:59 2021]
Finished job 115.
122 of 162 steps (75%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/charcoal/.snakemake/log/2021-02-02T152844.898859.snakemake.log
Error in snakemake invocation: Command '['snakemake', '-s', '/home/charcoal/charcoal/Snakefile', '--use-conda', '-j', '1', '-j', '4', '--configfile', '/home/charcoal/charcoal/conf/defaults.conf', '/home/charcoal/charcoal/conf/system.conf', 'newproject.conf']' returned non-zero exit status 1.

This time I got a 'genbank_info' and 'genbank_genomes' directory that formed. The error seemed to be because it wasn't finding the genome file using the ftp address which I confirmed when I looked at the genbank info file and tried to do it myself, if you look at it there is a new version of that genome: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/020/585/GCF_000020585.1_ASM2058v1/

https://www.ncbi.nlm.nih.gov/assembly/GCF_000020585.3/

Now it seems like in the 'genbank_genomes.py' script that when it parses that accession id if there is no '.' it automatically assumes that it should be versioned as .1

Running with that I then re-edited the file so that it had the full version information because the accession column didn't have the accession version info. Which landed me with another new error:

Activating conda environment: /home/charcoal/.snakemake/conda/7a68231f
examining spreadsheet headers...
** assuming column 'filename' is identifiers in spreadsheet
loaded 24706 tax assignments.
loaded 22 matches from 'output.newproject/stage1/TARA_PON_MAG_00061.fa.matches.sig'
Traceback (most recent call last):
  File "/home/charcoal/.snakemake/conda/7a68231f/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/charcoal/.snakemake/conda/7a68231f/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/charcoal/charcoal/contigs_search_taxonomy.py", line 130, in <module>
    returncode = cmdline(sys.argv[1:])
  File "/home/charcoal/charcoal/contigs_search_taxonomy.py", line 125, in cmdline
    return main(args)
  File "/home/charcoal/charcoal/contigs_search_taxonomy.py", line 75, in main
    lineage = tax_assign[ident]
KeyError: 'GCF_000730385'
[Tue Feb  2 16:08:16 2021]
Error in rule make_contigs_taxonomy_json:
    jobid: 17
    output: output.newproject/stage1/TARA_PON_MAG_00061.fa.contigs-tax.json
    conda-env: /home/charcoal/.snakemake/conda/7a68231f
    shell:

        python -m charcoal.contigs_search_taxonomy             --genome example-genomes/TARA_PON_MAG_00061.fa --lineages-csv db/gtdb-release89-lineages.v3.csv             --genome-sig output.newproject/stage1/TARA_PON_MAG_00061.fa.sig             --matches-sig output.newproject/stage1/TARA_PON_MAG_00061.fa.matches.sig             --json-out output.newproject/stage1/TARA_PON_MAG_00061.fa.contigs-tax.json             --match-rank order

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

== This is sourmash version 4.0.0a3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: example-genomes/TARA_PSW_MAG_0... (k=31, DNA)
loaded 1 databases.

2 matches:
similarity   match
----------   -----
  0.5%       GCA_002730445 s__MGIIb-O3 sp002730445
  0.1%       GCA_002497765 s__Poseidonia sp002497765
saving all matched signatures to "output.newproject/stage1/TARA_PSW_MAG_00136.fa.matches.sig"
[Tue Feb  2 16:08:17 2021]
Finished job 30.
6 of 45 steps (13%) done

== This is sourmash version 4.0.0a3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: example-genomes/TARA_ION_MAG_0... (k=31, DNA)
loaded 1 databases.

2 matches:
similarity   match
----------   -----
  0.8%       GCA_002707085 s__GCA-2707085 sp002707085
  0.1%       GCA_002694945 s__UBA9659 sp002694945
saving all matched signatures to "output.newproject/stage1/TARA_ION_MAG_00072.fa.matches.sig"
[Tue Feb  2 16:08:17 2021]
Finished job 26.
7 of 45 steps (16%) done

== This is sourmash version 4.0.0a3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: example-genomes/TARA_RED_MAG_0... (k=31, DNA)
loaded 1 databases.

0 matches:
similarity   match
----------   -----
saving all matched signatures to "output.newproject/stage1/TARA_RED_MAG_00119.fa.matches.sig"
[Tue Feb  2 16:08:17 2021]
Finished job 16.
8 of 45 steps (18%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/charcoal/.snakemake/log/2021-02-02T160806.680877.snakemake.log
Error in snakemake invocation: Command '['snakemake', '-s', '/home/charcoal/charcoal/Snakefile', '--use-conda', '-j', '1', '-j', '4', '--configfile', '/home/charcoal/charcoal/conf/defaults.conf', '/home/charcoal/charcoal/conf/system.conf', 'newproject.conf']' returned non-zero exit status 1.

I hit a bit of a wall after this, best guess after going down the rabbit hole is that something in one portion of the background scripts is parsing the accessions to make a lineage dictionary without striping everything after the '.' in the accession and another part does strip it but I may be completely off there. This also seems to only be an issue in a situation where you are pulling genbank info for MAGs with a version that isn't '.1'.

dib-lab / charcoal

Errors when doing the full configure & run with the 10 MAGs from Delmont et al. #169