Closed: edgraham closed this issue 3 years ago
Hi, thank you very much -- you are correct that the lineage files were out of date ;).
I didn't have time to do much more than run the quickstart myself, but I fixed what I believe to be the source of your problems over here -
https://github.com/dib-lab/charcoal/pull/170
please let me know if you get a chance to try it. I'll go through this issue more thoroughly in a bit (either this week or next).
thank you for your perseverance!
After moving forward from the last issue I posted, I wanted to try out the example using the sourmash GTDB database and the 10 MAGs from Delmont et al. (as this is exactly what I ultimately want to do for my own data). The example first asks you to install the sourmash database for GTDB with this command:
`charcoal download-db`
When I do that I end up with this output directory and contents:
Everything seems to work there. Following this I am able to download the example genomes and initiate a new project as indicated.
When I get to the "dry run" and try this command I get the following error:
Based on that error I realized the db files referenced in 'charcoal/conf/system.conf' don't match what is actually installed into the db directory when downloading the charcoal db at the beginning. I assumed the GitHub repo was just a bit behind active development, so I changed the db names in system.conf to look like this (I kept everything else in that file the same):
When I do that the 'dry run' works without an error, so I then tried to run the actual analysis. That landed me with a new error after it appeared to successfully create sourmash dbs for each of the 10 genomes:
Reading the error message, it looked like an issue with the GTDB lineages file (gtdb-release89-lineages.csv). Based on the "assumptions" error output, I removed the filename column, kept just the accession, and tried to run again, which got me this error:
This time 'genbank_info' and 'genbank_genomes' directories were created. The error seemed to be that it wasn't finding the genome file at the FTP address, which I confirmed by looking at the genbank info file and trying the download myself. If you look, there is a new version of that genome:
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/020/585/GCF_000020585.1_ASM2058v1/
https://www.ncbi.nlm.nih.gov/assembly/GCF_000020585.3/
Now it seems that when the 'genbank_genomes.py' script parses an accession ID that contains no '.', it automatically assumes the accession should be versioned as '.1'.
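To make the suspected behavior concrete, here is a minimal sketch of that kind of parsing. This is not the actual code from 'genbank_genomes.py'; the function name `split_accession` and the `default_version` parameter are hypothetical and only illustrate the assumption I think is being made.

```python
def split_accession(acc, default_version="1"):
    """Split an accession such as 'GCF_000020585.3' into (base, version).

    If the accession has no '.', fall back to default_version. That fallback
    is the assumption that seems to break for genomes whose current version
    is not '.1'. (Hypothetical helper, not charcoal's implementation.)
    """
    base, sep, version = acc.partition(".")
    if not sep:
        # No explicit version in the input, so guess one.
        version = default_version
    return base, version


# An unversioned accession is silently treated as version 1.
base, version = split_accession("GCF_000020585")
print(f"{base}.{version}")
```

With a fallback like this, an unversioned 'GCF_000020585' would be looked up as 'GCF_000020585.1' even though the current assembly is 'GCF_000020585.3'.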
Running with that, I re-edited the file so that it had the full version information, because the accession column didn't include the accession version. That landed me with another new error:
I hit a bit of a wall after this. My best guess after going down the rabbit hole is that one part of the background scripts parses the accessions to make a lineage dictionary without stripping everything after the '.' in the accession, while another part does strip it, but I may be completely off there. This also seems to be an issue only when you are pulling genbank info for MAGs with a version that isn't '.1'.
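Here is a small sketch of the mismatch I suspect. It is only a guess at the failure mode, not charcoal's code: `strip_version`, `build_lineages`, and the placeholder lineage string are all hypothetical.

```python
def strip_version(acc):
    """Drop everything after the first '.' in an accession."""
    return acc.split(".")[0]


def build_lineages(rows, normalize=False):
    """Build an accession -> lineage dict from (accession, lineage) pairs.

    With normalize=False the keys keep their version suffix, which is the
    suspected source of the mismatch. (Hypothetical helper.)
    """
    lineages = {}
    for acc, lineage in rows:
        key = strip_version(acc) if normalize else acc
        lineages[key] = lineage
    return lineages


# Placeholder lineage value, for illustration only.
rows = [("GCF_000020585.3", "placeholder-lineage")]

# One code path strips the version before looking the accession up...
query = strip_version("GCF_000020585.3")

# ...but the dictionary was keyed on the full, versioned accession.
print(query in build_lineages(rows))                  # False
print(query in build_lineages(rows, normalize=True))  # True
```

If something like this is going on, normalizing the accession the same way on both sides (always stripping the version, or never stripping it) should make the lookup consistent.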