Closed: edgraham closed this issue 3 years ago
Hi, thank you very much -- you are correct that the lineage files were out of date ;).
I didn't have time to do much more than run the quickstart myself, but I fixed what I believe to be the source of your problems over here -
please let me know if you get a chance to try it. I'll go through this issue more thoroughly in a bit (either this week or next).
Thank you for your perseverance!
After moving forward from the last issue I posted, I wanted to try the example that uses the sourmash GTDB database and the 10 MAGs from Delmont et al. (this is exactly what I ultimately want to do with my own data). Starting from the beginning, the example first asks you to install the sourmash database for GTDB with this command:
charcoal download-db
When I do that I end up with this output directory and contents:
Everything seems to work there. Following this I am able to download the example genomes and initiate a new project as indicated.
When I get to the "dry run" and try this command I get the following error:
Based on that error, I realized that the db files referenced in `charcoal/conf/system.conf` don't match what is actually installed into the db directory when downloading the charcoal db at the beginning. I assumed the GitHub repository is just a bit behind active development, so I changed the db names in system.conf to look like this (I kept everything else in that file the same):
When I do that, the 'dry run' works without an error. I then tried to run the actual analysis, which gave me a new error after it appeared to successfully create sourmash dbs for each of the 10 genomes:
Reading the error message, it looked like an issue with the GTDB lineages file (gtdb-release89-lineages.csv). Based on the "assumptions" error output, I removed the filename column, kept only the accession column, and ran it again, which got me this error:
This time, 'genbank_info' and 'genbank_genomes' directories were created. The error seemed to occur because the genome file could not be found at the FTP address, which I confirmed by looking at the genbank info file and trying the download myself. If you look at it, there is a new version of that genome:
Now it seems that in the '' script, when it parses that accession ID and there is no '.', it automatically assumes that it should be versioned as
Running with that, I re-edited the file so that it had the full version information, because the accession column didn't include the accession version. That landed me with another new error:
I hit a bit of a wall after this. My best guess, after going down the rabbit hole, is that one part of the background scripts parses the accessions to build a lineage dictionary without stripping everything after the '.' in the accession, while another part does strip it, but I may be completely off there. This also seems to be an issue only when you are pulling GenBank info for MAGs with a version that isn't '.1'.