Open GGasch opened 7 months ago
Don't know.
One option would be for pcalf-datasets to download only the latest version of an assembly, but you might miss some ccyA+ genomes that way.
What could be done on the pcalf-annotation side is to add a dereplication step (dRep or inStrain, for example) before CheckM / GTDB-Tk, to reduce the computation time of those steps. I am not sure about the net gain, though, since dRep itself can take a while.
Either way, I agree with you: merging several versions of an assembly could reduce the complexity of the final dataset.
Good points.
Do you know whether a new version of a genome means the older one is obsolete (NCBI seems to hint at that)? If so, we can focus only on the latest version.
I will try to integrate dRep into the tool to get a more user-friendly output.
When a genome has several versions (for instance GCF_003555505.1 and GCA_003555505.2), should we merge the entries in the results of the analysis?
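
To make the question concrete, here is a minimal sketch of what "merging" could look like: group accessions by their shared numeric identifier and keep one representative per group. This is not pcalf code. The function names (`parse_accession`, `group_by_assembly`, `pick_representative`) are hypothetical, and two choices are assumptions that would need confirming: that a GCA_/GCF_ pair with the same number should be treated as the same assembly, and that the highest version wins (with GCF_ preferred on a tie).

```python
import re
from collections import defaultdict

# Assembly accessions look like GCA_003555505.2:
# prefix (GCA = GenBank, GCF = RefSeq), 9-digit identifier, version.
ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def parse_accession(accession):
    """Split an accession into (prefix, identifier, version)."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, identifier, version = match.groups()
    return prefix, identifier, int(version)


def group_by_assembly(accessions):
    """Group accessions sharing the same numeric identifier.

    Assumption: GCA_/GCF_ entries with the same identifier are the same assembly.
    """
    groups = defaultdict(list)
    for accession in accessions:
        _, identifier, _ = parse_accession(accession)
        groups[identifier].append(accession)
    return dict(groups)


def pick_representative(accessions):
    """Keep the highest version; on a tie, prefer GCF_ (an arbitrary choice)."""
    def sort_key(accession):
        prefix, _, version = parse_accession(accession)
        return (version, prefix == "GCF")
    return max(accessions, key=sort_key)


if __name__ == "__main__":
    hits = ["GCF_003555505.1", "GCA_003555505.2", "GCA_000000001.1"]
    for identifier, members in group_by_assembly(hits).items():
        print(identifier, members, "->", pick_representative(members))
```

With this rule, the example pair above would collapse to GCA_003555505.2. Whether dropping the older entry is safe depends on the answer to the "is the older version obsolete" question.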