Open ccbaumler opened 4 months ago
hot take: we have to trust that the creation of the sketches was done properly. But, if we allow flexibility in that step (good!) then we suffer from not being able to provide good checks later on (bad, but bearable). So this is a consequence of that flexibility.
The alternative solution here is therefore multifactorial:
One possible solution:
not a bad idea, but (if we do it) I would like to make it optional. Tracking multiple synchronized output files in a workflow is annoying.
anyway, I am not convinced this is a major problem at the moment. I would prefer to figure out which use cases we're actually going to focus on and then realign CLI UX around that (which may well include this issue ;)).
100% agree with you! I am going to start applying this work to colorectal cancer metagenomes, which could point to areas for improvement.
The abundance of a hash in a pangenome sketch is the number of genomes in that lineage that contain the hash.
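To make that definition concrete, here is a minimal sketch in plain Python. It assumes each genome is represented as a set of integer hashes; the function name `pangenome_abundances` and the toy data are hypothetical and are not part of any existing CLI or library API.

```python
from collections import Counter


def pangenome_abundances(genome_hashes):
    """Count, for each hash, how many genomes in one lineage contain it.

    genome_hashes: iterable of hash collections, one per genome (assumed input shape).
    """
    counts = Counter()
    for hashes in genome_hashes:
        # set() ensures a genome contributes at most 1 to each hash's count
        counts.update(set(hashes))
    return counts


# Toy lineage with three genomes (made-up hash values)
lineage_genomes = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
abund = pangenome_abundances(lineage_genomes)
```

Here hash `3` occurs in all three genomes, so its abundance equals the number of genomes in the lineage.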
When we get to creating the rank table, these sketches may be merged. The pangenome element that a hash is designated as is derived from the abundance value for that hash within its lineage. Therefore, we should recalculate the count if more than one lineage is being placed in a rank table...
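A rough sketch of what that recalculation could look like, under two assumptions that are mine and not confirmed in this thread: the lineages being merged contain disjoint sets of genomes (so per-hash counts can simply be added), and the genome count for each lineage is tracked alongside its abundances. The names `merge_lineages` and `genome_fraction` are hypothetical.

```python
from collections import Counter


def merge_lineages(lineages):
    """Merge per-lineage hash abundances.

    lineages: dict of lineage name -> (abundances dict, number of genomes).
    Assumes the lineages share no genomes, so counts are additive.
    """
    merged = Counter()
    total_genomes = 0
    for abundances, n_genomes in lineages.values():
        merged.update(abundances)  # adds counts hash by hash
        total_genomes += n_genomes
    return merged, total_genomes


def genome_fraction(merged, total_genomes):
    """Fraction of all merged genomes that contain each hash."""
    return {h: c / total_genomes for h, c in merged.items()}


# Toy data: hash 10 is in every genome of lineage A, but rare in lineage B
lineages = {
    "A": ({10: 4, 11: 1}, 4),
    "B": ({10: 1, 12: 6}, 6),
}
merged, total = merge_lineages(lineages)
fractions = genome_fraction(merged, total)
```

In this toy case hash `10` is present in 4 of 4 genomes within lineage A, but in only 5 of 10 genomes after merging, so a designation based on the per-lineage value would no longer match the merged rank table.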