dib-lab / sourmash_plugin_pangenomics

tools for sourmash-based pangenome analyses
BSD 3-Clause "New" or "Revised" License

Does rank table need to recalculate hash counts when merging pangenome sketches? #13

Open ccbaumler opened 4 months ago

ccbaumler commented 4 months ago

The abundance of a hash in a pangenome sketch is the number of genomes that contain that hash per lineage.

When we get to creating the rank table, these sketches may be merged. The pangenome element that a hash is designated as is derived from the abundance value for that hash within its lineage. Therefore, we should recalculate the counts if more than one lineage is being placed in a rank table...
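
To make the concern concrete, here is a toy sketch in plain Python. It does not use the sourmash API. The genomes are made-up sets of hash values, and it assumes that merging two sketches simply sums their abundances, which may not match what the plugin actually does.

```python
from collections import Counter


def pangenome_abundances(genomes):
    """Count, for each hash, how many genomes in one lineage contain it."""
    counts = Counter()
    for hashes in genomes:
        counts.update(set(hashes))
    return counts


# Toy data (hypothetical): each set holds the hashes of one genome.
lineage_a = [{1, 2, 3}, {1, 2}, {1, 4}]  # 3 genomes
lineage_b = [{1, 5}, {5, 6}]             # 2 genomes

abund_a = pangenome_abundances(lineage_a)
abund_b = pangenome_abundances(lineage_b)

# Assumption: merging sums the per-lineage abundances.
merged = abund_a + abund_b

# Fraction of genomes containing a hash, per lineage and after merging.
frac_a = abund_a[1] / len(lineage_a)                          # 3 of 3 genomes
frac_b = abund_b[5] / len(lineage_b)                          # 2 of 2 genomes
frac_merged = merged[5] / (len(lineage_a) + len(lineage_b))   # 2 of 5 genomes
```

In this toy case hash 5 is present in every genome of lineage B, but only in 2 of 5 genomes once the lineages are merged, so any designation based on the per-lineage abundance no longer holds for the merged sketch.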

ctb commented 4 months ago

hot take: we have to trust that the creation of the sketches was done properly. But, if we allow flexibility in that step (good!) then we suffer from not being able to provide good checks later on (bad, but bearable). So this is a consequence of that flexibility.

The alternative solution here is therefore multifactorial:

ccbaumler commented 4 months ago

One possible solution:

  1. Use the CSV output produced when making the database to get a total count for each lineage
  2. Normalize the abundances
  3. Artificially inflate the abundances to fit correctly
  4. Celebrate
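
A rough sketch of what steps 1 to 3 could look like, in plain Python rather than the sourmash API. The CSV layout (columns `ident` and `lineage`) and the helper names are hypothetical, and rescaling to the least common multiple of the genome counts is just one possible way to "inflate" the abundances so they stay integers.

```python
import csv
import io
import math
from collections import Counter


def genomes_per_lineage(csv_text):
    """Step 1: count genomes per lineage from a CSV (hypothetical columns)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["lineage"] for row in reader)


def rescale(abund, n_genomes, target):
    """Steps 2 and 3: normalize by genome count, then scale up to `target`.

    `target` must be a multiple of `n_genomes` so the result stays an integer.
    """
    return {h: a * target // n_genomes for h, a in abund.items()}


# Toy CSV (hypothetical layout): one row per genome in the database.
csv_text = "ident,lineage\nG1,A\nG2,A\nG3,A\nG4,B\nG5,B\n"
totals = genomes_per_lineage(csv_text)

# Toy per-lineage abundances: hash -> number of genomes containing it.
abund = {
    "A": {1: 3, 2: 2, 3: 1, 4: 1},
    "B": {1: 1, 5: 2, 6: 1},
}

# Common scale for all lineages so abundances become comparable.
target = math.lcm(*totals.values())
rescaled = {lin: rescale(abund[lin], totals[lin], target) for lin in abund}
```

After rescaling, a hash found in every genome of a lineage has abundance `target` regardless of how many genomes that lineage contains, which is the property a merged rank table would need.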

ctb commented 4 months ago

not a bad idea, but (if we do it) I would like to make it optional. Tracking multiple synchronized output files in a workflow is annoying.

anyway, I am not convinced this is a major problem at the moment. I would prefer to figure out which use cases we're actually going to focus on and then realign CLI UX around that (which may well include this issue ;)).

ccbaumler commented 4 months ago

100% agree with you! I am going to start applying this work to colorectal cancer metagenomes, which could point to areas for improvement.