algbio / ggcat

Compacted and colored de Bruijn graph construction and querying
MIT License

Is there any way to dump the whole colormap to a readable file? #39

fingels closed 6 months ago

fingels commented 7 months ago

Hi,

I am aware of the command line ggcat dump-colors output.fasta.colors.dat, which dumps the colors associated with each individual file to a JSON file; but I was wondering whether there is an option to dump not only the colors associated with each file, but also those associated with the subsets of those colors (the elements of their powerset).

Say, for instance, that some k-mer is seen in files {1,2,4}, and that this subset {1,2,4} is associated with, say, color 8 in the colormap; in the final FASTA file this k-mer would have the header C:8:1. But without querying this k-mer, I cannot know which subset of colors 8 corresponds to.

So basically I am interested in accessing the colormap in plain form. Is this possible?
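
For illustration, the subset indices that would need resolving can be collected from the FASTA headers themselves. The following is a minimal sketch, not part of ggcat: the helper names are hypothetical, the tag shape C:<subset index>:<k-mer count> is taken from the C:8:1 example above, and reading the index as hexadecimal is an assumption that should be checked against the ggcat documentation (the base is exposed as a parameter for that reason).

```python
import re
from collections import Counter

# Assumed tag shape, based on the "C:8:1" example in the question:
# C:<color subset index>:<number of k-mers>.
COLOR_TAG = re.compile(r"\bC:([0-9A-Fa-f]+):([0-9]+)\b")


def subset_indices(header, base=16):
    """Return (subset_index, kmer_count) pairs found in one FASTA header line.

    base=16 is an assumption about how the index is written; pass base=10
    if the indices turn out to be decimal.
    """
    return [(int(idx, base), int(count)) for idx, count in COLOR_TAG.findall(header)]


def collect_subsets(lines, base=16):
    """Tally the number of k-mers per color subset index over all header lines."""
    totals = Counter()
    for line in lines:
        if line.startswith(">"):  # sequence lines are skipped
            for idx, count in subset_indices(line, base):
                totals[idx] += count
    return totals
```

The keys of the resulting tally are exactly the subset indices whose file sets are still unknown, which is the gap the question describes.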

Guilucand commented 6 months ago

Hi, I added an experimental query_colormap API function to do what you requested. The only caveat is that reading a single color set from the colormap is very slow (it can take up to several milliseconds), so it is much better to batch the queries in groups of subsets that have (almost) consecutive indices, especially for large datasets.
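
The batching advice can be sketched independently of ggcat. The helper below is hypothetical and only shows one way to group subset indices into runs of (almost) consecutive values; the signature of query_colormap is not shown in this thread, so the sketch stops before the actual query call, and max_gap is an illustrative tuning knob rather than a ggcat parameter.

```python
def group_into_runs(indices, max_gap=1):
    """Group subset indices into sorted runs.

    Within a run, neighbouring indices differ by at most max_gap, so each
    run covers a nearly contiguous stretch of the colormap. Duplicates are
    dropped before grouping.
    """
    runs = []
    for idx in sorted(set(indices)):
        if runs and idx - runs[-1][-1] <= max_gap:
            runs[-1].append(idx)  # close enough: extend the current run
        else:
            runs.append([idx])  # gap too large: start a new run
    return runs
```

Each run would then be handed to the colormap query as one batch. Dumping the whole colormap is the degenerate case of a single run covering every subset index.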

fingels commented 6 months ago

My intent was to batch the entire colormap (just like a matrix, if you will) so this should do the trick. Thanks!