algbio / ggcat

Compacted and colored de Bruijn graph construction and querying
MIT License
72 stars 10 forks source link

Color set -> color class compression #3

Closed rob-p closed 1 year ago

rob-p commented 1 year ago

Congratulations on the excellent manuscript! I have read it with great excitement and @jamshed will be presenting this paper in our lab's journal club soon. I was also very happy to see that you've chosen to implement ggcat in Rust.

I wanted to make brief mention of something that may be relevant to your color set representation. The concept of collapsing redundant color vectors into a minimal set of unique colors (i.e. color classes) and then representing them by their indices via an extra layer of indirection, is, depending on the specific data, a very powerful idea. This is essentially the same strategy adopted in rainbowfish and, later, mantis, and may be worth mention in your manuscript. Along these lines, I also wanted to convey an additional idea adopted in those tools that you may find useful to compress the color information even further.

In many types of data over which one wants to build a compacted colored dBG, not only do many k-mers have identical color sets (that is, they are equivalent under the color relation of k1 ~ k2 iff colors(k1) = colors(k2)), but the frequency distribution of the color sets themselves is highly skewed. That is, there are a small number of distinct color classes to which many k-mers map, and the distribution of frequencies of color class occurrences is often scale-free. This presents an additional opportunity to compress the color class labels themselves even further by choosing small labels (integers) for color ids that will be used frequently, and choosing larger labels (integers) for color ids that will be used infrequently. Specifically, as you build the information concerning the color sets, it is relatively easy to track the number of times a color ID has been used. Then, IDs can be re-ordered to assign smaller IDs to more frequent color sets and larger IDs to less frequent color sets. While this re-labeling can be done globally, it turns out that, in practice, the majority of the benefit comes from just assigning small IDs to the few most frequent color sets — this can be done at a relatively small cost by simply sampling the color sets for a small fraction of the k-mers. As long as the sampling is done randomly and unbiasedly, you'll expect to see all/most of the heavy hitters (probably many times) in your random sample, and you can assign IDs based on the frequency in this random sample (this is akin to the strategy we adopt to assign color IDs in mantis).

Anyway, congrats again on the very cool paper, and thanks for helping to drive rust adoption in bioinformatics! We look forward to exploring and using your tool.

jnalanko commented 1 year ago

Very nice paper indeed. For the record, "color class compression" is also used in my tool Themisto (supplement 1 of this paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8743562/).

Guilucand commented 1 year ago

Hi, thanks for the useful tips! I'll look at rainbowfish and mantis, also citing them in the paper. I'll investigate the possibility of adopting the optimized color index encoding as you proposed. Andrea