daenuprobst / molzip

The gzip classification method implemented for molecule classification.
MIT License
54 stars 9 forks source link

Relationship between compression and Lee Cronin's "Assembly Theory" #14

Closed jschrier closed 1 year ago

jschrier commented 1 year ago

I came across your Twitter post today and had some notes kicking around on my laptop that might be relevant background literature to your project...and perhaps a connection to a broader literature.

tl;dr "Assembly Theory" is just Lemepl-Ziv-Welchcomplexity / compression.

  1. Cronin and co-workers have put forward a Molecular Assembly method for describing molecular complexity (Nat Comm 2020 10.1038/s41467-021-23258-x )
    1. They define an "assembly pathway" arxiv:1907.04649 arXiv:1804.06972 as "sequences of joining operations, that start with basic building blocks (in this case bonds) and end with a final product. In these sequences, sub-units generated within the sequence can combine with other basic or compound sub-units later in the sequence, to recursively generate larger structures ." [Cronin Nat Comm 2020] and a "molecular assembly number" (MA) as the length of the shortest of these pathways.
  2. This is equivalent to the Lempel-Ziv complexity measure Lempel Ziv '76], in which a source string is decomposed to a set of patterns which suffice to reconstruct the source string by copying and insertion operations.
    1. The patterns (LZ vocabulary) are the "assembly pool" in the terminology of Cronin & co. https://dx.doi.org/10.26434/chemrxiv.13476597.v2
    2. Cronin & co. are aware of the relationship to the Lempel-Ziv-Welch compression algorithm (it is alluded to in Ref. https://arxiv.org/abs/1907.04649 as its relationship to the Kolmogorov complexity)
    3. Cronin & co. (correctly) reject Kolmogorov complexity as being uncomputable.
  3. Many string representations of molecules exist (e.g., SELFIES, SMILES, InCHI)—A challenge for SELFIES and SMILES is the lack of single intrinsic canonical order; in other words, there are multiple possible string representations, and in the computation of MA one is free to choose the one that leads to the shortest possible LZ complexity.
janweinreich commented 1 year ago

I have become involved in this work and completely missed this comment, sorry!

Thank you this is a very nice analogy to connect and motivate why Gzip indeed works well for chemical problems. It would also be super interesting to correlate the MA number versus the compression density we can achieve with Gzip.

Do you think one could also build an ML representation that only looks at the bonds identified with the MA algorithm and checks for redundancy - similar to the patterns of the LZ vocabulary? So essentially performing the compression manually, only permitting a vocabulary inspired by chemistry ("bonds" or higher order "fragments")?

Because would interesting to see how much the patterns identified by Gzip (which might not be "chemistry intuitive") add to the accuracy.

jschrier commented 1 year ago

All good questions; no obvious answer is known to me....

janweinreich commented 1 year ago

Yes indeed there is a nice linear correlation between molecular assembly number and the size of the gzip compressed SMILES string :) - even thought it looks a little noisy the trend is there :)

ma_vs_encoded_size

(took the MA number data from https://www.nature.com/articles/s41467-021-23258-x#code-availability figure 2 )