jschrier closed this 1 year ago
I have become involved in this work and completely missed this comment, sorry!
Thank you, this is a very nice analogy that helps motivate why Gzip indeed works well for chemical problems. It would also be super interesting to correlate the MA number with the compression density we can achieve with Gzip.
Do you think one could also build an ML representation that only looks at the bonds identified with the MA algorithm and checks for redundancy - similar to the patterns of the LZ vocabulary? So essentially performing the compression manually, only permitting a vocabulary inspired by chemistry ("bonds" or higher order "fragments")?
It would be interesting to see how much the patterns identified by Gzip (which might not be "chemistry intuitive") add to the accuracy.
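For anyone who wants to look at what an "LZ vocabulary" actually contains for a given SMILES string, here is a minimal sketch. It uses a plain LZ78 parse as an illustrative stand-in, because LZ78 builds an explicit phrase dictionary; Gzip itself uses DEFLATE (LZ77 plus Huffman coding), so this is not what Gzip does internally. The function name `lz78_phrases` is made up for this example.

```python
def lz78_phrases(text: str) -> list[str]:
    """Return the phrases of an LZ78 parse of `text`, in order of discovery.

    Each phrase is the shortest prefix of the remaining input that has not
    been seen before. Illustrative only: not Gzip's actual algorithm.
    """
    seen = set()
    phrases = []
    current = ""
    for ch in text:
        current += ch
        if current not in seen:
            # New phrase: record it and start over from the next character.
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        # Leftover input that matches an existing dictionary entry.
        phrases.append(current)
    return phrases


if __name__ == "__main__":
    # Benzene as an aromatic SMILES string.
    print(lz78_phrases("c1ccccc1"))
```

A chemistry-restricted variant could swap the character-level dictionary for one that only admits bonds or fragments, and then compare its phrases against the ones found here.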
All good questions; no obvious answer is known to me....
Yes, indeed there is a nice linear correlation between molecular assembly number and the size of the gzip-compressed SMILES string :) Even though it looks a little noisy, the trend is there :)
(I took the MA number data from figure 2 of https://www.nature.com/articles/s41467-021-23258-x#code-availability )
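A rough sketch of how such a comparison could be set up, in case someone wants to reproduce it. This is not the code used for the plot above; the helper names (`gzip_size`, `pearson_r`, `correlate_ma_with_gzip`) are hypothetical, and the SMILES strings and MA numbers have to be supplied from the dataset linked above.

```python
import gzip
from math import sqrt


def gzip_size(smiles: str) -> int:
    """Length in bytes of the gzip-compressed SMILES string."""
    return len(gzip.compress(smiles.encode("utf-8")))


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def correlate_ma_with_gzip(smiles_list, ma_numbers) -> float:
    """Correlate MA numbers with gzip-compressed SMILES sizes."""
    sizes = [gzip_size(s) for s in smiles_list]
    return pearson_r(ma_numbers, sizes)
```

The gzip container adds a fixed header and trailer to every string, which shifts all sizes by the same constant and so leaves the correlation coefficient unchanged.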
I came across your Twitter post today and had some notes kicking around on my laptop that might be relevant background literature to your project...and perhaps a connection to a broader literature.
tl;dr: "Assembly Theory" is just Lempel-Ziv-Welch complexity / compression.