daenuprobst / molzip

The gzip classification method implemented for molecule classification.
MIT License
53 stars 10 forks

Why not try regression too? #1

Closed (janweinreich closed 1 year ago)

janweinreich commented 1 year ago

Hey, I was wondering whether we can use the same code to perform regression:

1) use the distances from gzip to create a distance matrix
2) perform kernel ridge regression, similar to https://www.qmlcode.org/index.html
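The two steps above could be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the molzip or QML implementation: the function names (`ncd`, `kernel_matrix`, `krr_fit`, `krr_predict`), the Gaussian kernel applied to the distances, and the values of `sigma` and `lam` are all assumptions, and the SMILES strings and target values are placeholders. Note that the normalized compression distance is not a true metric, so the resulting kernel is not guaranteed to be positive semi-definite; the regularizer `lam` helps keep the linear system solvable.

```python
import gzip
import math


def clen(text):
    """Length in bytes of the gzip-compressed string."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a, b):
    """Normalized compression distance between two strings."""
    ca, cb = clen(a), clen(b)
    cab = clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def kernel_matrix(xs, ys, sigma):
    """Gaussian kernel built from pairwise NCD values (an assumed choice)."""
    return [[math.exp(-ncd(x, y) ** 2 / (2 * sigma ** 2)) for y in ys] for x in xs]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def krr_fit(train_x, train_y, sigma=0.5, lam=1e-3):
    """Return regression coefficients alpha from (K + lam * I) alpha = y."""
    K = kernel_matrix(train_x, train_x, sigma)
    for i in range(len(K)):
        K[i][i] += lam
    return solve(K, train_y)


def krr_predict(train_x, alphas, test_x, sigma=0.5):
    """Predict targets as kernel-weighted sums over the training set."""
    Ks = kernel_matrix(test_x, train_x, sigma)
    return [sum(k * a for k, a in zip(row, alphas)) for row in Ks]


# Placeholder inputs for illustration only; the targets are not real data.
train_smiles = ["CCO", "CCN", "CCCC", "c1ccccc1"]
train_y = [0.1, 0.2, 0.3, 0.4]
alphas = krr_fit(train_smiles, train_y)
preds = krr_predict(train_smiles, alphas, ["CCCO"])
```

In practice `sigma` and `lam` would need to be tuned, for example by cross-validation, and a linear algebra library would replace the hand-written solver.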

If it works for classification, I expect it to work for regression as well, though that of course depends on the target quantity. For some quantities, such as atomization energies, I would suggest first transforming the molecule into a representation vector such as the Coulomb matrix, and then compressing the representation vectors pairwise.

I can test and contribute this if there is interest. I would also be interested in collaborating!

Best, Jan Weinreich PostDoc in LMCD @ EPFL

daenuprobst commented 1 year ago

Hey Jan... I just saw your message now. I implemented a fairly naive kNN regression (just taking the mean of the k nearest neighbours). It does an ok job compared to fairly basic baselines. See the results in the updated readme.
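For readers following along, a naive kNN regression of this kind might look like the sketch below. It is not the code from the repository: `knn_regress` and its helpers are hypothetical names, the distance is the usual gzip-based normalized compression distance, and the data are placeholders.

```python
import gzip


def clen(text):
    """Length in bytes of the gzip-compressed string."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a, b):
    """Normalized compression distance between two strings."""
    ca, cb = clen(a), clen(b)
    return (clen(a + " " + b) - min(ca, cb)) / max(ca, cb)


def knn_regress(train_x, train_y, query, k=3):
    """Predict the mean target of the k training strings closest to query."""
    ranked = sorted(range(len(train_x)), key=lambda i: ncd(query, train_x[i]))
    nearest = ranked[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


# Placeholder inputs for illustration only; the targets are not real data.
train_smiles = ["CCO", "CCN", "CCCC", "c1ccccc1", "CC(=O)O"]
train_y = [1.0, 2.0, 3.0, 4.0, 5.0]
prediction = knn_regress(train_smiles, train_y, "CCCO", k=3)
```

A distance-weighted mean would be an obvious variation, at the cost of one more choice to tune.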

I'm very open to collaborating, so feel free to add whatever you think would be cool :-).

janweinreich commented 1 year ago

Thanks! Looks great. I will do some more careful benchmarking with the kernel ridge regression example too (see pull request). It would also be interesting to investigate whether gzip lowers the offset of the learning curve or simply gives it a steeper slope.

Typically, a lower offset means that more prior information (bias) is built into the model architecture (e.g. through transfer learning or better representations). So studying learning curves might give us a little more insight into what gzip does!
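One way to separate offset from slope, assuming the test error roughly follows a power law in the training set size N (error ≈ a * N^(-b)), is a straight-line fit on a log-log scale: the intercept is log(a) and the slope is -b. The sketch below uses a hypothetical helper `fit_learning_curve`, and the errors are synthetic values generated from a formula, not benchmark results.

```python
import math


def fit_learning_curve(sizes, errors):
    """Least-squares line through (log N, log error).

    Returns (offset, slope), where offset is the intercept log(a)
    and slope is the exponent of N (negative when the error decreases).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    offset = mean_y - slope * mean_x
    return offset, slope


# Synthetic errors from error = 2.0 * N^(-0.5), for illustration only.
sizes = [100, 200, 400, 800]
errors = [2.0 * n ** -0.5 for n in sizes]
offset, slope = fit_learning_curve(sizes, errors)
```

Comparing the fitted offset and slope of a gzip-based model against a baseline on the same training sizes would show which of the two actually differs.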