daenuprobst / molzip

The gzip classification method implemented for molecule classification.
MIT License

added kernel ridge regression #7

Closed janweinreich closed 1 year ago

janweinreich commented 1 year ago

Added functions for kernel ridge regression with a Laplacian kernel in gzip_regressor.py:

compute_pairwise_ncd: computes the normalized compression distance (NCD)

compute_ncd: runs the NCD computation with multiprocessing

train_kernel_ridge_regression: trains the model

predict_kernel_ridge_regression1: predicts with the trained model
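
For readers unfamiliar with the approach, here is a minimal sketch of kernel ridge regression on top of a gzip-based NCD with a Laplacian kernel. It is not the code from gzip_regressor.py: the function names (`ncd`, `laplacian_kernel`, `solve`, `train_krr`, `predict_krr`), the space-joined concatenation, the defaults for `sigma` and `lam`, and the toy target values are all assumptions made for illustration. It uses only the standard library and skips multiprocessing; a real implementation would more likely use NumPy for the linear solve.

```python
import gzip
import math


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings, using gzip."""
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))  # assumed join convention
    return (cxy - min(cx, cy)) / max(cx, cy)


def laplacian_kernel(a, b, sigma=1.0):
    """K[i][j] = exp(-NCD(a[i], b[j]) / sigma)."""
    return [[math.exp(-ncd(s, t) / sigma) for t in b] for s in a]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def train_krr(train_smiles, y, sigma=1.0, lam=1e-3):
    """Return regression weights alpha solving (K + lam * I) alpha = y."""
    K = laplacian_kernel(train_smiles, train_smiles, sigma)
    for i in range(len(K)):
        K[i][i] += lam  # ridge regularization on the diagonal
    return solve(K, list(y))


def predict_krr(test_smiles, train_smiles, alpha, sigma=1.0):
    """Predict as the kernel-weighted sum over training points."""
    K = laplacian_kernel(test_smiles, train_smiles, sigma)
    return [sum(k * a for k, a in zip(row, alpha)) for row in K]


if __name__ == "__main__":
    # Toy SMILES with made-up target values, purely illustrative.
    train = ["CCO", "CCCCO", "c1ccccc1", "CC(=O)O", "CCN(CC)CC"]
    targets = [0.1, 0.4, 1.2, -0.3, 0.8]
    alpha = train_krr(train, targets, sigma=1.0, lam=1e-3)
    print(predict_krr(["CCCO", "c1ccccc1O"], train, alpha, sigma=1.0))
```

Note that the NCD is not exactly symmetric and NCD(x, x) is not exactly zero, so the resulting kernel matrix is only approximately a proper Laplacian kernel; the ridge term `lam` helps keep the linear system solvable.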

On the provided datasets it seems to perform about as well as KNN, but I have not yet carefully tried different hyperparameters.

Planning to test this on larger datasets such as QM9, as well as on other types of molecular representations (e.g. binned numerical representations, or simply RDKit fingerprints, instead of string-based representations). For the datasets benchmarked here I expect RDKit FP to be decent!
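
One possible reading of "binned numerical representations" is to discretize a numeric feature vector into bin indices and serialize those as text, so that a compressor can be applied to it like a string. The sketch below assumes that reading; the helper name `bin_vector`, the fixed `lo`/`hi` range and the number of bins are hypothetical choices, not something taken from the repository.

```python
def bin_vector(values, lo, hi, n_bins=16):
    """Map each float to a bin index in [0, n_bins - 1] and join as text."""
    width = (hi - lo) / n_bins
    tokens = []
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
        tokens.append(str(idx))
    return " ".join(tokens)


if __name__ == "__main__":
    # Hypothetical descriptor values scaled to roughly [0, 1].
    print(bin_vector([0.03, 0.48, 0.97, 0.51], lo=0.0, hi=1.0, n_bins=8))
```

The resulting string can then be passed to the same compression-based distance as a SMILES string; coarser bins make similar vectors share more tokens, at the cost of resolution.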

janweinreich commented 1 year ago

If you wish to include this, we should introduce separate regression tasks so that the KNN regressor and kernel ridge regression do not conflict.

daenuprobst commented 1 year ago

Hey... Looks neat (I just merged it). I think the trick to substantially improve performance will be in the encoding/representation of the input. But I still had no luck in beating plain old non-preprocessed SMILES (except for the kNN regression where there are slight improvements with tokenization).