coleygroup / molpal

active learning for accelerated high-throughput virtual screening
MIT License

[QUESTION]: Small dataset for training #49

Open varys50 opened 6 months ago

varys50 commented 6 months ago

What are you trying to do? My training set is small (19 observations) and my lookup pool contains only 1500 structures. How should I set the parameters in the config file to account for this?

davidegraff commented 6 months ago

In some small, unpublished experiments, we found that a GP regressor with a Matérn-5/2 kernel on Morgan/pair fingerprints works pretty well in very low-sample regimes. We also found that a Tanimoto kernel worked even better; it's not in the repo, so you'd have to implement it yourself. One limitation of the GP (and the reason we never included its results in the original paper) is that it doesn't scale to very large pools without significant engineering and some approximations, but 1500 structures is more than small enough to generate predictions quickly. At that scale, I'd also recommend going with batch_size=1.
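For anyone wanting to try the Tanimoto-kernel GP suggested above, here is a minimal NumPy sketch of the idea. It is not molpal code and does not use molpal's API: the function names `tanimoto_kernel` and `gp_predict`, the `noise` value, and the random bit vectors standing in for fingerprints are all assumptions made for illustration. In practice the fingerprints would come from a cheminformatics toolkit, and the sketch assumes higher scores are better.

```python
import numpy as np


def tanimoto_kernel(A, B):
    """Tanimoto similarity between the rows of A (n, d) and B (m, d).

    Rows are expected to be non-negative fingerprint vectors (bits or counts).
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    dot = A @ B.T
    a2 = (A * A).sum(axis=1)[:, None]
    b2 = (B * B).sum(axis=1)[None, :]
    denom = a2 + b2 - dot
    # Two all-zero fingerprints have denom == 0; treat them as identical.
    safe = np.where(denom > 0, denom, 1.0)
    return np.where(denom > 0, dot / safe, 1.0)


def gp_predict(X_train, y_train, X_pool, noise=1e-2):
    """Exact GP regression posterior mean and variance with a Tanimoto kernel.

    `noise` is a hypothetical observation-noise variance, not a tuned value.
    """
    y_train = np.asarray(y_train, dtype=float)
    y_mean = y_train.mean()  # constant mean function
    K = tanimoto_kernel(X_train, X_train) + noise * np.eye(len(y_train))
    K_s = tanimoto_kernel(X_train, X_pool)  # shape (n_train, n_pool)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    mean = K_s.T @ alpha + y_mean
    v = np.linalg.solve(L, K_s)
    # The Tanimoto kernel has k(x, x) = 1, so the prior variance is 1.
    var = np.clip(1.0 - (v * v).sum(axis=0), 0.0, None)
    return mean, var


# Placeholder data at the scale discussed here: 19 labelled points, 1500 in the pool.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(19, 2048))
y_train = rng.normal(size=19)
X_pool = rng.integers(0, 2, size=(1500, 2048))

mean, var = gp_predict(X_train, y_train, X_pool)
next_idx = int(np.argmax(mean))  # greedy choice of a single candidate (batch size 1)
```

With only 19 training points the Cholesky factorization is trivial, and scoring all 1500 pool structures is a single small matrix solve, which is why the exact GP is practical at this scale. The variance returned alongside the mean could feed an uncertainty-aware acquisition function instead of the greedy pick shown here.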

varys50 commented 6 months ago

Thanks! Would I need to modify the args.py file to allow for different kernels?