automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

Do you want some datasets from chemistry for regression modeling? #950

Open UnixJunkie opened 3 years ago

UnixJunkie commented 3 years ago

Hi automl-hackers,

I feel this package could make significant progress on the regression front.

I can send you some real-world chemistry datasets, if you are interested.

Also, I wonder whether auto-sklearn could do better if an expert provided a reasonably good baseline model. That way, your optimization procedure would know what performance it has to beat.

Regards, F.

mfeurer commented 3 years ago

Hi Francois,

Yes, that sounds great. What format would those datasets be in? And would it be possible to upload them to a public system such as openml.org?

Regarding the baseline model, it would be good to know what it is, so we can see whether it is in our configuration space and whether it can be found at all.

If it's easier I'd also be happy if you drop me an email about this topic.

Cheers, Matthias

UnixJunkie commented 3 years ago

> Hi Francois,
>
> Yes, that sounds great. What format would those datasets be in?

That would be a CSV file, i.e., I will take care of the encoding/vectorization of the molecules myself so that people can concentrate on the ML.

> And would it be possible to upload them to a public system such as openml.org?

Maybe, but that wouldn't be my priority.

> Regarding the baseline model, it would be good to know what it is, so we can see whether it is in our configuration space and whether it can be found at all.

Usually in my field, a RF with 100 trees makes a good baseline regressor, if the dataset is amenable to any regression modeling.

> If it's easier I'd also be happy if you drop me an email about this topic.

Thanks!

> Cheers, Matthias
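A baseline of this kind can be sketched with scikit-learn as follows. This is only an illustration under assumptions: the data is synthetic (`make_regression` stands in for the real CSV of molecular features), and the 5-fold cross-validation and fixed random seed are choices made for the example, not something stated in the thread.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption); a real run would load the feature CSV.
X, y = make_regression(n_samples=300, n_features=20, n_informative=10,
                       noise=10.0, random_state=0)

# Baseline: random forest with 100 trees, all other settings left at defaults.
baseline = RandomForestRegressor(n_estimators=100, random_state=0)

# Cross-validated R2: the score an automated search would have to beat.
scores = cross_val_score(baseline, X, y, cv=5, scoring="r2")
baseline_r2 = float(np.mean(scores))
print(f"baseline R2: {baseline_r2:.3f}")
```

The mean cross-validated score is then the reference number to compare any AutoML result against.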

UnixJunkie commented 3 years ago

I played with auto-sklearn and pyCaret yesterday, both for regression. auto-sklearn gave an R2 near 0 (i.e., it really did not work). pyCaret reached R2 ~= 0.75, which is what I can reach with a hand-tuned model. However, pyCaret was horribly slow even though my dataset is not that high-dimensional (it ran overnight before it started outputting any results). The dataset I was playing with has 12,000 entries and 4,000 features per observation. OK, maybe that is a little high-dimensional, but the representation is sparse, so I don't feel too bad about it.
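For readers unfamiliar with the metric: an R2 near 0 means the model does no better than always predicting the mean of the targets. A minimal sketch with made-up numbers (the values below are illustrative, not taken from the dataset discussed here):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

# Always predicting the mean (2.5) gives R2 = 0.
r2_mean = r2_score(y_true, [2.5, 2.5, 2.5, 2.5])

# Predictions close to the targets give an R2 close to 1.
r2_good = r2_score(y_true, [1.1, 1.9, 3.2, 3.8])

print(r2_mean, round(r2_good, 2))
```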

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs for the next 7 days. Thank you for your contributions.

UnixJunkie commented 3 years ago

Hello, the dataset has been released: https://zenodo.org/record/4588239 It covers many proteins, each with many molecules and their docking scores. The goal of the task is to predict ligand docking scores. The baseline model was linearSVR with only C optimized. It performed well in most cases.
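A baseline along these lines could look roughly like the sketch below. It assumes scikit-learn's `LinearSVR` as a stand-in for whatever linear SVR implementation was actually used, and the grid of C values, the 5-fold cross-validation, and the synthetic sparse count data are all assumptions made for the example.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVR

rng = np.random.default_rng(0)

# Synthetic sparse count features (assumption), standing in for real fingerprints.
counts = rng.poisson(0.1, size=(400, 200)).astype(float)
X = sparse.csr_matrix(counts)

# Synthetic target: a linear function of the counts plus a little noise.
w = rng.normal(size=200)
y = counts @ w + rng.normal(scale=0.1, size=400)

# Tune only C; every other hyperparameter stays at its default.
search = GridSearchCV(
    LinearSVR(max_iter=10000, random_state=0),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="r2",
    cv=5,
)
search.fit(X, y)
best_C = search.best_params_["C"]
print(best_C, search.best_score_)
```

Tuning a single hyperparameter keeps the baseline cheap to reproduce, which matters when it has to be rerun for many protein targets.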

adfindlater commented 3 years ago

@UnixJunkie

> encoding/vectorization of molecules

I'd love to learn more about how this is done. When I think of molecular features, I think of multipole moments, graph/connectivity features, etc., but I've never really looked into it. Could you be so kind as to recommend some literature?

UnixJunkie commented 3 years ago

@adfindlater in the chemoinformatics literature, look for the keywords "chemical fingerprint", "molecular fingerprint", or even "molecular descriptors". Popular fingerprints these days are ECFP4 and ECFP6 (available in rdkit). For the dataset I just pointed to, I used unfolded counted atom pairs (a sparse vector of positive integers). They were computed with this software: https://github.com/UnixJunkie/molenc
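To illustrate what "unfolded counted" means: each distinct feature gets its own column (no hashing or folding into a fixed-length vector), and the stored value is the number of times the feature occurs in the molecule. The sketch below is a toy illustration only. The feature strings are made up, `encode_unfolded_counted` is a hypothetical helper, and none of this reflects molenc's actual code or file format.

```python
from collections import Counter

def encode_unfolded_counted(feature_lists):
    """Map per-molecule feature lists to sparse {column: count} rows.

    Hypothetical helper: the vocabulary grows with every new feature,
    so nothing is folded into a fixed number of bits.
    """
    vocabulary = {}
    rows = []
    for features in feature_lists:
        row = {}
        for feature, count in Counter(features).items():
            # Assign the next free column to a feature seen for the first time.
            column = vocabulary.setdefault(feature, len(vocabulary))
            row[column] = count
        rows.append(row)
    return vocabulary, rows

# Made-up "atom type - distance - atom type" strings, one list per molecule.
molecules = [
    ["C-1-O", "C-2-N", "C-1-O"],
    ["C-2-N", "N-3-O"],
]
vocabulary, rows = encode_unfolded_counted(molecules)
print(vocabulary)  # {'C-1-O': 0, 'C-2-N': 1, 'N-3-O': 2}
print(rows)        # [{0: 2, 1: 1}, {1: 1, 2: 1}]
```

Rows in this form convert directly into a sparse matrix, which is why a dataset with thousands of columns can still be cheap to store.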

Some papers: