dataset selection - Githubissues

yuanqing-wang commented 4 years ago

We're using the ESOL dataset to start with. Let's discuss which datasets to use here.

Ideally, we would want our datasets to:

be regression tasks, for now?
depend solely on the graph (so I would vote against something like QM9 and friends, where geometry complicates things)
enable us to have out-of-distribution data

karalets commented 4 years ago

Thank you for starting this issue!

In particular, I am interested in the following few details.

I assume we are currently using ESOL for purely supervised learning on measurements for particular graphs. Can we discuss training and test set sizes, etc.?
What 'background' datasets would we consider using if we were to try semi-supervised learning which partially informs the graph more than the ESOL training set?
In the Cambridge paper they also have a foreground task and background datasets. Can we discuss their datasets as well?
In the task we care about primarily down the line, we anticipate being in a regime where we have few measurements and want to trade off cost of generating a measurement with information theoretical quantities. Can we make a specific pitch for a first loop we could have here of a dataset we could use to consider having some given measurements and doing active learning to pick molecules to get better at held out data? What background data would be relevant here?

I would like to be extremely specific here so we can plan an experiment section for a V0 of a tech report to guide our current explorations closely to that loop.

Many thanks, @yuanqing-wang !

yuanqing-wang commented 4 years ago

I assume we are currently using ESOL for purely supervised learning on measurements for particular graphs. Can we discuss training and test set sizes, etc.?

ESOL has 1128 molecular graphs with average number of nodes around 20.

What 'background' datasets would we consider using if we were to try semi-supervised learning which partially informs the graph more than the ESOL training set?

We can use ZINC or [Enamine Real](https://enamine.net/library-synthesis/real-compounds/real-database), or a subset thereof, to represent the synthesizable space of (druglike) organic small molecules.

In the Cambridge paper they also have a foreground task and background datasets. Can we discuss their datasets as well?

They used: FreeSolv, Melting, ESOL, CatS, Malaria, p450. Shouldn't be hard to add APIs to grab and import these.

In the task we care about primarily down the line, we anticipate being in a regime where we have few measurements and want to trade off cost of generating a measurement with information theoretical quantities. Can we make a specific pitch for a first loop we could have here of a dataset we could use to consider having some given measurements and doing active learning to pick molecules to get better at held out data? What background data would be relevant here?

To keep comparisons fair and simple, I'd suggest partitioning a single dataset into foreground and background. Does that sound reasonable?
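A minimal sketch of what such a partition could look like, assuming the dataset is a list of (graph, measurement) pairs. The function name `split_foreground_background`, the 10% foreground fraction, and the toy molecule IDs are all placeholders for illustration, not anything from the repo:

```python
import random

def split_foreground_background(items, foreground_fraction=0.1, seed=0):
    """Shuffle one dataset and split it into a labeled foreground
    and an unlabeled background (measurements dropped)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_fg = max(1, round(foreground_fraction * len(shuffled)))
    foreground = shuffled[:n_fg]                        # keep (graph, measurement)
    background = [graph for graph, _ in shuffled[n_fg:]]  # keep graph only
    return foreground, background

# Toy stand-in with the same size as ESOL (1128 entries); not real data.
data = [(f"mol_{i}", float(i)) for i in range(1128)]
fg, bg = split_foreground_background(data, foreground_fraction=0.1)
print(len(fg), len(bg))
```

Fixing the seed keeps the partition reproducible across runs, which matters if different models are to be compared on the same foreground set.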

karalets commented 4 years ago

I assume we are currently using ESOL for purely supervised learning on measurements for particular graphs. Can we discuss training and test set sizes, etc.?

ESOL has 1128 molecular graphs with average number of nodes around 20.

Is it that small? 1128 samples is not nothing, but it is not 'much' either. How much graph-structure-only data can we get from other related datasets?

What 'background' datasets would we consider using if we were to try semi-supervised learning which partially informs the graph more than the ESOL training set?

We can use ZINC or [Enamine Real](https://enamine.net/library-synthesis/real-compounds/real-database), or a subset thereof, to represent the synthesizable space of (druglike) organic small molecules.

As asked above, how big would this be?

In the Cambridge paper they also have a foreground task and background datasets. Can we discuss their datasets as well?

They used: FreeSolv, Melting, ESOL, CatS, Malaria, p450. Shouldn't be hard to add APIs to grab and import these.

Great. Sizes? Relevance to tasks we may want to solve? Ideally we would build towards a pipeline that has some relevance for the COVID tasks John suggested in Slack a while ago.

In the task we care about primarily down the line, we anticipate being in a regime where we have few measurements and want to trade off cost of generating a measurement with information theoretical quantities. Can we make a specific pitch for a first loop we could have here of a dataset we could use to consider having some given measurements and doing active learning to pick molecules to get better at held out data? What background data would be relevant here?

To keep comparisons fair and simple, I'd suggest partitioning a single dataset into foreground and background. Does that sound reasonable?

The problem is that none of this will be 'out of distribution', and we don't really know whether this is fair 'background data', since the Cambridge paper discussed that they needed to stratify the unsupervised data to get good representations.

This is going to be a major part of the experimental design here. For now, we can set up an experimental loop that has flags for what all these objects are and passes the corresponding dataloaders accordingly, so that we can point them at other datasets later.
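Roughly, the flags could look like the sketch below. Everything here is hypothetical: the `ExperimentConfig` fields, the `REGISTRY` of loaders, and the placeholder sizes (the 5000-entry "zinc_subset" is an arbitrary stand-in) are assumptions for illustration, not existing code:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Loader = Callable[[], List[str]]

# Hypothetical registry: dataset name -> loader.
# The loaders return placeholder IDs, not real molecules.
REGISTRY: Dict[str, Loader] = {
    "esol": lambda: [f"esol_{i}" for i in range(1128)],
    "zinc_subset": lambda: [f"zinc_{i}" for i in range(5000)],
}

@dataclass
class ExperimentConfig:
    foreground: str = "esol"           # labeled (graph, measurement) data
    background: Optional[str] = None   # unlabeled graphs for semi-supervision
    held_out: Optional[str] = None     # out-of-distribution evaluation set

def build_loaders(cfg: ExperimentConfig) -> Dict[str, List[str]]:
    """Resolve each configured role to its data; unset roles are skipped."""
    roles = {
        "foreground": cfg.foreground,
        "background": cfg.background,
        "held_out": cfg.held_out,
    }
    return {role: REGISTRY[name]() for role, name in roles.items() if name is not None}

loaders = build_loaders(ExperimentConfig(foreground="esol", background="zinc_subset"))
print({role: len(items) for role, items in loaders.items()})
```

Swapping a dataset then only means changing one string in the config, while the training loop itself stays untouched.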

But I would really like to create a real full example of datasets that we would consider publishable material that we run things on now.

yuanqing-wang commented 4 years ago

Is that small?

Welcome to the world of molecular machine learning. The rest of the datasets they used are not dramatically larger either: FreeSolv has about 640 data points. The others look like property names rather than specific dataset names, and their sizes vary depending on where you get them.

But this is generally true: datasets of (data, measurement) pairs in molecular ML are either small or unreliable. Each entry costs money and time. If you have enough money and time, you're probably a pharma company and therefore wouldn't be excited about the idea of sharing data.

The exceptions are the QM9 dataset and friends, which are quantum physical data, but they depend (to varying extents) on the geometry of the graph rather than the topology alone.

yuanqing-wang commented 4 years ago

Ways to provide out-of-distribution data: we can partition the datasets by the time the compound was developed, the scaffold it contains, etc.

Like they did here:

PotentialNet for Molecular Property Prediction https://doi.org/10.1021/acscentsci.8b00507
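A rough sketch of a scaffold-style split, assuming each molecule already has a group key (for example a scaffold string computed elsewhere). The helper name `group_split` and the toy scaffold labels are made up for illustration, and this is a generic grouped split rather than the exact procedure from the paper above. Whole groups go to either train or test, so no scaffold appears on both sides:

```python
from collections import defaultdict

def group_split(groups, test_fraction=0.2):
    """Assign whole groups to train or test.
    Larger groups fill the training set first; the rest go to test."""
    buckets = defaultdict(list)
    for idx, key in enumerate(groups):
        buckets[key].append(idx)
    ordered = sorted(buckets.values(), key=len, reverse=True)
    n_train_target = (1 - test_fraction) * len(groups)
    train, test = [], []
    for members in ordered:
        if len(train) + len(members) <= n_train_target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test

# Toy scaffold keys, one per molecule (hypothetical labels).
scaffolds = ["benzene"] * 5 + ["pyridine"] * 3 + ["indole"] + ["furan"]
train, test = group_split(scaffolds, test_fraction=0.2)
print(sorted({scaffolds[i] for i in train}), sorted({scaffolds[i] for i in test}))
```

For a time-based split, one would instead sort the molecules by date and hold out the most recent ones.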

karalets commented 4 years ago

Cool. I am sure John can add more color here for variants we should care about, but I think this provides enough background information to get started (or keep working) on the experiments with ESOL.

yuanqing-wang commented 4 years ago

Speaking of datasets @jchodera may like...

I guess it would be cool, or at least topical, to use the data harvested in the COVID Moonshot project.

https://postera.ai/covid/activity_data

It's nice that

it's 370 molecules now and counting
measurements come with error bars

but not all molecules have the same type of measurements.
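One way to deal with that would be to pick one measurement type at a time and keep only the molecules that have it, together with their error bars. A minimal sketch, where the record layout, the assay names, and all values are invented placeholders (not the actual format of the Moonshot activity data):

```python
# Hypothetical records: each molecule maps assay name -> (value, standard error).
records = [
    {"smiles": "CCO", "measurements": {"assay_a": (5.1, 0.2)}},
    {"smiles": "c1ccccc1", "measurements": {"assay_a": (6.3, 0.4), "assay_b": (1.2, 0.1)}},
    {"smiles": "CC(=O)O", "measurements": {"assay_b": (0.8, 0.3)}},
]

def select_assay(records, assay):
    """Return (smiles, value, stderr) for molecules that have the requested assay."""
    return [
        (r["smiles"], *r["measurements"][assay])
        for r in records
        if assay in r["measurements"]
    ]

subset = select_assay(records, "assay_a")
print(subset)
```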

karalets commented 4 years ago

Great, this could be useful.

choderalab / pinot

dataset selection #29