choderalab / pinot

Probabilistic Inference for NOvel Therapeutics
MIT License

Overall loop for training a deep net for molecules here #3

Open karalets opened 4 years ago

karalets commented 4 years ago

We have two desiderata:

  1. We want to be able to learn a network which regresses to measurements given a structure as input.
  2. We may want to pretrain parts of that network (i.e. the molecular representation part) with existing molecular data in order to get some knowledge into the model about what molecular structures exist.

We can decompose those two tasks as follows:

We want a representation model P(h|x) that predicts hidden features h from a molecular graph x; ideally this would be part of a joint density P(h, x) that supports arbitrary conditioning.

We furthermore want a model of the measurements m we care about given a molecular representation h, expressed as P(m|h). In simple regression this could be a probabilistic linear model on top of the outputs of the representation model P(h|x).

We may want to train them both jointly, separately, or in phases. If we pre-train P(h|x) on molecules without measurements, that amounts to semi-supervised learning.

In a training loop to solve the task of regressing m from a training set D_t = {X_t, M_t}, we may want to account for having access to a background dataset D_b = {X_b} without measurements but with molecular graphs.

The desired training loop now allows us to potentially pre-train or jointly train a model which can learn from both sources of data.

Our targeted output is a model P(m|x) = ∫ P(m|h) P(h|x) dh which, applied to a test set, works much better after having ingested all the data available to us.
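To make this decomposition concrete, here is a minimal sketch in PyTorch. This is not pinot code: the class names are hypothetical, and the integral over h is collapsed to a deterministic point estimate of h for simplicity.

```python
import torch
import torch.nn as nn
import torch.distributions as dist

class RepresentationModel(nn.Module):
    """Stand-in for P(h|x); a real version would consume a molecular graph."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())

    def forward(self, x):
        return self.net(x)

class MeasurementModel(nn.Module):
    """Stand-in for P(m|h); a probabilistic linear head returning a Normal."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, 1)
        self.log_sigma = nn.Linear(hidden_dim, 1)

    def forward(self, h):
        return dist.Normal(self.mu(h), self.log_sigma(h).exp())

class TwoPartModel(nn.Module):
    """Composes P(m|x) from the two parts; trained by maximizing log-likelihood."""
    def __init__(self, representation, measurement):
        super().__init__()
        self.representation = representation
        self.measurement = measurement

    def condition(self, x):
        return self.measurement(self.representation(x))

    def loss(self, x, m):
        return -self.condition(x).log_prob(m).mean()
```

Training on the labeled set then amounts to minimizing the loss on (x, m) pairs, while background data would only ever touch the representation part.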

In this issue/thread, I suggest we link to the code and discuss how to create this loop concretely based on a concrete application example with molecules.

Missing pieces:

karalets commented 4 years ago

@yuanqing-wang Can you please comment on how this matches your thoughts and, if not, what we should change in the overall desiderata?

Then we can talk about how you intend to or already have structured this.

yuanqing-wang commented 4 years ago

@karalets

I like your idea of the overall structure. Separating x -> h and h -> m sounds like a reasonable thing to do since we can play with each part afterwards.

Meanwhile, for training and testing I guess we only need something as simple as functions that take models and weights and return metrics. I'll put stuff here: https://github.com/choderalab/pinot/blob/master/pinot/app/utils.py to incorporate both x -> h and h -> m.

karalets commented 4 years ago

@yuanqing-wang what do you mean by weights? I would abstract away from weights and think about model parameters or model posteriors quite independently; to a first approximation, it should be up to the model class to decide what it wants to serialize.

And in utils I don't see the concrete link to dataset creation, i.e. a 20% split or whatnot. I suggest also accounting for: a validation set, data loaders, an interface for passing an arbitrary model class with a defined API into the trainer/tester, ...

I would prefer that function to become something with a specified API for this problem: we pass in models that conform to the API, push a button, and get a few metrics, e.g. test log-likelihood.

Could you build a loop and an experiment file which, for the simplest off-the-shelf model, does that and runs the entire thing?

I.e. it would be good if we get to an abstraction that allows us to define an experiment as follows or similar; my main point is to modularize heavily.

```python
def experiment_1(args):
    model = Model1
    dataset_train = ...
    dataset_background = ...
    hyperparameters = args....
    out_path = args.experiment_path

    # if this is semi-supervised do this
    # if it were not semi-supervised there could also be a run_experiment(...)
    # that only does the other stuff or so

    results = run_ss_experiment(...)
    plot_figures(results, out_path)
```

And in order to test this, there should be, from the beginning, a concrete instance of such an experiment that one can run.

yuanqing-wang commented 4 years ago

The data utils are supplied separately here: https://github.com/choderalab/pinot/blob/master/pinot/data/utils.py

karalets commented 4 years ago

Cool, can we have an experiment file that brings everything together and executes a full example of it all, similar to what I described above?

yuanqing-wang commented 4 years ago

Working on it

karalets commented 4 years ago

> @karalets
>
> I like your idea of the overall structure. Separating x -> h and h -> m sounds like a reasonable thing to do since we can play with each part afterwards.
>
> Meanwhile, for training and testing I guess we only need something as simple as functions that take models and weights and return metrics. I'll put stuff here: https://github.com/choderalab/pinot/blob/master/pinot/app/utils.py to incorporate both x -> h and h -> m.

Just to clarify: I do suggest separating them not necessarily in the model, but rather accounting for the existence of both, so that maybe they are trained separately, maybe jointly, but in any case they need to have a consistent API for the data each part needs to see. In fact, I believe building both into a joint model will work best, but we still need to have datasets in there that can supervise each aspect.

Consider this an instance of data oriented programming, rather than a deep learning model with different phases.

yuanqing-wang commented 4 years ago

@karalets

I incorporated your idea here: https://github.com/choderalab/pinot/blob/master/pinot/net.py

and the training pipeline now looks like this: https://github.com/choderalab/pinot/blob/master/pinot/app/train.py

Let me know your thoughts on this.

karalets commented 4 years ago

> @karalets
>
> I incorporated your idea here: https://github.com/choderalab/pinot/blob/master/pinot/net.py
>
> and the training pipeline now looks like this: https://github.com/choderalab/pinot/blob/master/pinot/app/train.py
>
> Let me know your thoughts on this.

Great start!

In the Net class I would make the representation and parametrization objects concrete. I.e. you can play the inheritance game and create explicit classes that inherit from Net and have a concrete form; otherwise you do not win much here. I would also suggest not calling the top layer parametrization, but rather something like regressor_layer or measurement_model, in contrast to the other, more lucid name representation_layer or representation that you currently use; parametrization is pretty misleading as a name.

Regarding the loop: I would still recommend factoring out an experiment class which has some more modularity.

I.e. in your current loop you do a lot of things in one larger script: defining the model layers, building the model, training, etc. In a better universe, training and experiment setup are factored out.

Currently, unlike the suggestion above, you also do not have the potential for semi-supervised learning in there, even if you wanted to use it.

Think about wanting to define an experiment which can differ in the following ways:

  • use 20% more training data, but the same settings otherwise
  • use or do not use semi-supervised data, same otherwise
  • use a particular semi-supervised background dataset or another one, but the same main training set
  • try the same data settings but different models
  • play with hyperparameter selection for each experiment
  • get new metrics for all of the versions of the above when you have pre-trained models lying around
  • have new test data that may arrive
  • think about a joint model over representation and regression vs. a disjoint model; how can you still do all you want?
  • ...

Your experiment runner, trainer, etc. should make such changes easy and clear; I suggest you think backwards from the results you anticipate wanting to the structure here.

As I said, I recommend factoring things out a bit more than you have, but this is surely a good direction.

karalets commented 4 years ago

One can also factor the loop out into:

  • experiment files contain all the settings (model settings, data settings, model hyper-parameters, storage paths and names for relevant output files) and receive inputs from args
  • trainer files receive experiment files as args and produce trained objects according to the settings
  • tester files run pre-trained objects on test data and run eval methods
  • eval methods receive metrics and predictions according to some API and do stuff that generates numbers
  • plotting methods visualize the eval output

We can improve on this I am sure, but I would imagine making this modular will very quickly yield benefits.
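For concreteness, here is a minimal skeleton of that factorization. All names are hypothetical (this is not pinot code): the experiment definition holds all settings and is the single auditable object, while trainer, tester, eval, and plotting stay separate and swappable.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ExperimentConfig:
    """Experiment file: all settings for one run, filled from args."""
    model_name: str
    train_path: str
    background_path: Optional[str] = None  # unlabeled graphs for semi-supervision
    learning_rate: float = 1e-3
    n_epochs: int = 100
    out_path: str = "results/"

def run_one_experiment(config, build_model, trainer, tester, evaluator, plotter):
    """Ties the pieces together; each piece can be swapped independently."""
    model = build_model(config)
    model = trainer(model, config)        # trainer: produces a trained object
    predictions = tester(model, config)   # tester: runs on held-out data
    metrics = evaluator(predictions)      # eval: turns predictions into numbers
    plotter(metrics, config.out_path)     # plotting: visualizes the eval output
    return metrics

# asdict(config) gives a serializable, auditable record of the run's settings.
```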

karalets commented 4 years ago

One cool example of factorization is the dataloaders etc. in PyTorch:

https://pytorch.org/docs/stable/data.html

You can define in separate classes things like:

Then objects like this dataloader are passed to trainer classes, which tie them to models and deliver batches for training. The dataloader class can be kept invariant to compare all kinds of models while having an auditable 'version' of the training data and pre-processing. In our case, I would like the experimental setup and choices to be auditable by being stored in some experiment definition whose attributes can be changed for comparing different experiments.

If you prefer not to use as much bespoke PyTorch, that is fine; I am just suggesting looking at examples of how modern ML software handles separation of concerns.
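To illustrate the pattern with torch.utils.data (toy random tensors here, not pinot's data utils): one dataset for labeled molecules and one for background molecules without measurements, both reusable unchanged across every model we want to compare.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LabeledMolecules(Dataset):
    """Molecules with measurements: yields (features, measurement) pairs."""
    def __init__(self, xs, ys):
        self.xs, self.ys = xs, ys

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, idx):
        return self.xs[idx], self.ys[idx]

class BackgroundMolecules(Dataset):
    """Molecules without measurements: yields features only."""
    def __init__(self, xs):
        self.xs = xs

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, idx):
        return self.xs[idx]

# The same loaders can then be handed to any trainer that respects this API.
labeled_loader = DataLoader(
    LabeledMolecules(torch.randn(100, 16), torch.randn(100, 1)),
    batch_size=32, shuffle=True)
background_loader = DataLoader(
    BackgroundMolecules(torch.randn(1000, 16)),
    batch_size=32, shuffle=True)
```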

yuanqing-wang commented 4 years ago

> In the Net class I would make the representation and parametrization objects concrete. I.e. you can play the inheritance game and create explicit classes that inherit from Net and have a concrete form; otherwise you do not win much here.

Not sure if I follow; the objects are taken as parameters here.

yuanqing-wang commented 4 years ago

I'll further factorize the experiment

karalets commented 4 years ago

> > In the Net class I would make the representation and parametrization objects concrete. I.e. you can play the inheritance game and create explicit classes that inherit from Net and have a concrete form; otherwise you do not win much here.
>
> Not sure if I follow; the objects are taken as parameters here.

Yes, the objects are parameters, and that is very nice and would already suffice if the experiment file factors things sufficiently. An option would be to just create, for each combination of objects, a particular subclass, as other things may also change.

But that is unnecessary for now, as we can do all of that later; I am OK with it.

yuanqing-wang commented 4 years ago

@karalets

Would something like this be a bit better? https://github.com/choderalab/pinot/blob/master/pinot/app/train.py

karalets commented 4 years ago

I am still unsure if you can do the cases described below.

> Think about wanting to define an experiment which can differ in the following ways:
>
> • use 20% more training data, but the same settings otherwise
> • use or do not use semi-supervised data, same otherwise
> • use a particular semi-supervised background dataset or another one, but the same main training set
> • try the same data settings but different models
> • play with hyperparameter selection for each experiment
> • get new metrics for all of the versions of the above when you have pre-trained models lying around
> • have new test data that may arrive
> • think about a joint model over representation and regression vs. a disjoint model; how can you still do all you want?
> • ...

yuanqing-wang commented 4 years ago

These could be done by simply changing some args in the script:

  • use 20% more training data, but the same settings otherwise
  • try the same data settings but different models
  • play with hyperparameter selection for each experiment

yuanqing-wang commented 4 years ago

The rest can be done by using the APIs, but with small twists in the scripts.

karalets commented 4 years ago

Ok, could you run a test-playthrough with an off-the-shelf semi-supervised model, i.e. the one from the paper?

yuanqing-wang commented 4 years ago

Semi-supervised learning has not been implemented yet. Should that be our next step?

karalets commented 4 years ago

I believe it serves to make the pipeline more complete, and step 1 should be to have a robust skeleton of the pipeline and examples of the types of workflows we may need.

I think you will understand my asks for more modularization a bit better when you build semi-supervised in there.

Thus: yes, let's proceed to having an example of SS training.

Ideally you could make two examples: one with and one without SS aspects, both using the same training data and as much of the same infrastructure as possible. I.e. ideally the differences only live in the arguments passed to the experiment code.
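As a rough sketch of that, assuming hypothetical command-line flags and reusing the run_experiment / run_ss_experiment / plot_figures helpers from the pseudocode earlier in this thread (stubbed here), the two example runs could differ only in whether a background dataset is passed:

```python
import argparse

def run_experiment(args): ...       # supervised-only pipeline (stub)
def run_ss_experiment(args): ...    # semi-supervised pipeline (stub)
def plot_figures(results, out_path): ...  # plotting (stub)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-data", required=True)
    parser.add_argument("--background-data", default=None,
                        help="unlabeled graphs; omit for a purely supervised run")
    parser.add_argument("--out-path", default="results/")
    args = parser.parse_args()

    # the two runs share training data and infrastructure; only this branch differs
    if args.background_data is None:
        results = run_experiment(args)
    else:
        results = run_ss_experiment(args)
    plot_figures(results, args.out_path)

if __name__ == "__main__":
    main()
```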

karalets commented 4 years ago

Hey @yuanqing-wang, do we have at this point a little toy/sandbox example that one could test and run on a laptop in a closed loop? I'd like to play with some of the problems with NN training in a toy example that is easy to re-run.

jchodera commented 4 years ago

Not quite yet, I think. We have the beginnings of this, but I think we're hoping @dnguyen1196 can dive in and get this part going!

karalets commented 4 years ago

I am tagging @dnguyen1196 here to read through the beginning as this issue explains a lot of what is going on here.

dnguyen1196 commented 4 years ago

@karalets @yuanqing-wang

To recap and please correct me, it seems that the goals when this issue was created were:

So within this issue, perhaps two subtasks remain:

  • Add more fine-grain testing capabilities to the current experiment infrastructure
  • More cleanly separate between parameterization and representation.

karalets commented 4 years ago

Hey,

You understand the issue here quite well. There are some subtleties with respect to how to specify the remaining subtasks.

> So within this issue, perhaps two subtasks remain:
>
> • Add more fine-grain testing capabilities to the current experiment [infrastructure]

Absolutely correct. We need to be able, as I have described above, to add "background" data to inform the representation and train the whole thing nicely together. In addition, I would argue, as mentioned in issue #26, that we should also first individually test components that would do unsupervised or self-supervised learning to learn representations, so we can target a reasonable set of things to plug in here. However, in the literature this is oftentimes also treated as a joint training process in a graphical model that has more or less evidence at some of the involved variables; see for instance https://arxiv.org/abs/1406.5298 and newer literature along those lines, https://arxiv.org/abs/1706.00400.

(https://github.com/choderalab/pinot/blob/master/pinot/app/experiment.py)

> • More cleanly separate between parameterization and representation.

I would not go that route quite yet; I would prefer to be agnostic as to whether the model makes these things communicate uncertainty or not. There may be model classes that have their own way of incorporating one or more variables. Imagine you have a net class with a method net.train(X, Y) where, when you set Y=None, it just updates the parts it needs.

Another approach may be to hackily pretrain two separate objectives, one just for the representation and one for the measurement term, which are then plugged together correctly according to the degree of supervision in the observed tuple.

The shared API in the infrastructure should make both types of workflows usable, so I would focus on that API and infrastructure first, with a concrete example with real data.

I envision that first pre-training some representation based on background data and then finetuning it on labeled data is ok as a start, but keep in mind we may want to train jointly later with a more rigorous treatment of semi-supervised learning.
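A rough sketch of that two-phase workflow (hypothetical model API, not pinot's): phase 1 fits the representation on background graphs with some unsupervised objective, phase 2 fits the full net on labeled pairs.

```python
import torch

def pretrain_representation(representation, unsupervised_loss, background_loader,
                            n_epochs=10, lr=1e-3):
    """Phase 1: fit the representation on unlabeled graphs only.

    unsupervised_loss is a hypothetical callable returning a scalar loss tensor.
    """
    optimizer = torch.optim.Adam(representation.parameters(), lr=lr)
    for _ in range(n_epochs):
        for x in background_loader:
            optimizer.zero_grad()
            unsupervised_loss(representation, x).backward()
            optimizer.step()
    return representation

def finetune(net, labeled_loader, n_epochs=10, lr=1e-4):
    """Phase 2: fit representation and measurement head jointly on labeled data.

    net.loss(x, m) is assumed to return the negative log-likelihood, along the
    lines of the sketch earlier in this thread.
    """
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(n_epochs):
        for x, m in labeled_loader:
            optimizer.zero_grad()
            net.loss(x, m).backward()
            optimizer.step()
    return net
```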

We should discuss and iterate on a concrete version of this more, but we also need a separate process to just evaluate the different unsupervised models as mentioned in #26 .

dnguyen1196 commented 4 years ago

> Absolutely correct. We need to be able, as I have described above, to add "background" data to inform the representation and train the whole thing nicely together.

What do you mean by this, @karalets? Is the following interpretation correct? For example, say we have 1000 compounds with their associated properties. We first use this as "background" data where we train, for example, an unsupervised representation, so that we get a "reasonable" representation first (and do not touch the parameterization). Then, after we have obtained this reasonable representation, we train both the representation and parameterization jointly on the prediction task (supervised).

karalets commented 4 years ago

> > Absolutely correct. We need to be able, as I have described above, to add "background" data to inform the representation and train the whole thing nicely together.
>
> What do you mean by this, @karalets? Is the following interpretation correct? For example, say we have 1000 compounds with their associated properties. We first use this as "background" data where we train, for example, an unsupervised representation, so that we get a "reasonable" representation first (and do not touch the parameterization). Then, after we have obtained this reasonable representation, we train both the representation and parameterization jointly on the prediction task (supervised).

Sorry, to be precise: By "background data" I mean data for which we only have graphs, not the measurements/properties, i.e. background molecules that are not the data we are collecting measurements for, but we know exist as molecules.

Intuitively: we need graphs to train "representations", and matched "measurements" to train likelihoods/observation terms ("parametrizations", although I prefer to fade this term out).

In my world we could consider all of this to be training data, but sometimes we only observe X and sometimes we observe the tuple (X, Y) to train our models, and we want to make the best of both.

dnguyen1196 commented 4 years ago

@karalets @yuanqing-wang

> Intuitively: we need graphs to train "representations", and matched "measurements" to train likelihoods/observation terms ("parametrizations", although I prefer to fade this term out). In my world we could consider all of this to be training data, but sometimes we only observe X and sometimes we observe the tuple (X, Y) to train our models, and we want to make the best of both.

Ok, I see your point now. In that regard, I think we might need to modify two interfaces; let me know what you think and whether I should start a new issue/discussion on this.

  1. Net: Right now net.loss(g, y) takes in two arguments:

```python
def loss(self, g, y):
    distribution = self.condition(g)
    return -distribution.log_prob(y)
```

So we can modify this function so that for the case when y = None, we only compute "loss" for the representation layer.
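One possible shape for that change, as a sketch only (unsupervised_loss is a hypothetical method the representation component would need to expose):

```python
def loss(self, g, y=None):
    if y is None:
        # unlabeled graph: only the representation part contributes a loss
        return self.representation.unsupervised_loss(g)
    # labeled graph: the usual negative log-likelihood of the measurement
    distribution = self.condition(g)
    return -distribution.log_prob(y)
```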

  2. For the experiment.py interface, I think we have two options:

2a. Add TrainUnsupervised, TestUnsupervised, etc. (basically, for every current supervised training/testing class, we need a corresponding class for unsupervised training). This will probably repeat a lot of code, but supervised and unsupervised training will involve different optimizers and potentially very different choices of hyperparameters. If we have separate unsupervised and supervised classes, we can then have another class that combines the supervised and unsupervised components.

2b. Modify the current Train and Test classes so that they accommodate both unsupervised and supervised training. This will involve modifying the current constructor to take in more arguments (an optimizer for unsupervised vs. supervised training, hyperparameters for unsupervised training). And within the class implementation, more care is needed to make sure the training/testing steps are in the right order.

I think 2a is better: although we repeat more code, the modularity allows us to do more fine-grained training/testing.
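A rough sketch of option 2a, with hypothetical class names: a separate unsupervised trainer that keeps its own optimizer and hyperparameters, plus a thin wrapper that runs the unsupervised phase before the supervised one.

```python
class TrainUnsupervised:
    """Unsupervised phase: trains on background graphs with no measurements."""
    def __init__(self, net, background_loader, optimizer, n_epochs):
        self.net, self.loader = net, background_loader
        self.optimizer, self.n_epochs = optimizer, n_epochs

    def run(self):
        for _ in range(self.n_epochs):
            for g in self.loader:
                self.optimizer.zero_grad()
                # relies on the y=None branch of net.loss sketched above
                self.net.loss(g, None).backward()
                self.optimizer.step()
        return self.net

class TrainSemiSupervised:
    """Combines an unsupervised and a supervised trainer in sequence."""
    def __init__(self, unsupervised_trainer, supervised_trainer):
        self.unsupervised_trainer = unsupervised_trainer
        self.supervised_trainer = supervised_trainer

    def run(self):
        self.unsupervised_trainer.run()
        return self.supervised_trainer.run()
```

Each trainer keeping its own optimizer and hyperparameters is exactly the modularity argument for 2a over 2b.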