choderalab / pinot

Probabilistic Inference for NOvel Therapeutics
MIT License

Quick fix to make semi-supervised active learning experiment work #73

Closed dnguyen1196 closed 4 years ago

dnguyen1196 commented 4 years ago

@miretchin

In the semi-supervised experiment, some of the y labels will be None. torch.tensor(ys) raises an error if the array contains None, so we need a way to work with arrays that contain None. In particular, torch.max no longer works, but we can replace it with max([y for y in ys if y is not None]) for now.
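A minimal repro of the failure and the workaround (toy values):

import torch

ys = [0.1, None, 0.5]

# torch.tensor(ys) raises TypeError because the list contains None,
# so torch.max, which needs a tensor, is ruled out as well

best = max([y for y in ys if y is not None])  # 0.5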


In plotting_active.py

actual_sol = max([y for y in ys if y is not None])

And when collecting the results at the end:

            x = bo.run(num_rounds=self.num_rounds)

            # pad if experiment stopped early
            # candidates_acquired = limit + 1 because we begin with a blind pick
            results_shape = self.num_rounds * self.q + 1
            results_data = actual_sol*np.ones(results_shape)

            # keep only the acquired candidates that actually have labels
            y_not_none = torch.tensor([ys[i] for i in x if ys[i] is not None])
            # running best over the labelled picks; the rest stays padded at actual_sol
            results_data[:len(y_not_none)] = np.maximum.accumulate(y_not_none.cpu().squeeze())

One point of concern: the current code uses results_data[:len(x)] = np.maximum.accumulate(ys[x].cpu().squeeze()), so it is unclear whether switching to results_data[:len(y_not_none)] (which may fill fewer elements) will break downstream data visualization code.
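For what it's worth, here is a toy example (made-up numbers) of how the padding behaves: the trace keeps its full length, and the unfilled tail stays at actual_sol, so plotting code that assumes a fixed-length trace should still index safely, though the tail jumping to the optimum early may still look odd.

import numpy as np

actual_sol = 0.9
results_shape = 5                        # num_rounds * q + 1 in the real code
results_data = actual_sol * np.ones(results_shape)

y_not_none = np.array([0.2, 0.5, 0.4])   # fewer labelled picks than slots
results_data[:len(y_not_none)] = np.maximum.accumulate(y_not_none)

print(results_data)  # [0.2 0.5 0.5 0.9 0.9]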


In experiment.py, the blind pick needs to re-pick if it selects an unlabelled data point (a toy run of the loop follows the snippets below):

    def blind_pick(self, seed=2666):
        ...
        best = random.choice(self.new)
        # needed in semi-supervised active learning: the first blind pick
        # might land on a sample without a measurement
        while self.data[1][best] is None:
            best = (best + 1) % len(self.data[1])

        self.old.append(self.new.pop(best))
        return best

    def update_data(self):
        """ Update the internal data using old and new.
        """
        # grab new data
        self.new_data = self.slice_fn(self.data, self.new)

        # grab old data
        self.old_data = self.slice_fn(self.data, self.old)

        # set y_max
        gs, ys = self.old_data
        # torch.max(ys) fails when ys contains None, so filter first
        self.y_best = max([y for y in ys if y is not None])
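A self-contained toy run of the re-pick loop in blind_pick (made-up data):

import random

data = (["g0", "g1", "g2"], [None, 0.3, None])  # (graphs, labels)
new = [0, 1, 2]

random.seed(2666)
best = random.choice(new)
while data[1][best] is None:
    best = (best + 1) % len(data[1])

print(best)  # always ends on index 1, the only labelled point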

In experiment.py/__init__

        # early stopping
        self.early_stopping = early_stopping
        # self.best_possible = torch.max(self.data[1])  # replaced with the None-safe max below
        self.best_possible = max([y for y in self.data[1] if y is not None])

Also, the tensor slicing function needs to handle plain lists:

def _slice_fn_tensor(x, idxs):
    # ys stays a plain Python list when it contains None,
    # so fall back to a comprehension instead of fancy indexing
    if isinstance(x, list):
        return [x[idx] for idx in idxs]
    return x[idxs]
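For illustration (toy values), the list branch matters because fancy indexing fails on plain lists:

ys = [0.1, None, 0.5]
print(_slice_fn_tensor(ys, [0, 2]))  # [0.1, 0.5]; ys[[0, 2]] would raise TypeError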
miretchin commented 4 years ago

We discussed this on Slack, but I'll add it here for posterity. Thanks for raising the issue. My alternative proposal is to set unlabeled data to np.nan:

import torch
import numpy as np

ys = torch.Tensor([0., 0.3, 0.5, np.nan])
k = 2
torch.topk(ys[~torch.isnan(ys)], k)  # top-k over the nan-filtered entries

Doing this would let you use conventional batching etc., just as for a normal tensor.
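To connect the two proposals, here is a sketch (names made up) of converting None labels to nan once, up front, so the rest of the pipeline stays tensor-based:

import torch
import numpy as np

ys_list = [0., 0.3, None, 0.5]
ys = torch.tensor([np.nan if y is None else y for y in ys_list])

y_best = ys[~torch.isnan(ys)].max()  # tensor(0.5000)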

But I'm not sure this is actually as big a problem as you might think.

See the semi-supervised BO experiment object I created: as constructed, self.old_data is always labeled, because data is only revealed to us as it moves from self.new to self.old.

https://github.com/choderalab/pinot/blob/d4cce4d49d7cd55bd0d92b75b68e49d0fc860a6a/pinot/active/experiment.py#L221