bbglab / intogen-plus

A framework for automatic and comprehensive knowledge extraction based on mutational data from sequenced tumor samples from patients.
https://www.intogen.org/search

IntOGen Plus | Grid search for fitting the optimal method combination is inefficient and not scalable #11

Open koszulordie opened 9 months ago

koszulordie commented 9 months ago

The current approach of scanning the grid of possible weight vectors to find an optimal method combination is inefficient and expected to blow up as we add new methods to the combination pool. Instead of an exhaustive grid search with a predefined grid step, we can use a space-filling random search strategy: a technique that randomly samples vectors from the grid in such a way that subsequent samples avoid the regions already covered by previous ones.

One common approach to do this is based on Latin hypercubes. It is already implemented in the Python package scikit-optimize; specifically, we can make use of the skopt.sampler.Lhs class: https://scikit-optimize.github.io/stable/modules/generated/skopt.sampler.Lhs.html
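
For reference, a minimal sketch of how that class is used (the 8 dimensions and 1000 samples are illustrative values, not project settings):

    from skopt.sampler import Lhs
    from skopt.space import Space

    # Latin hypercube sampling over the unit hypercube
    space = Space([(0.0, 1.0)] * 8)
    lhs = Lhs(lhs_type='classic', criterion='maximin', iterations=1000)
    # returns a list of 1000 space-filling vectors of length 8
    samples = lhs.generate(dimensions=space.dimensions, n_samples=1000)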

FedericaBrando commented 8 months ago

Extract of a conversation that may be relevant; keeping it here for posterity.

I need to set up the space with n dimensions equal to the number of methods (e.g. 8) and then sample as many times as I want (e.g. 10000). Then apply the constraints to verify whether each vector fulfils them. The bottleneck is indeed the number of samples, because I will be iterating through them, but the worst case is O(samples), which is definitely lower than 21^methods.
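
For scale, assuming the 21 comes from a 0.05 grid step per weight (21 values per dimension):

    >>> 21 ** 8   # exhaustive grid points for 8 methods, versus ~10,000 LHS samples
    37822859361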

I was thinking, since the constraints are the following:

    """
    These are the constraints that must be in place:
    constraint 1: sum w_i = 1
    constraint 2: w_i >= 0.05 for all w_i may not apply if some w_i is discarded
    constraint 3: w_i <= 0.3 for all w_i

    """

Maybe we can set the space to be (0, 0.3), given the third constraint: what is the point of looking for a weight higher than 0.3 if we are going to discard it anyway?

This way we don't need to do any "normalization" of the vector and can rely on the result of the sampling, since the chance of hitting a vector that fulfils the first if condition (if sum(w) <= 1 - LOWER_BOUND, where LOWER_BOUND is 0.05) is higher.

Here is the code:


    import numpy as np
    from skopt.sampler import Lhs
    from skopt.space import Space

    # sample one dimension fewer than the number of retained methods:
    # the last weight is fixed below by the sum-to-one constraint
    dim = len(METHODS) - len(low_quality_index)

    space = Space([(0.0, 0.3)] * (dim - 1))
    lhs = Lhs(lhs_type='classic', criterion='maximin', iterations=1000)

    for w in lhs.generate(dimensions=space.dimensions, n_samples=1000):
        if sum(w) <= 1 - LOWER_BOUND:                  # inside the simplex
            w_dim = list(np.append(w, [1 - sum(w)]))   # consequently len(w_dim) == dim
            if all_constraints(w_dim):
                w_all = fill_with_zeros(w_dim, low_quality_index)
                f = func(w_all)
                if optimum['Objective_Function'] > f:  # remember we are running a minimization
                    optimum['Objective_Function'] = f
                    for i, v in enumerate(METHODS):
                        optimum[v] = w_all[i]

Could this make sense?

koszulordie commented 8 months ago

We need the weights we sample $w_i$ to be specified as proportions of the total weight, meaning $\sum_i w_i = 1$, in order to apply the constraints on the relative voting rights that the methods are allowed to hold when they are combined. If we restrict our search to $(0, 0.3)^n$ instead of $(0,1)^n$ we still have the problem that we need to convert the vector into proportions and make sure they satisfy the additional constraint $0.05 \leq w_i \leq 0.3$ for each $i$. In other words, sampling from $(0, 0.3)$ or from $(0.05, 0.3)$ does not guarantee that the respective weight, once converted to a proportion, will satisfy the constraints. Unfortunately, the method we are using does not support LHS sampling whilst satisfying the constraints, a.k.a. constrained LHS sampling, and I have not been able to find any ready-to-use solution.
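
To illustrate, a minimal sketch (the bounds and dimensions are the values discussed above; the sample count is made up): even samples drawn strictly inside the box bounds can violate those same bounds once converted to proportions.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 8
    # draw samples inside the box constraints, then convert to proportions
    w = rng.uniform(0.05, 0.3, size=(10_000, n))
    p = w / w.sum(axis=1, keepdims=True)
    # fraction of normalized vectors still satisfying 0.05 <= p_i <= 0.3
    ok = ((p >= 0.05) & (p <= 0.3)).all(axis=1).mean()
    print(f"{ok:.1%} of normalized samples satisfy the per-weight bounds")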

FedericaBrando commented 8 months ago

Okay, so if I am understanding correctly, basically restricting the space to $(0,0.3)$ is just rescaling the problem, not solving it.

My concern comes from the first if condition:

if sum(w) <= 1 - LOWER_BOUND:  # 0.95

I was running an LHS sampling with $n=10000$ and it never found a $w$ with $\sum_i w_i = 1$, so I looked at other implementations of the method and came across the SciPy one: here

Reading the example they mention, taken from Latin hypercube sampling and the sensitivity analysis of a Monte Carlo epidemic model, I was thinking that maybe we could borrow the idea of rescaling the $w$ dimensions to the appropriate bounds $(0.05, 0.3)$ using the built-in qmc.scale(), and then apply the constraints. But, as before, maybe something that makes sense to me does not actually apply to our problem.
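
For reference, a minimal sketch of that idea (d = 8 and n = 1000 are illustrative values; note that constraint 1 is not guaranteed by the rescaling and still has to be filtered afterwards):

    from scipy.stats import qmc

    LOWER_BOUND = 0.05
    d = 8  # illustrative number of methods

    # LHS in the unit hypercube, rescaled to the (0.05, 0.3) bounds with qmc.scale()
    sampler = qmc.LatinHypercube(d=d, seed=42)
    samples = qmc.scale(sampler.random(n=1000), [LOWER_BOUND] * d, [0.3] * d)

    # constraint 1 (the weights summing to 1) must still be checked afterwards
    feasible = samples[samples.sum(axis=1) <= 1 - LOWER_BOUND]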

FedericaBrando commented 8 months ago

Implementation:

Testing phase:

Combination - test CBIOP_WXS_PRAD_BROAD

| Setup | Runtime |
| --- | --- |
| 7 methods - exhaustive | 1h 4m 53s |
| 8 methods - exhaustive | 7h 19s |
| 8 methods - LHS | 15m 19s |
migrau commented 7 months ago

An additional test could be to run intogen on the beeGFS partition of the new IRB cluster.

FedericaBrando commented 7 months ago

Recap - meeting with Ferran - 16/01

Although the speed improved significantly with the implementation of LHS sampling, the longest run takes approximately 1d 12h. Still a lot.

Discussing with Ferran, we came up with the following steps to optimize the computational cost of the method:

  1. Keep only the LHS optimization and get rid of the optimize_with_seed() https://github.com/bbglab/intogen-plus/blob/7564e7c97fd9d69e29f86a5ec527c680171b93f7/combination/intogen_combination/grid_optimizer.py#L245-L248

  2. Store the change of optimum at every iteration in order to analyse it afterwards (see the sketch after this list). https://github.com/bbglab/intogen-plus/blob/7564e7c97fd9d69e29f86a5ec527c680171b93f7/combination/intogen_combination/grid_optimizer.py#L201-L204

  3. Try different numbers of samples (100 - 1000) - ideally a number dependent on the dimension of the simplex that ensures an even distribution of the samples, in order to maximize the exploration (correct me if I'm wrong).

  4. Run with TCGA and HARTWIG

  5. Check for time improvement - compare with v2023 and dev-O3D

  6. Check for changes in results between runs
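
A minimal sketch of point 2 (the names history and candidate_weight_vectors are hypothetical, not the actual grid_optimizer.py code):

    # record every improvement of the optimum for later analysis;
    # `candidate_weight_vectors` stands in for the LHS-generated candidates
    history = []

    for iteration, w_all in enumerate(candidate_weight_vectors):
        f = func(w_all)
        if f < optimum['Objective_Function']:
            optimum['Objective_Function'] = f
            history.append({'iteration': iteration,
                            'objective': f,
                            'weights': list(w_all)})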