IntOGen plus | CBase error few mutated cohort

FedericaBrando commented 11 months ago

Stefano reported an error concerning CBaSE:

In certain cohorts, it reports every gene as significant, all with the same very low q-value.

On this topic, the CBaSE version that we use in IntOGen differs from the one currently on their website (http://genetics.bwh.harvard.edu/cbase/downloads.html). We use v1.0, while the website offers v1.1. I am still looking for some sort of release notes to understand what they have changed.

long-term solution: migrate our edits to the new version.
short-term solution: report 0 drivers when this case happens.
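
A minimal sketch of what the short-term guard could look like, assuming the q-values are available as a gene-to-q-value mapping. The function names, the `alpha` cutoff and the tolerance are hypothetical and are not taken from the IntOGen or CBaSE code.

```python
import math

def all_genes_degenerate(qvalues, alpha=0.01, rel_tol=1e-9):
    """Return True when every gene has the same q-value and it is below alpha.

    alpha and rel_tol are illustrative defaults, not values used by IntOGen.
    """
    qs = list(qvalues)
    if len(qs) < 2:
        return False
    first = qs[0]
    same = all(math.isclose(q, first, rel_tol=rel_tol, abs_tol=0.0) for q in qs)
    return same and first < alpha

def call_drivers(gene_qvalues, alpha=0.01):
    """Return significant genes, or an empty list for the degenerate case."""
    if all_genes_degenerate(gene_qvalues.values(), alpha):
        return []  # short-term workaround: report 0 drivers
    return sorted(g for g, q in gene_qvalues.items() if q < alpha)

bugged = call_drivers({"G1": 1e-30, "G2": 1e-30, "G3": 1e-30})
normal = call_drivers({"G1": 1e-5, "G2": 0.3, "G3": 0.8})
```

With these toy inputs, the cohort where all genes share one tiny q-value yields no drivers, while the ordinary cohort still reports `G1`.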

According to Ferran: let's see what is in v1.1, because the modifications that we implemented were a direct piece of advice from Donate Weghorn, the method's author, herself, and chances are that they have incorporated this type of heuristic.

list of cohorts that report the problem:

FedericaBrando commented 11 months ago

CBaSE v.1.1 <-> CBaSE v1.0

Code folder

CBase_v1.1.py <-> cbase.py
Auxiliary/CBaSE_v1.1_parameters.py
Auxiliary/CBaSE_v1.1_qvalues.py

In intOGen we use CBaSE dataset v1.1 and CBaSE code v1.0.

FedericaBrando commented 11 months ago

Roadmap:

[x] test several of the bugged cohorts with code CBaSE1.1
[x] look at differences between code v1.0 and v1.1

FedericaBrando commented 11 months ago

Between CBaSE v1.0 and CBaSE v1.1, the main difference (leaving aside our own tuning of the code) is the following:

The code is split into three Python scripts:
- a top-level Python script that calls the two main steps: parameters and qvalues.
- a subfolder named "Auxiliary" that holds the two scripts for those steps.

CBaSE v.1.0 vs CBaSE v1.1 - parameters

In the neg_ln_L function, a genes_by_sobs list of lists is added, and the model 1 loop iterates over it instead of over every gene.

    genes_by_sobs = [[ka, len(list(gr))] for ka, gr in it.groupby(
        sorted(genes, key=lambda arg: int(arg["obs"][2])),
        key=lambda arg: int(arg["obs"][2]))]

    summe = 0.
    if modC == 1:
-       for gind in range(len(genes)):
+       for sval in genes_by_sobs:
            s = sval[0]
            [...]
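
To make the grouping concrete, here is a small sketch of what the `genes_by_sobs` expression produces on toy data. The gene records and the `term` function are made up for illustration. Reading `obs[2]` as the observed synonymous count is an assumption based on the variable name. The weighted sum is likewise only a guess at how the elided loop body uses the counts, and it holds only if the per-gene term depends on `s` alone.

```python
import itertools as it

# Toy gene records (hypothetical); only obs[2] is used by the grouping.
genes = [
    {"gene": "A", "obs": [0, 0, 2]},
    {"gene": "B", "obs": [1, 0, 0]},
    {"gene": "C", "obs": [0, 1, 2]},
    {"gene": "D", "obs": [3, 0, 0]},
    {"gene": "E", "obs": [0, 0, 5]},
]

def s_obs(gene):
    """Grouping key: the third entry of "obs", as in the snippet above."""
    return int(gene["obs"][2])

# Same construction as in the diff: sort by the key, then count each group.
genes_by_sobs = [[k, len(list(g))]
                 for k, g in it.groupby(sorted(genes, key=s_obs), key=s_obs)]

def term(s):
    """Placeholder per-gene term that depends only on s (an assumption)."""
    return s * s + 1.0

# Looping over every gene versus over each distinct s weighted by its count.
per_gene_sum = sum(term(s_obs(g)) for g in genes)
grouped_sum = sum(count * term(s) for s, count in genes_by_sobs)
```

Here `genes_by_sobs` is `[[0, 2], [2, 2], [5, 1]]`, so the loop runs three times instead of five. On a real cohort, where many genes share the same small count, the saving would be larger.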

There is also the option of a model 0, which tries every model.

CBaSE v.1.0 vs CBaSE v1.1 - q_value

In compute_p_values, when model 1 is chosen, the pofs function changes as follows:

# *************** lambda ~ Gamma:
  def pofs(s, L):
-   return (L * b) ** s * (1 + L * b) ** (-s - a) * math.gamma(s + a) / (math.gamma(s + 1) * math.gamma(a))
+   return np.exp(s * np.log(L * b) + (-s - a) * np.log(1 + L * b) + sp.gammaln(s + a) - sp.gammaln(s + 1) - sp.gammaln(a))
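
The two forms are algebraically the same, but they behave differently numerically: `math.gamma` raises `OverflowError` once its argument passes roughly 171, whereas the log-space form only exponentiates at the end. The sketch below illustrates this with the standard library only, using `math.lgamma` as a stand-in for `sp.gammaln`. Passing `a` and `b` as arguments, and the parameter values themselves, are assumptions for the example; in the original code they come from the enclosing scope.

```python
import math

def pofs_direct(s, L, a, b):
    """v1.0-style form: evaluates the gamma functions directly."""
    return ((L * b) ** s * (1 + L * b) ** (-s - a)
            * math.gamma(s + a) / (math.gamma(s + 1) * math.gamma(a)))

def pofs_log(s, L, a, b):
    """v1.1-style form: sums logarithms and exponentiates once."""
    log_p = (s * math.log(L * b)
             + (-s - a) * math.log(1 + L * b)
             + math.lgamma(s + a) - math.lgamma(s + 1) - math.lgamma(a))
    return math.exp(log_p)

# Illustrative parameter values, not fitted ones.
a, b, L = 2.5, 0.8, 3.0

# Small s: both forms give the same probability.
small_direct = pofs_direct(5, L, a, b)
small_log = pofs_log(5, L, a, b)

# Large s: the direct form overflows inside math.gamma.
try:
    pofs_direct(200, L, a, b)
    direct_overflowed = False
except OverflowError:
    direct_overflowed = True

large_log = pofs_log(200, L, a, b)  # still a finite probability
```

Whether this overflow is what produced the degenerate q-values in the affected cohorts is not established here. The example only shows why the v1.1 form is the safer one for large counts.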

FedericaBrando commented 11 months ago

Ask Ferran about these modifications, and about how they differ from our fine-tuning as well.

FedericaBrando commented 11 months ago

Conclusions:

[x] migrate the edits made for v1.0 to v1.1 so that it runs in IntOGen
[x] test the cohorts.

FedericaBrando commented 10 months ago

Test with all models:

Cohorts tested so far:

[x] CBIOP_WXS_ACYC_SANGER_2013
[x] HARTWIG_WGS_MCC_2020
[x] ICGC_WGS_ES_BOCA_FR_AD_2019

None of the results show any weird behaviour.

Test with the inherited choice of model, given the number of mutations:

Cohorts tested so far:

[ ] CBIOP_WXS_ACYC_SANGER_2013
[ ] HARTWIG_WGS_MCC_2020
[x] ICGC_WGS_ES_BOCA_FR_AD_2019

FedericaBrando commented 10 months ago

next steps:

[x] build datasets + containers w/ dev
[x] run PEDBIO + STJUDE

FedericaBrando commented 10 months ago

waiting for permission (group bbg_beataml) to run 33k samples, for analysis

migrau commented 10 months ago

Reminder sent to IT. The first email was sent on 11 Oct.

FedericaBrando commented 10 months ago

run completed

Completed at: 13-Nov-2023 04:23:39
Duration    : 6d 13h 5m 13s
CPU hours   : 25'009.1 (0% failed)
Succeeded   : 8'005
Ignored     : 6
Failed      : 6
