bbglab / intogen-plus

a framework for automatic and comprehensive knowledge extraction based on mutational data from sequenced tumor samples from patients.
https://www.intogen.org/search

IntOGen plus | CBase error few mutated cohort #4

Closed FedericaBrando closed 10 months ago

FedericaBrando commented 11 months ago

Stefano reported an error concerning CBaSE:

In certain cohorts, CBaSE reports all genes as significant, all with the same very low q-value.

On this topic, the CBaSE version that we use in IntOGen differs from the one currently on their website (http://genetics.bwh.harvard.edu/cbase/downloads.html). We use v1.0, while their website provides v1.1. I am still looking for some sort of release notes to understand what they have changed.

Image

According to Ferran: let's see what is in v1.1, because the modifications that we implemented were direct advice from Donate Weghorn, the method's author, herself, and chances are that they have incorporated this type of heuristic.

List of cohorts that show the problem:

Image

FedericaBrando commented 11 months ago

CBaSE v.1.1 <-> CBaSE v1.0

Code folder

In intOGen we use CBaSE dataset v1.1 and CBaSE code v1.0.

FedericaBrando commented 11 months ago

Roadmap:

  1. [x] test several of the bugged cohorts with code CBaSE1.1
  2. [x] look at differences between code v1.0 and v1.1

FedericaBrando commented 11 months ago

Between CBaSE v1.0 and CBaSE v1.1, the main differences (leaving aside our own tuning of the code) are the following:

CBaSE v.1.0 vs CBaSE v1.1 - parameters

    genes_by_sobs = [[ka, len(list(gr))] for ka, gr in it.groupby(sorted(
                              genes, key=lambda arg: int(arg["obs"][2])), key=lambda arg: int(arg["obs"][2]))]

    summe = 0.
    if modC == 1:
-       for gind in range(len(genes)):
+       for sval in genes_by_sobs:
            s = sval[0]
        [...]
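
To make the loop change above easier to read, here is a minimal sketch of what `genes_by_sobs` contains and why iterating over it can replace a per-gene loop. The toy `genes` list and the `term` function are hypothetical; the real loop body is elided in this thread, so the equivalence shown only holds under the assumption that each gene contributes to the sum through its `s` value alone and that each group is weighted by its gene count.

```python
import itertools as it

# Hypothetical toy input: only the third entry of "obs" is used here.
genes = [
    {"obs": [0, 0, 3]},
    {"obs": [0, 0, 1]},
    {"obs": [0, 0, 3]},
    {"obs": [0, 0, 0]},
]


def s_obs(gene):
    """Grouping key, same expression as in the diff above."""
    return int(gene["obs"][2])


# Same construction as in the diff: [value of s, number of genes with that s].
genes_by_sobs = [
    [ka, len(list(gr))]
    for ka, gr in it.groupby(sorted(genes, key=s_obs), key=s_obs)
]
print(genes_by_sobs)


def term(s):
    """Placeholder for the elided per-gene contribution (an assumption)."""
    return 1.0 / (1 + s)


# v1.0-style: one evaluation per gene.
per_gene = sum(term(s_obs(g)) for g in genes)

# v1.1-style: one evaluation per distinct s, weighted by the group size.
per_group = sum(count * term(s) for s, count in genes_by_sobs)

print(per_gene, per_group)
```

Note that `it.groupby` only merges adjacent items, which is why the list is sorted with the same key first.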

CBaSE v.1.0 vs CBaSE v1.1 - q_value

# *************** lambda ~ Gamma:
  def pofs(s, L):
-   return (L * b) ** s * (1 + L * b) ** (-s - a) * math.gamma(s + a) / (math.gamma(s + 1) * math.gamma(a))
+   return np.exp(s * np.log(L * b) + (-s - a) * np.log(1 + L * b) + sp.gammaln(s + a) - sp.gammaln(s + 1) - sp.gammaln(a))
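
The two `pofs` expressions above are algebraically the same; the v1.1 form evaluates everything in log space. One plausible effect of that change (not confirmed in this thread as the cause of the reported q-values) is numerical: `math.gamma` overflows for large arguments, while the log-gamma form stays finite. The sketch below illustrates this under some assumptions: `a` and `b` are passed as arguments rather than taken from the enclosing scope, the parameter values are made up, and the standard-library `math.lgamma` is swapped in for `sp.gammaln` to keep the example dependency-free.

```python
import math


def pofs_direct(s, L, a, b):
    """v1.0-style evaluation: direct powers and gamma functions."""
    return ((L * b) ** s * (1 + L * b) ** (-s - a) * math.gamma(s + a)
            / (math.gamma(s + 1) * math.gamma(a)))


def pofs_log(s, L, a, b):
    """v1.1-style evaluation: sum the logs, exponentiate once at the end."""
    return math.exp(s * math.log(L * b)
                    + (-s - a) * math.log(1 + L * b)
                    + math.lgamma(s + a)
                    - math.lgamma(s + 1)
                    - math.lgamma(a))


# Made-up parameter values, for illustration only.
L, a, b = 10.0, 2.0, 0.5

# For a small mutation count the two forms agree.
print(pofs_direct(3, L, a, b), pofs_log(3, L, a, b))

# For a large count the direct form fails, the log form does not.
try:
    pofs_direct(200, L, a, b)
    direct_overflowed = False
except OverflowError:
    direct_overflowed = True

print(direct_overflowed, pofs_log(200, L, a, b))
```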

FedericaBrando commented 11 months ago

Ask Ferran about these modifications, and also about how they differ from our fine-tuning.

FedericaBrando commented 11 months ago

Conclusions:

FedericaBrando commented 10 months ago

Test with all models:

Cohort tested so far -->

None of the results show any weird behaviour.

Test with inherited choice of model given the length of mutations:

Cohort tested so far -->

FedericaBrando commented 10 months ago

Next steps:

FedericaBrando commented 10 months ago

Waiting for permission (group bbg_beataml) to run the 33k samples for analysis.

migrau commented 10 months ago

Reminder sent to IT. The first email was sent on 11 October.

FedericaBrando commented 10 months ago

Run completed:

Completed at: 13-Nov-2023 04:23:39
Duration    : 6d 13h 5m 13s
CPU hours   : 25'009.1 (0% failed)
Succeeded   : 8'005
Ignored     : 6
Failed      : 6