bbglab / intogen-plus

a framework for automatic and comprehensive knowledge extraction based on mutational data from sequenced tumor samples from patients.
https://www.intogen.org/search

IntOGen Plus | Revise `vep.py` VALID_CONSEQUENCE set #13

Closed FedericaBrando closed 6 months ago

FedericaBrando commented 8 months ago

In the vep.py script there is a set, VALID_CONSEQUENCES, that is used as a filter. It was implemented in 2018, but we do not know why.

https://bitbucket.org/intogen/intogen-plus/src/c02f04a6e39ed0c739885ea4779f2bd9ceeb7800/filters/vep.py#lines-6

https://github.com/bbglab/intogen-plus/blob/7564e7c97fd9d69e29f86a5ec527c680171b93f7/core/intogen_core/parsers/vep.py#L13-L18

Ideally we want to get rid of this filter and run boostDM on the unfiltered mutation list. The test will be done on PCAWG cohorts to see whether removing this filter causes failures in any of the methods that use the output of vep.py as input.

  1. Test w/ filter valid_consequence
  2. if it works -> implement it in the pipeline
  3. run v2023 with the unfiltered list of mutations

This issue came up because, in the benchmarking of boostDM, the two mutation.tsv files from boostdm2020 and boostdm2023 did not contain the same mutations. The reason is that the mutation.tsv for BoostDM2020 was not taken from the output of IntOGen2020, since that step was not implemented at the time; it was taken from the dndscv data instead.

For Release2023 of BoostDM, the mutation.tsv was taken directly from the output of IntOGen, which kept this consequence-type filter that should not be there.

FedericaBrando commented 7 months ago

By commenting out the following lines: https://github.com/bbglab/intogen-plus/blob/7564e7c97fd9d69e29f86a5ec527c680171b93f7/core/intogen_core/parsers/vep.py#L45-L49

I ran a one-cohort test with PCAWG_BILIARY_ADENOCA. Some of the missing variants were recovered; others were removed by the second filtering check:

https://github.com/bbglab/intogen-plus/blob/7564e7c97fd9d69e29f86a5ec527c680171b93f7/core/intogen_core/parsers/vep.py#L49-L56

which filters mutations based on the transcript: if the transcript is in ensembl_canonical_transcripts.tsv the mutation is filtered out, otherwise not.

FedericaBrando commented 7 months ago

Running the PCAWG cohort.

Commit: https://github.com/bbglab/intogen-plus/commit/909bafeb7c16a8a5586025e2fb3ad87a707ff82a

FedericaBrando commented 7 months ago

Waiting for results to be analysed


Removing the filter gives odd errors with lymphoid ttypes, so we decided to keep the filter only for these ttypes: germ_center = ["AML","LY","CLL","MDS","DLBCL","NHLY"]
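
A minimal sketch of how this per-ttype decision could look, assuming the cohort's ttype is available as a string when the VEP output is parsed. The function name `keep_consequence_filter` is hypothetical and is not taken from vep.py; only the `germ_center` list comes from the comment above.

```python
# ttypes for which the consequence filter is kept (list from the comment above).
GERM_CENTER = {"AML", "LY", "CLL", "MDS", "DLBCL", "NHLY"}


def keep_consequence_filter(ttype: str) -> bool:
    """Hypothetical helper: True if the VALID_CONSEQUENCES filter should
    still be applied for a cohort of this ttype."""
    return ttype in GERM_CENTER


# The filter stays on for CLL but is skipped for a non-listed ttype.
decisions = {t: keep_consequence_filter(t) for t in ("CLL", "BILIARY")}
print(decisions)
```

Using a set keeps the membership check cheap and makes it easy to extend the list if other ttypes show the same errors.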

Things to do:

FedericaBrando commented 7 months ago
>>> import bgoncotree.main as BGOncoTree
>>> tree = BGOncoTree('/workspace/datasets/oncotree/oncotree/oncotree.tsv')
>>> [node.id for node, level in tree.iter_from(tree['LYMPH'], descending=True)]
['LYMPH', 'LNM', 'NHL', 'MBN', 'LPL', 'WM', 'THRLBCL', 'SPB', 'MZL', 'SMZL', 'NMZL', 'EMALT', 'SBLU', 'SDRPL', 
'HCL-V', 'PTFL', 'PMBL', 'PLBL', 'PEL', 'PCNSL', 'PCM', 'PCLBCLLT', 'PCFCL', 'MIDD', 'MIDDO', 'MIDDA', 'MHCD', 
'MGUS', 'MGUSIGM', 'MGUSIGG', 'MGUSIGA', 'MCL', 'ISMCL', 'MCBCL', 'LYG', 'LBLIRF4', 'IVBCL', 'FL', 'ISFN', 
'DFL', 'HHV8DLBCL', 'HGBCLMYCBCL2', 'HGBCL', 'HCL', 'GHCD', 'DLBCLNOS', 'GCB', 'ABC', 'EP', 'EBVMCU', 
'EBVDLBCLNOS', 'DLBCLCI', 'CLLSLL', 'BPLL', 'BLL11Q', 'BL', 'BCLU', 'ALKLBCL', 'AHCD', 'MTNN', 'TPLL', 
'TLGL', 'SS', 'SPTCL', 'SEBVTLC', 'PTCL', 'PCSMTPLD', 'PCLPD', 'PCALCL', 'LYP', 'PCGDTCL', 'PCATCL', 'PCAECTCL', 
'NPTLTFH', 'MYCF', 'MEITL', 'ITLPDGI', 'HVLL', 'HSTCL', 'FTCL', 'ENKL', 'EATL', 'CLPDNK', 'ALCL', 'BIALCL', 'ALCLALKP', 
'ALCLALKN', 'ATLL', 'ANKL', 'AITL', 'ALL', 'TLL', 'NKCLL', 'ETPLL', 'BLL', 'BLLRGA', 'BLLTCF3PBX1', 'BLLKMT2A', 
'BLLIL3IGH', 'BLLIAMP21', 'BLLHYPO', 'BLLHYPER', 'BLLETV6RUNX1', 'BLLBCRABL1L', 'BLLBCRABL1', 'BLLNOS', 'PTLD', 
'PPTLD', 'PHPTLD', 'MPTLD', 'IMPTLD', 'FHPTLD', 'CHLPTLD', 'HL', 'CHL', 'NSCHL', 'MCCHL', 'LRCHL', 'LDCHL', 'NLPHL', 'LBGN', 'LATL']
FedericaBrando commented 7 months ago

Run PCAWG_WGS_LYMPH_BNHL .

Results did not change: the drivers are still filtered out because of Signature 9.

What do we do next? Possible solutions:

FedericaBrando commented 7 months ago

IntOGen meeting 29/01/24

FedericaBrando commented 7 months ago

Final list of consequence type:

VALID_CONSEQUENCES = {
    "transcript_ablation", "splice_donor_variant",
    "splice_acceptor_variant", "stop_gained",
    "frameshift_variant", "stop_lost", "start_lost",
    "initiator_codon_variant", "transcript_amplification",
    "feature_elongation", "feature_truncation",
    "inframe_insertion", "inframe_deletion",
    "protein_altering_variant",
    "missense_variant",
    "splice_donor_5th_base_variant",
    "splice_region_variant", "incomplete_terminal_codon_variant",
    "start_retained_variant", "stop_retained_variant",
    "synonymous_variant", "coding_sequence_variant"
}
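
As an illustration of how such an allow-list can be applied, here is a small sketch. It assumes each VEP row exposes a `Consequence` field that may hold several terms joined by a separator (a comma here, which is an assumption), and that a row is kept if at least one term is in the set. The helper name `has_valid_consequence` and the sample rows are hypothetical; the actual logic in vep.py may differ.

```python
VALID_CONSEQUENCES = {
    "transcript_ablation", "splice_donor_variant",
    "splice_acceptor_variant", "stop_gained",
    "frameshift_variant", "stop_lost", "start_lost",
    "initiator_codon_variant", "transcript_amplification",
    "feature_elongation", "feature_truncation",
    "inframe_insertion", "inframe_deletion",
    "protein_altering_variant", "missense_variant",
    "splice_donor_5th_base_variant", "splice_region_variant",
    "incomplete_terminal_codon_variant", "start_retained_variant",
    "stop_retained_variant", "synonymous_variant",
    "coding_sequence_variant",
}


def has_valid_consequence(consequence_field: str, sep: str = ",") -> bool:
    """Hypothetical helper: True if any term in the field is in the allow-list.

    `sep` is an assumption about how multiple consequence terms are joined.
    """
    terms = {term.strip() for term in consequence_field.split(sep)}
    return not terms.isdisjoint(VALID_CONSEQUENCES)


# Made-up example rows, not taken from any cohort.
rows = [
    {"Consequence": "missense_variant"},
    {"Consequence": "intron_variant"},
    {"Consequence": "splice_region_variant,intron_variant"},
]
kept = [row for row in rows if has_valid_consequence(row["Consequence"])]
print(len(kept))
```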
FedericaBrando commented 6 months ago

Implementation of the new VEP consequence filter, checked on the TCGA_WXS_ACC cohort.

In a quick comparison between a run without the new filters and a run with them, both start from the same number of mutations (7163). With the new filter implementation we get 6067, while before we were discarding a few more mutations (6046).

❯ zcat develop_6b42c16f_HART_TCGA/steps/variants/TCGA_WXS_ACC.tsv.gz| wc
   7163   64467  618615
❯ zcat 20231018_TCGA/steps/variants/TCGA_WXS_ACC.tsv.gz| wc
   7163   64467  618615
❯ zcat develop_6b42c16f_HART_TCGA/steps/vep/TCGA_WXS_ACC.tsv.gz| wc
   6067  145608 1413130
❯ zcat 20231018_TCGA/steps/vep/TCGA_WXS_ACC.tsv.gz| wc
   6046  145104 1408372
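
The difference between the two runs can be read straight from the `wc` output above. The sketch below only does that arithmetic; the helper `line_count` is hypothetical, and the counts are line counts, so they may include a header line in each file (which cancels out in the difference).

```python
def line_count(wc_output: str) -> int:
    """Return the first field of a `wc` output line (the number of lines)."""
    return int(wc_output.split()[0])


# Values copied from the wc output above.
variants_in = line_count("   7163   64467  618615")
vep_new = line_count("   6067  145608 1413130")
vep_old = line_count("   6046  145104 1408372")

# Lines kept by the new filter that the old filter discarded.
recovered = vep_new - vep_old
# Lines dropped between the variants step and the vep step in the new run.
dropped_new = variants_in - vep_new

print(recovered, dropped_new)
```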