Closed FedericaBrando closed 6 months ago
By commenting out the following lines: https://github.com/bbglab/intogen-plus/blob/7564e7c97fd9d69e29f86a5ec527c680171b93f7/core/intogen_core/parsers/vep.py#L45-L49
I ran a one-cohort test with PCAWG_BILIARY_ADENOCA.
Some of the missing variants were recovered; others were removed by the second filtering check,
which filters mutations by transcript: if the transcript is in ensembl_canonical_transcripts.tsv
the mutation is filtered out, otherwise not.
Running the PCAWG cohort.
Commit: https://github.com/bbglab/intogen-plus/commit/909bafeb7c16a8a5586025e2fb3ad87a707ff82a
Waiting for results to be analysed
Removing the filter produces odd errors with lymphoid tumor types (ttype), so we decided to keep the filter only for these ttypes: germ_center = ["AML","LY","CLL","MDS","DLBCL","NHLY"]
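A minimal sketch of that decision, assuming the consequence filter is switched on or off per cohort from its tumor type. The helper name `keep_consequence_filter` and passing the tumor type as a plain string are hypothetical; only the `germ_center` list comes from this thread, and how intogen-plus actually wires this is not shown here.

```python
# Tumor types for which the consequence filter is kept (list taken from this thread).
germ_center = ["AML", "LY", "CLL", "MDS", "DLBCL", "NHLY"]


def keep_consequence_filter(ttype: str) -> bool:
    """Return True when the consequence filter should still be applied.

    Hypothetical helper, for illustration only.
    """
    return ttype in germ_center


print(keep_consequence_filter("DLBCL"))  # a type listed in germ_center
print(keep_consequence_filter("ACC"))    # a type not in the list
```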
Things to do:
>>> from bgoncotree.main import BGOncoTree
>>> tree = BGOncoTree('/workspace/datasets/oncotree/oncotree/oncotree.tsv')
>>> [node.id for node, level in tree.iter_from(tree['LYMPH'], descending=True)]
['LYMPH', 'LNM', 'NHL', 'MBN', 'LPL', 'WM', 'THRLBCL', 'SPB', 'MZL', 'SMZL', 'NMZL', 'EMALT', 'SBLU', 'SDRPL',
'HCL-V', 'PTFL', 'PMBL', 'PLBL', 'PEL', 'PCNSL', 'PCM', 'PCLBCLLT', 'PCFCL', 'MIDD', 'MIDDO', 'MIDDA', 'MHCD',
'MGUS', 'MGUSIGM', 'MGUSIGG', 'MGUSIGA', 'MCL', 'ISMCL', 'MCBCL', 'LYG', 'LBLIRF4', 'IVBCL', 'FL', 'ISFN',
'DFL', 'HHV8DLBCL', 'HGBCLMYCBCL2', 'HGBCL', 'HCL', 'GHCD', 'DLBCLNOS', 'GCB', 'ABC', 'EP', 'EBVMCU',
'EBVDLBCLNOS', 'DLBCLCI', 'CLLSLL', 'BPLL', 'BLL11Q', 'BL', 'BCLU', 'ALKLBCL', 'AHCD', 'MTNN', 'TPLL',
'TLGL', 'SS', 'SPTCL', 'SEBVTLC', 'PTCL', 'PCSMTPLD', 'PCLPD', 'PCALCL', 'LYP', 'PCGDTCL', 'PCATCL', 'PCAECTCL',
'NPTLTFH', 'MYCF', 'MEITL', 'ITLPDGI', 'HVLL', 'HSTCL', 'FTCL', 'ENKL', 'EATL', 'CLPDNK', 'ALCL', 'BIALCL', 'ALCLALKP',
'ALCLALKN', 'ATLL', 'ANKL', 'AITL', 'ALL', 'TLL', 'NKCLL', 'ETPLL', 'BLL', 'BLLRGA', 'BLLTCF3PBX1', 'BLLKMT2A',
'BLLIL3IGH', 'BLLIAMP21', 'BLLHYPO', 'BLLHYPER', 'BLLETV6RUNX1', 'BLLBCRABL1L', 'BLLBCRABL1', 'BLLNOS', 'PTLD',
'PPTLD', 'PHPTLD', 'MPTLD', 'IMPTLD', 'FHPTLD', 'CHLPTLD', 'HL', 'CHL', 'NSCHL', 'MCCHL', 'LRCHL', 'LDCHL', 'NLPHL', 'LBGN', 'LATL']
Run PCAWG_WGS_LYMPH_BNHL.
Results did not change: the drivers are still filtered out because of Signature 9.
What do we do next? Possible solutions:
Final list of consequence types:
VALID_CONSEQUENCES = {
"transcript_ablation", "splice_donor_variant",
"splice_acceptor_variant", "stop_gained",
"frameshift_variant", "stop_lost", "start_lost",
"initiator_codon_variant", "transcript_amplification",
"feature_elongation", "feature_truncation",
"inframe_insertion", "inframe_deletion",
"protein_altering_variant",
"missense_variant",
"splice_donor_5th_base_variant",
"splice_region_variant", "incomplete_terminal_codon_variant",
"start_retained_variant", "stop_retained_variant",
"synonymous_variant", "coding_sequence_variant"
}
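To illustrate how a set like this can act as a filter, here is a minimal sketch. It assumes VEP's tab-separated output, where the `Consequence` field may hold several comma-separated terms, and it keeps a row when any term is in the set. The names `EXAMPLE_CONSEQUENCES` and `has_valid_consequence`, the shortened set, and the example rows are hypothetical; this is not the code in `vep.py`.

```python
# Shortened, illustrative subset of the consequence list above.
EXAMPLE_CONSEQUENCES = {"missense_variant", "stop_gained", "synonymous_variant"}


def has_valid_consequence(consequence_field: str) -> bool:
    """True if any comma-separated term in the field is in EXAMPLE_CONSEQUENCES."""
    terms = consequence_field.split(",")
    return any(term.strip() in EXAMPLE_CONSEQUENCES for term in terms)


# Made-up rows standing in for parsed VEP output lines.
rows = [
    {"Feature": "ENST_A", "Consequence": "missense_variant,splice_region_variant"},
    {"Feature": "ENST_B", "Consequence": "intron_variant"},
]

# Keep only the rows with at least one accepted consequence term.
kept = [row for row in rows if has_valid_consequence(row["Consequence"])]
```

Whether a row should pass on any matching term or only on its most severe term is a design choice; the sketch uses the more permissive option.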
TCGA_WXS_ACC cohort.
A quick comparison between a run without the new filters and a run with them: both start from the same number of mutations (7163). With the new filter implementation we keep 6067, whereas before we were discarding a few more mutations and kept 6046.
❯ zcat develop_6b42c16f_HART_TCGA/steps/variants/TCGA_WXS_ACC.tsv.gz| wc
7163 64467 618615
❯ zcat 20231018_TCGA/steps/variants/TCGA_WXS_ACC.tsv.gz| wc
7163 64467 618615
❯ zcat develop_6b42c16f_HART_TCGA/steps/vep/TCGA_WXS_ACC.tsv.gz| wc
6067 145608 1413130
❯ zcat 20231018_TCGA/steps/vep/TCGA_WXS_ACC.tsv.gz| wc
6046 145104 1408372
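The same line counts can also be obtained from Python. This is a minimal sketch, assuming gzip-compressed text files like the ones above; `count_lines` is a hypothetical helper, and the count includes any header line, just as the `wc` output does.

```python
import gzip


def count_lines(path: str) -> int:
    """Count the lines of a gzip-compressed text file, header included."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle)
```

For example, `count_lines("20231018_TCGA/steps/vep/TCGA_WXS_ACC.tsv.gz")` could be compared against the count for the same file in the `develop_6b42c16f_HART_TCGA` run.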
In the vep.py script there is a set, VALID_CONSEQUENCES, that is used as a filter. It was implemented in 2018, but we do not know why.
https://bitbucket.org/intogen/intogen-plus/src/c02f04a6e39ed0c739885ea4779f2bd9ceeb7800/filters/vep.py#lines-6
https://github.com/bbglab/intogen-plus/blob/7564e7c97fd9d69e29f86a5ec527c680171b93f7/core/intogen_core/parsers/vep.py#L13-L18
Ideally we want to get rid of this filter and run boostdm on the unfiltered mutation list. The test will be done on PCAWG cohorts, to see whether removing this filter makes any of the methods that take the output of vep.py as input fail.
This issue came up because, in the benchmarking of boostDM, the mutation.tsv files from boostdm2020 and boostdm2023 did not contain the same mutations. The reason is that the mutation.tsv for Boostdm2020 was not taken from the output of Intogen2020, because that step was not implemented yet; it was taken from the dndscv data instead.
For Release2023 of BoostDM, the mutation.tsv used was taken directly from the output of intogen, which keeps this consequence-type filter that should not be there.