1. Any variant found in the regulatory_regions db table (known regulatory regions from FANTOM and the Ensembl regulatory build) AND with an effect of INTERGENIC_VARIANT or UPSTREAM_GENE_VARIANT has its effect changed to REGULATORY_REGION_VARIANT. This stops it being removed in step 3 (not sure whether Jannovar ever assigns this effect itself; I don't think so).
2. Any variant with an effect of REGULATORY_REGION_VARIANT gets reassigned to the gene in the TAD with the best pheno score.
3. The regulatoryFeature filter removes any variant with an effect of INTERGENIC_VARIANT or UPSTREAM_GENE_VARIANT AND >= 20 kb away from a gene.
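To make the order of operations concrete, here is a rough Python sketch of the three steps as described above. It is not the Exomiser code: the `Variant` fields, the function names and the boolean standing in for the regulatory_regions table lookup are all hypothetical, and the 20 kb cut-off is taken straight from the description.

```python
from dataclasses import dataclass
from typing import Dict, List

REGULATORY_CANDIDATES = {"INTERGENIC_VARIANT", "UPSTREAM_GENE_VARIANT"}
MAX_GENE_DISTANCE = 20_000  # bp; the 20 kb cut-off described in step 3


@dataclass
class Variant:
    effect: str
    gene: str
    distance_to_gene: int       # bp to the nearest gene (hypothetical field)
    in_regulatory_region: bool  # stand-in for a regulatory_regions table lookup
    tad_genes: List[str]        # genes in the same TAD (hypothetical field)


def flag_regulatory(v: Variant) -> None:
    """Step 1: relabel intergenic/upstream variants in known regulatory regions."""
    if v.in_regulatory_region and v.effect in REGULATORY_CANDIDATES:
        v.effect = "REGULATORY_REGION_VARIANT"


def reassign_to_best_tad_gene(v: Variant, pheno_scores: Dict[str, float]) -> None:
    """Step 2: move regulatory variants to the best-scoring gene in the TAD."""
    if v.effect == "REGULATORY_REGION_VARIANT" and v.tad_genes:
        v.gene = max(v.tad_genes, key=lambda g: pheno_scores.get(g, 0.0))


def passes_regulatory_feature_filter(v: Variant) -> bool:
    """Step 3: drop still-unflagged intergenic/upstream variants >= 20 kb from a gene."""
    return not (
        v.effect in REGULATORY_CANDIDATES
        and v.distance_to_gene >= MAX_GENE_DISTANCE
    )


def run_current_behaviour(variants: List[Variant],
                          pheno_scores: Dict[str, float]) -> List[Variant]:
    """Apply steps 1-3 in order and return the variants that survive."""
    kept = []
    for v in variants:
        flag_regulatory(v)
        reassign_to_best_tad_gene(v, pheno_scores)
        if passes_regulatory_feature_filter(v):
            kept.append(v)
    return kept
```

The point of the sketch is that step 1 changes the effect before step 3 checks it, so a distant intergenic variant inside a FANTOM/Ensembl feature survives the filter while an otherwise identical variant outside one does not.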
Max's preferred behaviour
1. Reassign variants to the best gene in the TAD for most nc variants:
   - CODING_TRANSCRIPT_INTRON_VARIANT
   - CONSERVED_INTERGENIC_VARIANT
   - CONSERVED_INTRON_VARIANT
   - DOWNSTREAM_GENE_VARIANT
   - INTERGENIC_REGION
   - INTERGENIC_VARIANT
   - INTRAGENIC_VARIANT
   - INTRON_VARIANT
   - NON_CODING_TRANSCRIPT_INTRON_VARIANT
   - REGULATORY_REGION_VARIANT
   - TF_BINDING_SITE_VARIANT
   - UPSTREAM_GENE_VARIANT
2. Don't filter variants based on being >= 20 kb from a gene and not in a FANTOM/Ensembl reg build feature; instead, remove variants with a ReMM score < 0.5.
I think we can achieve (2) with minimal code changes by introducing a pathogenicityScoreFilter or remmScoreFilter, so users can optionally skip the regulatoryFeatureFilter and use a remmScore > 0.5 filter instead. This way, variants in FANTOM/Ensembl regulatory regions will still get flagged as REGULATORY_REGION_VARIANT for display purposes.
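A rough Python sketch of what "choose one filter or the other" could look like. Everything here is hypothetical (the `Variant` fields, the function names, the way the threshold is passed in) and it is not the Exomiser filter API. Two assumptions to flag: a score of exactly 0.5 is kept, and the ReMM filter is applied to every variant it is given, whereas a real filter would presumably only look at non-coding variants.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Variant:
    effect: str
    distance_to_gene: int  # bp to the nearest gene (hypothetical field)
    remm_score: float      # ReMM score in [0, 1]


VariantFilter = Callable[[Variant], bool]  # returns True to keep the variant


def regulatory_feature_filter(v: Variant) -> bool:
    """Old behaviour: drop intergenic/upstream variants >= 20 kb from a gene."""
    distant = v.distance_to_gene >= 20_000
    return not (v.effect in {"INTERGENIC_VARIANT", "UPSTREAM_GENE_VARIANT"} and distant)


def make_remm_score_filter(threshold: float = 0.5) -> VariantFilter:
    """Proposed alternative: keep variants whose ReMM score reaches the threshold.

    Keeping a score exactly equal to the threshold is an assumption.
    """
    return lambda v: v.remm_score >= threshold


def apply_filter(variants: Iterable[Variant], keep: VariantFilter) -> List[Variant]:
    """Return only the variants the chosen filter keeps."""
    return [v for v in variants if keep(v)]
```

Because both filters share the same shape, a user could swap one for the other in an analysis and compare the results on the same input.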
Still need to decide whether to update the list of variant effects for TAD gene reassignment or maybe make it user-configurable?
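If we went the user-configurable route, one option is a default set of effects that can be overridden per analysis. The Python sketch below is only an illustration: the function and parameter names are made up, and the default set is simply Max's list from this issue.

```python
from typing import Dict, FrozenSet, List

# Default taken from the list of nc effects proposed in this issue.
DEFAULT_TAD_REASSIGNMENT_EFFECTS: FrozenSet[str] = frozenset({
    "CODING_TRANSCRIPT_INTRON_VARIANT",
    "CONSERVED_INTERGENIC_VARIANT",
    "CONSERVED_INTRON_VARIANT",
    "DOWNSTREAM_GENE_VARIANT",
    "INTERGENIC_REGION",
    "INTERGENIC_VARIANT",
    "INTRAGENIC_VARIANT",
    "INTRON_VARIANT",
    "NON_CODING_TRANSCRIPT_INTRON_VARIANT",
    "REGULATORY_REGION_VARIANT",
    "TF_BINDING_SITE_VARIANT",
    "UPSTREAM_GENE_VARIANT",
})


def best_tad_gene(
    current_gene: str,
    effect: str,
    tad_genes: List[str],
    pheno_scores: Dict[str, float],
    reassign_effects: FrozenSet[str] = DEFAULT_TAD_REASSIGNMENT_EFFECTS,
) -> str:
    """Return the gene the variant should be assigned to.

    Variants whose effect is not in reassign_effects keep their current gene;
    otherwise the gene in the TAD with the highest pheno score wins
    (genes without a score are treated as 0.0, which is an assumption).
    """
    if effect not in reassign_effects or not tad_genes:
        return current_gene
    return max(tad_genes, key=lambda g: pheno_scores.get(g, 0.0))
```

Passing `reassign_effects=frozenset({"REGULATORY_REGION_VARIANT"})` would reproduce the current, narrower reassignment rule, so both behaviours stay available.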
If we do it this way, with the old behaviour still possible, then we don't need to worry so much about repeating the whole simulated genomes benchmarking we did in the original paper, i.e. users could test both options on their own datasets and make a decision based on compute time, number of variants returned and identification of known diagnostic nc variants.
First attempt at running it with Max's suggestions on a WGS of 6,943,867 variants:
Took 50 mins to run with 50 GB. Output results for 2,357 genes and 23,367 variants, compared to 4,521 genes and 43,884 variants when running it the usual way in 32 mins.
New top hit: 2 29538028 G C intronic variant now assigned to C2orf71 rather than ALK
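For a quick side-by-side of the two runs, the numbers reported above can be put into ratios. This is just arithmetic on the figures in this comment; the dictionary keys are made up for the example.

```python
# Figures as reported for the single WGS sample in this comment.
new_run = {"minutes": 50, "genes": 2357, "variants": 23367}
usual_run = {"minutes": 32, "genes": 4521, "variants": 43884}
input_variants = 6_943_867


def ratio(key: str) -> float:
    """New run relative to the usual run for the given measure."""
    return new_run[key] / usual_run[key]


for key in ("minutes", "genes", "variants"):
    print(f"{key}: {ratio(key):.2f}x the usual run")

# Fraction of the input variants that appear in each output.
for name, run in (("new", new_run), ("usual", usual_run)):
    print(f"{name}: {run['variants'] / input_variants:.2%} of input variants returned")
```

In other words, the new settings took roughly 1.6 times as long and returned roughly half as many genes and variants on this one sample.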
@visze @julesjacobsen @pnrobinson I combined the 3 prev issues discussing this into one new issue as they are all inter-related and to simplify the discussion!
Current nc variant behaviour is:
Peter's preferred display behaviour