AstraZeneca-NGS / reference_data

Reference data: BED files, genes, transcripts, variations.

Reference data

Capture region BED files

This repository collects commonly used capture region BED files, which are installed and available for use in bcbio analyses. It includes files for both hg19 (chr1, chr2, chr3, ... naming) and GRCh37 (1, 2, 3, ... naming).
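The two naming styles differ only in the contig column for the primary chromosomes, which can be illustrated with a minimal Python sketch. The function names below are hypothetical and not part of this repository. The sketch assumes tab-separated BED records and deliberately leaves alternate and unplaced contigs untouched, since their names do not map by simply dropping the `chr` prefix. Note also that hg19 `chrM` and GRCh37 `MT` are different mitochondrial sequences, so renaming alone does not make mitochondrial coordinates interchangeable.

```python
# Hypothetical helper names; a sketch, not the repository's own tooling.
PRIMARY_CONTIGS = {f"chr{n}" for n in range(1, 23)} | {"chrX", "chrY"}


def hg19_to_grch37_contig(name: str) -> str:
    """Map an hg19-style contig name to GRCh37 style for primary chromosomes."""
    if name == "chrM":
        # Name mapping only: the underlying mitochondrial sequences differ.
        return "MT"
    if name in PRIMARY_CONTIGS:
        return name[3:]  # drop the "chr" prefix
    return name  # alternate/unplaced contigs are left unchanged


def convert_bed_line(line: str) -> str:
    """Rewrite the contig column of one BED record; pass header lines through."""
    if not line.strip() or line.startswith(("#", "track", "browser")):
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    fields[0] = hg19_to_grch37_contig(fields[0])
    return "\t".join(fields) + newline


# Illustrative record; the coordinates are placeholders, not taken from this repository.
print(convert_bed_line("chr17\t100\t200\tregion1\n"), end="")
```

In practice it is safer to use the BED file that already matches the reference build of the analysis than to convert between naming styles on the fly.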

Canonical transcripts

Files under transcripts/cancer_transcripts_*_ensembl.txt contain the IDs of canonical (longest) transcripts, which the SnpEff variant prediction tool uses when it is run with the -canon flag. This applies only to the Ensembl-based versions of the reference databases, named GRCh37.* and GRCh38.* in SnpEff notation (for example, GRCh37.75). Since not all IDs in the list represent the most cancer-relevant isoforms, transcripts/canon_cancer_replacement.txt provides a map of replacement transcripts, which is passed to SnpEff with the -canonList option:

java -jar snpEff.jar GRCh37.75 test.vcf -canon -canonList transcripts/canon_cancer_replacement.txt

To use the canonical transcripts for variant annotation in bcbio, add the following into your configuration YAML file:

algorithm:
  effects_transcripts: canonical

To use the cancer transcripts, use the following:

algorithm:
  effects_transcripts: canonical_cancer

The full list of genes with replaced transcripts:

AKT1     ENST00000555528
BRCA1    ENST00000357654
CD79B    ENST00000006750
CDKN2A   ENST00000304494
CHEK1    ENST00000534070
CHEK2    ENST00000328354
ESR1     ENST00000206249
FANCL    ENST00000233741
FGFR1    ENST00000447712
FGFR2    ENST00000457416
FGFR3    ENST00000440486
MET      ENST00000397752
MYD88    ENST00000396334
PPP2R2A  ENST00000380737
RAD51D   ENST00000345365
RAD54L   ENST00000371975
GNAS     ENST00000371085
TP53     ENST00000269305
ARID1B   ENST00000350026
TET2     ENST00000380013
CEBPA    ENST00000498907
PIK3C2G  ENST00000538779
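
A listing like the one above can be sanity-checked before it is handed to a downstream tool. The following is a minimal Python sketch that assumes a two-column, whitespace-separated layout (gene symbol, then Ensembl transcript ID) as shown here. The actual layout of transcripts/canon_cancer_replacement.txt may differ, and the function name `parse_replacements` is hypothetical.

```python
import re

# Ensembl transcript IDs as they appear in the listing: "ENST" plus 11 digits.
ENST_PATTERN = re.compile(r"ENST\d{11}")


def parse_replacements(text: str) -> dict:
    """Parse 'GENE TRANSCRIPT' lines into a gene -> transcript mapping."""
    replacements = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 columns, got {len(fields)}")
        gene, transcript = fields
        if not ENST_PATTERN.fullmatch(transcript):
            raise ValueError(f"line {lineno}: unexpected transcript ID {transcript!r}")
        if gene in replacements:
            raise ValueError(f"line {lineno}: duplicate gene {gene!r}")
        replacements[gene] = transcript
    return replacements


# A few rows copied from the listing above.
listing = """\
AKT1     ENST00000555528
BRCA1    ENST00000357654
TP53     ENST00000269305
"""

replacements = parse_replacements(listing)
print(replacements["TP53"])
```

Rejecting duplicate genes and malformed IDs up front keeps a typo in the list from silently changing which isoform is annotated.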