Gho-Ost / pathogenicity-assessment

Research in classification of pathogenicity in genetic variants

Assessing Pathogenicity in Genetic Variants

Introduction and Problem Description

Assessing the pathogenicity of genetic variants is a classification problem. The classification system was created by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG-AMP). It distinguishes five classes: benign, likely benign, variant of uncertain significance (VUS), likely pathogenic, and pathogenic.
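As a rough illustration, the five classes can be represented as an ordered enumeration, which is convenient when encoding labels for a classifier. This is only a sketch: the names `AcmgClass` and `parse_label`, the integer codes, and the label spellings are assumptions and are not taken from this repository.

```python
from enum import IntEnum


class AcmgClass(IntEnum):
    """Five-tier ACMG-AMP classes, ordered from benign to pathogenic.

    The integer codes are an arbitrary ordinal encoding chosen for this sketch.
    """
    BENIGN = 1
    LIKELY_BENIGN = 2
    VUS = 3
    LIKELY_PATHOGENIC = 4
    PATHOGENIC = 5


# Hypothetical label spellings; the strings used in the actual data may differ.
_LABELS = {
    "benign": AcmgClass.BENIGN,
    "likely benign": AcmgClass.LIKELY_BENIGN,
    "vus": AcmgClass.VUS,
    "likely pathogenic": AcmgClass.LIKELY_PATHOGENIC,
    "pathogenic": AcmgClass.PATHOGENIC,
}


def parse_label(text: str) -> AcmgClass:
    """Map a text label to its class, ignoring case, underscores, and extra spaces."""
    key = " ".join(text.replace("_", " ").lower().split())
    return _LABELS[key]
```

Because the enumeration is ordered, comparisons such as `parse_label("Likely_pathogenic") > AcmgClass.VUS` work directly.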


Paper Results Draft

https://drive.google.com/drive/folders/1ojq0JBdMx_b7wlClk_IITBavSIGLq3df?usp=sharing


Data

Raw data format (vcf) specification: https://samtools.github.io/hts-specs/VCFv4.1.pdf

VEP (CSQ) outputs: http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#defaultout
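For orientation, when VEP writes VCF output it stores its annotations under the `CSQ` key of the INFO column. Each annotation is a `|`-separated list of values, multiple annotations are separated by commas, and the field order is given in the `Format:` part of the `##INFO=<ID=CSQ,...>` header line. The sketch below shows one way to unpack such an entry with the standard library only. The helper names `csq_fields` and `parse_csq` and the example values are made up for illustration and are not part of this repository.

```python
import re


def csq_fields(header_line: str) -> list[str]:
    """Extract the CSQ sub-field names from the ##INFO=<ID=CSQ,...> header line."""
    match = re.search(r'Format: ([^">]+)', header_line)
    if match is None:
        raise ValueError("no 'Format:' section found in the CSQ header line")
    return match.group(1).strip().split("|")


def parse_csq(info: str, fields: list[str]) -> list[dict[str, str]]:
    """Return one dict per annotation found in the CSQ entry of an INFO column."""
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            records = entry[len("CSQ="):].split(",")
            return [dict(zip(fields, record.split("|"))) for record in records]
    return []  # the variant carries no CSQ annotation


# Made-up header and INFO values, shortened to four sub-fields for readability.
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
          'annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL">')
info = "DP=35;CSQ=A|missense_variant|MODERATE|GENE1,A|intron_variant|MODIFIER|GENE1"

annotations = parse_csq(info, csq_fields(header))
```

Here `annotations` holds two dictionaries, one per annotation, each keyed by the four field names from the header.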

To read the data converted to .csv:

from utils.utils import get_dataset

# Load the entire dataset for the selected samples
df = get_dataset("../data/", samples=["EE_015", "EE_050", "EE_069"], file_type="both",
                 option_csq="potential", options_genotype=["potential", "all"],
                 with_default=True)

More examples of loading data: new_read_test.ipynb


File structure


├───archive
└───data
    ├───EE_sample*
    ├───EE_015
    ├───EE_050
    └───EE_069
        ├───EE_069.vcf.gz
        ├───EE_069_default.csv.gz
        ├───EE_069_genotype.csv.gz
        └───EE_069_csq.csv.gz

*EE_sample contains uncompressed files
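The layout above can be turned into file paths with a small helper. This is a minimal sketch that uses only the standard library and does not replace `get_dataset`. The names `sample_files` and `read_header` are hypothetical, and the sketch assumes that the other sample directories mirror `EE_069` and that the .csv files are comma-delimited with a header row.

```python
import csv
import gzip
from pathlib import Path

# File name suffixes as they appear in the tree above.
SUFFIXES = {
    "vcf": ".vcf.gz",
    "default": "_default.csv.gz",
    "genotype": "_genotype.csv.gz",
    "csq": "_csq.csv.gz",
}


def sample_files(data_dir, sample):
    """Build the expected paths of the four compressed files of one sample."""
    sample_dir = Path(data_dir) / sample
    return {kind: sample_dir / f"{sample}{suffix}" for kind, suffix in SUFFIXES.items()}


def read_header(path):
    """Return the column names of a gzip-compressed .csv file (comma delimiter assumed)."""
    with gzip.open(path, "rt", newline="") as handle:
        return next(csv.reader(handle))
```

For example, `sample_files("../data", "EE_069")["csq"]` points at `../data/EE_069/EE_069_csq.csv.gz`, and `read_header` can then be used to inspect its columns before loading the full table.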


Working Documentation