Clinical-Genomics / genmod

Annotate models of genetic inheritance patterns in variant files (vcf files)
http://moonso.github.io/genmod/
MIT License

GENMOD

GENMOD is a simple-to-use command line tool for annotating and analyzing genomic variations in the VCF file format. It can annotate patterns of genetic inheritance in VCF files that contain single or multiple families of arbitrary size.

The tools in the genmod suite are:

Installation:

pip install genmod

or

git clone https://github.com/moonso/genmod.git
cd genmod
python setup.py install

USAGE:

This is an overview; for more in-depth information, see the project documentation.

Example:

The following commands should work once genmod has been installed successfully. The example files are distributed with the package.

$ cat examples/test_vcf.vcf
##fileformat=VCFv4.1
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##contig=<ID=1,length=249250621,assembly=b37>
##reference=file:///humgen/gsa-hpprojects/GATK/bundle/current/b37/human_g1k_v37.fasta
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  father  mother  proband father_2    mother_2    proband_2
1   879537  .   T   C   100 PASS    MQ=1    GT:AD:GQ    0/1:10,10:60    0/1:10,10:60    1/1:10,10:60    0/0:10,10:60    0/1:10,10:60    1/1:10,10:60
1   879541  .   G   A   100 PASS    MQ=1    GT:AD:GQ    ./. 0/1:10,10:60    1/1:10,10:60    ./. 0/1:10,10:60    0/1:10,10:60
1   879595  .   C   T   100 PASS    MQ=1    GT:AD:GQ    0/1:10,10:60    0/0:10,10:60    1/1:10,10:60    0/1:10,10:60    0/0:10,10:60    0/1:10,10:60
1   879676  .   G   A   100 PASS    MQ=1    GT:AD:GQ    0/1:10,10:60    1/1:10,10:60    1/1:10,10:60    0/1:10,10:60    0/1:10,10:60    0/1:10,10:60
1   879911  .   G   A   100 PASS    MQ=1    GT:AD:GQ    0/1:10,10:60    0/0:10,10:60    0/1:10,10:60    0/1:10,10:60    0/0:10,10:60    0/1:10,10:60
1   880012  .   A   G   100 PASS    MQ=1    GT:AD:GQ    0/0:10,10:60    0/1:10,10:60    0/1:10,10:60    0/0:10,10:60    0/1:10,10:60    0/1:10,10:60
1   880086  .   T   C   100 PASS    MQ=1    GT:AD:GQ    0/0:10,10:60    0/0:10,10:60    0/1:10,10:60    0/0:10,10:60    0/0:10,10:60    0/1:10,10:60
1   880199  .   G   A   100 PASS    MQ=1    GT:AD:GQ    0/0:10,10:60    0/0:10,10:60    0/1:10,10:60    0/0:10,10:60    0/0:10,10:60    0/1:10,10:60
1   880217  .   T   G   100 PASS    MQ=1    GT:AD:GQ    0/0:10,10:60    0/0:10,10:60    0/1:10,10:60    0/0:10,10:60    0/0:10,10:60    0/1:10,10:60
10  76154051    .   A   G   100 PASS    MQ=1    GT:AD:GQ    0/0:10,10:60    0/1:10,10:60    0/1:10,10:60    0/0:10,10:60    0/1:10,10:60    0/1:10,10:60
10  76154073    .   T   G   100 PASS    MQ=1    GT:AD:GQ    0/0:10,10:60    0/0:10,10:60    0/1:10,10:60    0/0:10,10:60    0/0:10,10:60    0/1:10,10:60
10  76154074    .   C   G   100 PASS    MQ=1    GT:AD:GQ    ./. 0/1:10,10:60    0/1:10,10:60    0/1:10,10:60    0/1:10,10:60    0/1:10,10:60
10  76154076    .   G   C   100 PASS    MQ=1    GT:AD:GQ    ./. 0/0:10,10:60    0/1:10,10:60    ./. 0/0:10,10:60    0/1:10,10:60
X   302253  .   CCCTCCTGCCCCT   C   100 PASS    MQ=1    GT:AD:GQ    0/0:10,10:60    0/1:10,10:60    1/1:10,10:60    0/0:10,10:60    1/1:10,10:60    1/1:10,10:60
MT  302253  .   CCCTCCTGCCCCT   C   100 PASS    MQ=1    GT:AD:GQ    0/0:10,10:60    0/1:10,10:60    1/1:10,10:60    0/0:10,10:60    1/1:10,10:60    1/1:10,10:60

$ cat examples/test_vcf.vcf |\
>genmod annotate - --annotate-regions |\
>genmod models - --family_file examples/recessive_trio.ped > test_vcf_models_annotated.vcf

$ cat test_vcf_models_annotated.vcf
##fileformat=VCFv4.1
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=Annotation,Number=.,Type=String,Description="Annotates what feature(s) this variant belongs to.">
##INFO=<ID=Exonic,Number=0,Type=Flag,Description="Indicates if the variant is exonic.">
##INFO=<ID=GeneticModels,Number=.,Type=String,Description="':'-separated list of genetic models for this variant.">
##INFO=<ID=ModelScore,Number=.,Type=String,Description="PHRED score for genotype models.">
##INFO=<ID=Compounds,Number=.,Type=String,Description="List of compound pairs for this variant.The list is splitted on ',' family id is separated with compoundswith ':'. Compounds are separated with '|'.">
##contig=<ID=1,length=249250621,assembly=b37>
##reference=file:///humgen/gsa-hpprojects/GATK/bundle/current/b37/human_g1k_v37.fasta
##Software=<ID=genmod,Version=3.0.1,Date="2015-09-22 08:40",CommandLineOptions="processes=4 keyword=Annotation family_type=ped family_file=<open file 'examples/recessive_trio.ped', mode 'r' at 0x102d3a780> variant_file=<_io.TextIOWrapper name='<stdin>' encoding='utf-8'> logger=<logging.Logger object at 0x102d64250>">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  father  mother  proband father_2    mother_2    proband_2
1   879537  .   T   C   100 PASS    MQ=1;Exonic;Annotation=SAMD11;GeneticModels=1:AR_hom;ModelScore=1:55.0  GT:AD:GQ    0/1:10,10:60    0/1:10,10:60    1/1:10,10:60    0/0:10,10:60    0/1:10,10:60    1/1:10,10:60
1   879541  .   G   A   100 PASS    MQ=1;Exonic;Annotation=SAMD11;GeneticModels=1:AR_hom_dn|AR_hom;ModelScore=1:57.0    GT:AD:GQ    ./. 0/1:10,10:60    1/1:10,10:60    ./. 0/1:10,10:60    0/1:10,10:60
1   879595  .   C   T   100 PASS    MQ=1;Exonic;Annotation=NOC2L,SAMD11;GeneticModels=1:AR_hom_dn;ModelScore=1:55.0 GT:AD:GQ    0/1:10,10:60    0/0:10,10:60    1/1:10,10:60    0/1:10,10:60    0/0:10,10:60    0/1:10,10:60
1   879676  .   G   A   100 PASS    MQ=1;Exonic;Annotation=NOC2L,SAMD11 GT:AD:GQ    0/1:10,10:60    1/1:10,10:60    1/1:10,10:60    0/1:10,10:60    0/1:10,10:60    0/1:10,10:60
1   879911  .   G   A   100 PASS    MQ=1;Exonic;Annotation=NOC2L,SAMD11;Compounds=1:1_880086_T_C|1_880012_A_G;GeneticModels=1:AR_comp|AR_comp_dn;ModelScore=1:55.0  GT:AD:GQ    0/1:10,10:60    0/0:10,10:60    0/1:10,10:60    0/1:10,10:60    0/0:10,10:60    0/1:10,10:60
1   880012  .   A   G   100 PASS    MQ=1;Exonic;Annotation=NOC2L;Compounds=1:1_879911_G_A|1_880086_T_C;GeneticModels=1:AR_comp|AR_comp_dn;ModelScore=1:55.0 GT:AD:GQ    0/0:10,10:60    0/1:10,10:60    0/1:10,10:60    0/0:10,10:60    0/1:10,10:60    0/1:10,10:60
1   880086  .   T   C   100 PASS    MQ=1;Exonic;Annotation=NOC2L;Compounds=1:1_879911_G_A|1_880012_A_G;GeneticModels=1:AD_dn|AR_comp_dn;ModelScore=1:55.0   GT:AD:GQ    0/0:10,10:60    0/0:10,10:60    0/1:10,10:60    0/0:10,10:60    0/0:10,10:60    0/1:10,10:60
1   880199  .   G   A   100 PASS    MQ=1;Annotation=NOC2L;GeneticModels=1:AD_dn;ModelScore=1:55.0   GT:AD:GQ    0/0:10,10:60    0/0:10,10:60    0/1:10,10:60    0/0:10,10:60    0/0:10,10:60    0/1:10,10:60
1   880217  .   T   G   100 PASS    MQ=1;Annotation=NOC2L;GeneticModels=1:AD_dn;ModelScore=1:55.0   GT:AD:GQ    0/0:10,10:60    0/0:10,10:60    0/1:10,10:60    0/0:10,10:60    0/0:10,10:60    0/1:10,10:60
10  76154051    .   A   G   100 PASS    MQ=1;Exonic;Annotation=ADK;Compounds=1:10_76154073_T_G;GeneticModels=1:AR_comp_dn;ModelScore=1:55.0 GT:AD:GQ    0/0:10,10:60    0/1:10,10:60    0/1:10,10:60    0/0:10,10:60    0/1:10,10:60    0/1:10,10:60
10  76154073    .   T   G   100 PASS    MQ=1;Exonic;Annotation=ADK;Compounds=1:10_76154051_A_G;GeneticModels=1:AD_dn|AR_comp_dn;ModelScore=1:55.0   GT:AD:GQ    0/0:10,10:60    0/0:10,10:60    0/1:10,10:60    0/0:10,10:60    0/0:10,10:60    0/1:10,10:60
10  76154074    .   C   G   100 PASS    MQ=1;Annotation=ADK GT:AD:GQ    ./. 0/1:10,10:60    0/1:10,10:60    0/1:10,10:60    0/1:10,10:60    0/1:10,10:60
10  76154076    .   G   C   100 PASS    MQ=1;Annotation=ADK;GeneticModels=1:AD_dn|AD;ModelScore=1:57.0  GT:AD:GQ    ./. 0/0:10,10:60    0/1:10,10:60    ./. 0/0:10,10:60    0/1:10,10:60
X   302253  .   CCCTCCTGCCCCT   C   100 PASS    MQ=1;Annotation=PPP2R3B;GeneticModels=1:XD|XR;ModelScore=1:55.0 GT:AD:GQ    0/0:10,10:60    0/1:10,10:60    1/1:10,10:60    0/0:10,10:60    1/1:10,10:60    1/1:10,10:60
MT  302253  .   CCCTCCTGCCCCT   C   100 PASS    MQ=1;GeneticModels=1:AR_hom_dn;ModelScore=1:55.0    GT:AD:GQ    0/0:10,10:60    0/1:10,10:60    1/1:10,10:60    0/0:10,10:60    1/1:10,10:60    1/1:10,10:60
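
The annotations that genmod writes to the INFO column can be read back with a few lines of Python. The sketch below is not part of genmod; the helper names parse_info and per_family are hypothetical. It assumes the layout seen in the output above: entries separated by ';', a family id followed by ':', and the values for that family separated by '|'. The Compounds header states that families are separated by ','; applying the same rule to GeneticModels is an assumption.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict. Flags such as 'Exonic' map to True."""
    fields = {}
    for entry in info.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


def per_family(info_fields, key):
    """Return {family_id: [values]} for keys laid out as 'fam:val1|val2,fam2:...'.

    Assumption: families are separated by ',' for every such key.
    """
    raw = info_fields.get(key)
    if not isinstance(raw, str):
        return {}
    result = {}
    for family_entry in raw.split(","):
        family_id, _, values = family_entry.partition(":")
        result[family_id] = values.split("|")
    return result


# INFO column of the variant at 1:879911 in the annotated output above
info = ("MQ=1;Exonic;Annotation=NOC2L,SAMD11;"
        "Compounds=1:1_880086_T_C|1_880012_A_G;"
        "GeneticModels=1:AR_comp|AR_comp_dn;ModelScore=1:55.0")

fields = parse_info(info)
print(per_family(fields, "GeneticModels"))
print(per_family(fields, "Compounds"))
```

Note that Annotation is a plain comma-separated list of features, not a per-family value, so it should be split with fields["Annotation"].split(",") rather than per_family.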

The basic idea behind genmod is to make the analysis of VCF variants for rare disease fast and easy. It can also be useful in other cases, such as annotating which genetic regions the variants in a bacterium belong to. genmod can annotate accurate patterns of inheritance in families of arbitrary size. The genetic models checked are the basic Mendelian ones; the example output above contains the tags AR_hom, AR_hom_dn, AR_comp, AR_comp_dn, AD, AD_dn, XD and XR.

genmod is made to work on any type of annotated VCF. To find relevant autosomal compound heterozygotes, we need to know which genetic regions the variants belong to. We can use annotations from the Variant Effect Predictor or let genmod do the annotation.
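
Why the region annotation matters for compound heterozygotes can be sketched in Python: two heterozygous variants can only form a compound pair if they fall in the same gene, so the variants have to be grouped by gene first. This is a deliberately simplified illustration, not genmod's implementation. It looks only at the proband's genotype, while a real check also has to take the genotypes of the other family members into account. The function names are hypothetical; the variant ids, gene names and genotypes are taken from the example files above.

```python
from collections import defaultdict
from itertools import combinations

# (variant id, gene annotations, proband genotype), from the example above
variants = [
    ("1_879537_T_C", ["SAMD11"], "1/1"),
    ("1_879911_G_A", ["NOC2L", "SAMD11"], "0/1"),
    ("1_880012_A_G", ["NOC2L"], "0/1"),
    ("1_880086_T_C", ["NOC2L"], "0/1"),
    ("10_76154051_A_G", ["ADK"], "0/1"),
    ("10_76154073_T_G", ["ADK"], "0/1"),
]


def is_heterozygous(genotype):
    """True for calls such as 0/1 or 1|0; False for 0/0, 1/1 and missing calls."""
    alleles = genotype.replace("|", "/").split("/")
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]


def compound_candidates(variants):
    """Group heterozygous variants by gene and return candidate pairs per gene."""
    by_gene = defaultdict(list)
    for variant_id, genes, genotype in variants:
        if is_heterozygous(genotype):
            for gene in genes:
                by_gene[gene].append(variant_id)
    # A gene needs at least two heterozygous variants to yield a pair
    return {
        gene: list(combinations(ids, 2))
        for gene, ids in by_gene.items()
        if len(ids) >= 2
    }


pairs = compound_candidates(variants)
for gene, gene_pairs in pairs.items():
    print(gene, gene_pairs)
```

Without the gene annotation there is nothing to group on, which is why the regions have to be annotated before the models.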

genmod comes with an annotation set built from Ensembl. Either the 37 or the 38 build can be used; see genmod annotate --help. Any annotation in the BED format can be used.
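
As an illustration of what a region annotation in BED format provides, the following sketch looks up which named regions cover a variant position. It is not genmod code: the function names are hypothetical, and the coordinates and region names in the sample data are invented. The one detail worth noting is the coordinate convention: BED intervals are 0-based and half-open, while VCF positions are 1-based, so a position p lies inside a region when start < p <= end.

```python
import io


def read_bed(handle):
    """Read (chrom, start, end, name) tuples from a tab-separated BED file."""
    regions = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments and UCSC header lines
        cols = line.split("\t")
        name = cols[3] if len(cols) > 3 else "."
        regions.append((cols[0], int(cols[1]), int(cols[2]), name))
    return regions


def regions_for(regions, chrom, pos):
    """Names of all regions that cover the 1-based VCF position `pos`."""
    return sorted(
        {name for c, start, end, name in regions
         if c == chrom and start < pos <= end}
    )


# Invented example data; a real file would be passed with open(path)
bed_text = "1\t100\t200\tGENE_A\n1\t150\t300\tGENE_B\n2\t100\t200\tGENE_C\n"
regions = read_bed(io.StringIO(bed_text))

print(regions_for(regions, "1", 160))
print(regions_for(regions, "1", 100))
```

The linear scan keeps the example short; for whole-genome annotation an interval tree or a sorted index per chromosome would be the usual choice.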

(There are files for testing the following commands in genmod/examples)

To annotate the variants with user-defined regions, use

$ genmod annotate <vcf_file> -r/--annotate-regions --region-file path_to_regions.bed

Now the variants are ready to get their models annotated:

$ genmod models <vcf_file> -f/--family_file <family.ped>
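
The family file tells genmod how the samples in the VCF are related and who is affected. The sketch below reads the common six-column PED layout (family id, sample id, father, mother, sex, phenotype, where phenotype 2 means affected and 1 unaffected). The contents of examples/recessive_trio.ped are not shown in this README, so the trio below is an assumed example that reuses the sample names from the VCF header; read_ped and affected are hypothetical helpers, not genmod functions.

```python
from collections import namedtuple

Individual = namedtuple(
    "Individual", "family_id sample_id father mother sex phenotype"
)


def read_ped(lines):
    """Parse whitespace-separated PED lines into Individual tuples."""
    individuals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.split()
        if len(cols) < 6:
            raise ValueError("expected at least 6 columns: %r" % line)
        individuals.append(Individual(*cols[:6]))
    return individuals


def affected(individuals):
    """Sample ids of individuals with phenotype '2' (affected)."""
    return [ind.sample_id for ind in individuals if ind.phenotype == "2"]


# Assumed example trio, not the contents of the distributed file
ped_text = """#FamilyID SampleID Father Mother Sex Phenotype
1 proband father mother 1 2
1 mother 0 0 2 1
1 father 0 0 1 1
"""

family = read_ped(ped_text.splitlines())
print(affected(family))
```

The sample ids in the family file have to match the sample columns of the VCF, otherwise the genotypes cannot be assigned to family members.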