bioinfo-chru-strasbourg / howard

Highly Open Workflow for Annotation & Ranking toward genomic variant Discovery
GNU Affero General Public License v3.0

Add transcripts level #256

Open antonylebechec opened 1 month ago

antonylebechec commented 1 month ago

To explore the transcript information related to each variant, and especially to calculate scores, we need to create a "transcript view". It can be another table or a view (e.g. "transcripts") in which each line corresponds to a transcript (i.e. multiple lines per variant). A transcript ID column is needed as a unique key.

antonylebechec commented 1 month ago

To create a transcript view, some parameters are needed. As an example, this param identifies the table to generate (transcripts) and a structure describing the columns dedicated to transcripts:

{
  "transcripts": {
    "table": "transcripts",
    "struct": {
      "from_column_format": [
        {
          "transcripts_column": "ANN",
          "transcripts_infos_column": "Feature_ID"
        }
      ],
      "from_columns_map": [
        {
          "transcripts_column": "Ensembl_transcriptid",
          "transcripts_infos_columns": [
            "genename",
            "Ensembl_geneid",
            "LIST_S2_score",
            "LIST_S2_pred"
          ]
        },
        {
          "transcripts_column": "Ensembl_transcriptid",
          "transcripts_infos_columns": [
            "genename",
            "VARITY_R_score",
            "Aloft_pred"
          ]
        }
      ]
    }
  }
}

This param is used with function Variants.create_transcript_view() to generate a transcripts table:

   #CHROM       POS REF ALT       transcript     transcript_1 AAposAAlength Distance Allele Aloft_pred          HGVSc  ... cDNAposcDNAlength    genename       FeatureID LIST_S2_pred ERRORSWARNINGSINFO VARITY_R_score      GeneID                          Annotation  GeneName_1        HGVSp AnnotationImpact
0    chr1     28736   A   C      NR_024540.1      NR_024540.1          None     None      C       None    n.50+585T>G  ...              None      WASH7P     NR_024540.1         None               None           None      WASH7P                      intron_variant      WASH7P         None         MODIFIER
1    chr1     28736   A   C      NR_036051.1      NR_036051.1          None   1630.0      C       None     n.-1630A>C  ...              None   MIR1302-2     NR_036051.1         None               None           None   MIR1302-2               upstream_gene_variant   MIR1302-2         None         MODIFIER
2    chr1     28736   A   C      NR_036266.1      NR_036266.1          None   1630.0      C       None     n.-1630A>C  ...              None   MIR1302-9     NR_036266.1         None               None           None   MIR1302-9               upstream_gene_variant   MIR1302-9         None         MODIFIER
3    chr1     28736   A   C      NR_036267.1      NR_036267.1          None   1630.0      C       None     n.-1630A>C  ...              None  MIR1302-10     NR_036267.1         None               None           None  MIR1302-10               upstream_gene_variant  MIR1302-10         None         MODIFIER
4    chr1     28736   A   C      NR_036268.1      NR_036268.1          None   1630.0      C       None     n.-1630A>C  ...              None  MIR1302-11     NR_036268.1         None               None           None  MIR1302-11               upstream_gene_variant  MIR1302-11         None         MODIFIER
5    chr1     35144   A   C      NR_026818.1      NR_026818.1          None     None      C       None       n.597T>G  ...              None     FAM138A     NR_026818.1         None               None           None     FAM138A  non_coding_transcript_exon_variant     FAM138A         None         MODIFIER
6    chr1     35144   A   C      NR_026820.1      NR_026820.1          None     None      C       None       n.597T>G  ...              None     FAM138F     NR_026820.1         None               None           None     FAM138F  non_coding_transcript_exon_variant     FAM138F         None         MODIFIER
7    chr1     35144   A   C      NR_026822.1      NR_026822.1          None     None      C       None       n.597T>G  ...              None     FAM138C     NR_026822.1         None               None           None     FAM138C  non_coding_transcript_exon_variant     FAM138C         None         MODIFIER
8    chr1     35144   A   C      NR_036051.1      NR_036051.1          None   4641.0      C       None     n.*4641A>C  ...              None   MIR1302-2     NR_036051.1         None               None           None   MIR1302-2             downstream_gene_variant   MIR1302-2         None         MODIFIER
9    chr1     35144   A   C      NR_036266.1      NR_036266.1          None   4641.0      C       None     n.*4641A>C  ...              None   MIR1302-9     NR_036266.1         None               None           None   MIR1302-9             downstream_gene_variant   MIR1302-9         None         MODIFIER
10   chr1     35144   A   C      NR_036267.1      NR_036267.1          None   4641.0      C       None     n.*4641A>C  ...              None  MIR1302-10     NR_036267.1         None               None           None  MIR1302-10             downstream_gene_variant  MIR1302-10         None         MODIFIER
11   chr1     35144   A   C      NR_036268.1      NR_036268.1          None   4641.0      C       None     n.*4641A>C  ...              None  MIR1302-11     NR_036268.1         None               None           None  MIR1302-11             downstream_gene_variant  MIR1302-11         None         MODIFIER
12   chr1     69101   A   G  ENST00000335137  ENST00000335137          None     None   None          .           None  ...              None       OR4F5            None            T               None     0.27627227        None                                None       OR4F5         None             None
13   chr1     69101   A   G  ENST00000641515  ENST00000641515          None     None   None          .           None  ...              None       OR4F5            None            T               None              .        None                                None       OR4F5         None             None
14   chr1     69101   A   G   NM_001005484.1   NM_001005484.1         4/305     None      G       None        c.11A>G  ...            11/918       OR4F5  NM_001005484.1         None               None           None       OR4F5                    missense_variant       OR4F5    p.Glu4Gly         MODERATE
15   chr1    768251   A   G      NR_047519.1      NR_047519.1          None     None      G       None  n.287+3767A>G  ...              None   LINC01128     NR_047519.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
16   chr1    768251   A   G      NR_047521.1      NR_047521.1          None     None      G       None  n.287+3767A>G  ...              None   LINC01128     NR_047521.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
17   chr1    768251   A   G      NR_047523.1      NR_047523.1          None     None      G       None  n.287+3767A>G  ...              None   LINC01128     NR_047523.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
18   chr1    768251   A   G      NR_047524.1      NR_047524.1          None     None      G       None  n.287+3767A>G  ...              None   LINC01128     NR_047524.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
19   chr1    768251   A   G      NR_047525.1      NR_047525.1          None     None      G       None  n.154+3767A>G  ...              None   LINC01128     NR_047525.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
20   chr1    768251   A   G      NR_047526.1      NR_047526.1          None     None      G       None  n.287+3767A>G  ...              None   LINC01128     NR_047526.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
21   chr1    768252   A   G      NR_047519.1      NR_047519.1          None     None      G       None  n.287+3768A>G  ...              None   LINC01128     NR_047519.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
22   chr1    768252   A   G      NR_047521.1      NR_047521.1          None     None      G       None  n.287+3768A>G  ...              None   LINC01128     NR_047521.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
23   chr1    768252   A   G      NR_047523.1      NR_047523.1          None     None      G       None  n.287+3768A>G  ...              None   LINC01128     NR_047523.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
24   chr1    768252   A   G      NR_047524.1      NR_047524.1          None     None      G       None  n.287+3768A>G  ...              None   LINC01128     NR_047524.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
25   chr1    768252   A   G      NR_047525.1      NR_047525.1          None     None      G       None  n.154+3768A>G  ...              None   LINC01128     NR_047525.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
26   chr1    768252   A   G      NR_047526.1      NR_047526.1          None     None      G       None  n.287+3768A>G  ...              None   LINC01128     NR_047526.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
27   chr1    768253   A   G      NR_047519.1      NR_047519.1          None     None      G       None  n.287+3769A>G  ...              None   LINC01128     NR_047519.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
28   chr1    768253   A   G      NR_047521.1      NR_047521.1          None     None      G       None  n.287+3769A>G  ...              None   LINC01128     NR_047521.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
29   chr1    768253   A   G      NR_047523.1      NR_047523.1          None     None      G       None  n.287+3769A>G  ...              None   LINC01128     NR_047523.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
30   chr1    768253   A   G      NR_047524.1      NR_047524.1          None     None      G       None  n.287+3769A>G  ...              None   LINC01128     NR_047524.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
31   chr1    768253   A   G      NR_047525.1      NR_047525.1          None     None      G       None  n.154+3769A>G  ...              None   LINC01128     NR_047525.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
32   chr1    768253   A   G      NR_047526.1      NR_047526.1          None     None      G       None  n.287+3769A>G  ...              None   LINC01128     NR_047526.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
33   chr7  55249063   G   A   NM_001346897.2   NM_001346897.2      742/1091     None      A       None      c.2226G>A  ...         2487/3848        EGFR  NM_001346897.2         None               None           None        EGFR                  synonymous_variant        EGFR  p.Gln742Gln              LOW
34   chr7  55249063   G   A   NM_001346898.2   NM_001346898.2      787/1136     None      A       None      c.2361G>A  ...         2622/3983        EGFR  NM_001346898.2         None               None           None        EGFR                  synonymous_variant        EGFR  p.Gln787Gln              LOW
35   chr7  55249063   G   A   NM_001346899.1   NM_001346899.1      742/1165     None      A       None      c.2226G>A  ...         2483/6218        EGFR  NM_001346899.1         None               None           None        EGFR                  synonymous_variant        EGFR  p.Gln742Gln              LOW
36   chr7  55249063   G   A   NM_001346900.2   NM_001346900.2      734/1157     None      A       None      c.2202G>A  ...         2393/9676        EGFR  NM_001346900.2         None               None           None        EGFR                  synonymous_variant        EGFR  p.Gln734Gln              LOW
37   chr7  55249063   G   A   NM_001346941.2   NM_001346941.2       520/943     None      A       None      c.1560G>A  ...         1821/9104        EGFR  NM_001346941.2         None               None           None        EGFR                  synonymous_variant        EGFR  p.Gln520Gln              LOW
38   chr7  55249063   G   A      NM_005228.5      NM_005228.5      787/1210     None      A       None      c.2361G>A  ...         2622/9905        EGFR     NM_005228.5         None               None           None        EGFR                  synonymous_variant        EGFR  p.Gln787Gln              LOW
39   chr7  55249063   G   A      NR_047551.1      NR_047551.1          None     None      A       None      n.1201C>T  ...              None    EGFR-AS1     NR_047551.1         None               None           None    EGFR-AS1  non_coding_transcript_exon_variant    EGFR-AS1         None         MODIFIER
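
As a rough illustration of the "from_columns_map" idea, the sketch below turns one variant's INFO values into one row per transcript. It is not HOWARD's implementation: the function name `explode_transcripts` is hypothetical, and it assumes that the mapped columns hold comma-separated values aligned by position with the transcript ID column, and that a single value (such as genename) applies to every transcript.

```python
from typing import Dict, List, Optional


def explode_transcripts(
    info: Dict[str, str], id_column: str, info_columns: List[str]
) -> List[Dict[str, Optional[str]]]:
    """Return one dict per transcript from comma-separated, position-aligned values."""
    transcript_ids = info[id_column].split(",")
    rows = []
    for index, transcript_id in enumerate(transcript_ids):
        row: Dict[str, Optional[str]] = {"transcript": transcript_id}
        for column in info_columns:
            values = info[column].split(",") if column in info else []
            if len(values) == 1:
                # Assumption: a single value is shared by all transcripts
                row[column] = values[0]
            elif index < len(values):
                row[column] = values[index]
            else:
                # Column absent for this variant
                row[column] = None
        rows.append(row)
    return rows


# INFO values taken from the chr1:69101 example record in this thread
info = {
    "genename": "OR4F5",
    "Ensembl_transcriptid": "ENST00000641515,ENST00000335137",
    "LIST_S2_score": "0.79822,0.716128",
}
rows = explode_transcripts(
    info, "Ensembl_transcriptid", ["genename", "LIST_S2_score", "LIST_S2_pred"]
)
for row in rows:
    print(row)
```

Each resulting row carries the transcript ID, which can then serve as the unique key of the transcripts table.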
antonylebechec commented 1 month ago

Add a calculation that stores transcripts annotations as an INFO field in JSON format. Example (create config/param.transcripts.json with param from help):

howard calculation --input="tests/data/example.ann.transcripts.vcf.gz" --output="/tmp/output.transcript.vcf" --calculations="TRANSCRIPTS_JSON" --param="config/param.transcripts.json"
antonylebechec commented 2 weeks ago

Prioritization of transcripts in 'HOWARD' mode with 'transcripts' profiles available in a configuration JSON file, with 'PZT' as prefix:

"transcripts": {
  ...
  "prioritization": {
     "profiles": ["transcripts"],
     "prioritization_config": "config/prioritization_transcripts_profiles.json",
     "pzprefix": "PZT",
     "prioritization_score_mode": "HOWARD"
  }
}

With prioritization parameters based on 'LIST_S2_score' (file 'config/prioritization_transcripts_profiles.json'):

{
  "transcripts": {
    "LIST_S2_score": [
      {
        "type": "gt",
        "value": "0.75",
        "score": 10,
        "flag": "PASS",
        "comment": ["Very Good LIST Score"]
      },
      {
        "type": "gt",
        "value": "0.50",
        "score": 10,
        "flag": "PASS",
        "comment": ["Good LIST Score"]
      }
    ]
  }
}

Command:

howard calculation --input='tests/data/example.dbnsfp.transcripts.vcf.gz' --output='/tmp/example.calculation.transcripts.vcf' --param='config/param.transcripts.json' --calculations='TRANSCRIPTS_PRIORITIZATION'

Output VCF with PZTTranscript, PZTScore and PZTFlag (partial output):

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    28736   .       A       C       100     PASS    CLNSIG=pathogenic
chr1    35144   .       A       C       100     PASS    CLNSIG=non-pathogenic
chr1    69101   .       A       G       100     PASS    genename=OR4F5;Ensembl_transcriptid=ENST00000641515,ENST00000335137;LIST_S2_score=0.79822,0.716128;PZTTranscript=ENST00000641515;PZTScore=20;PZTFlag=PASS
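
The PZTScore of 20 above is consistent with a scoring mode that adds up the score of every matching rule. The sketch below reproduces that arithmetic for the two Ensembl transcripts of the chr1:69101 record. It is only an illustration under that assumption, not HOWARD's code: `score_transcript` is a hypothetical helper, and only the "gt" rule type is handled.

```python
from typing import Dict, List

# Rules copied from config/prioritization_transcripts_profiles.json above
PROFILE: Dict[str, List[dict]] = {
    "LIST_S2_score": [
        {"type": "gt", "value": "0.75", "score": 10, "flag": "PASS"},
        {"type": "gt", "value": "0.50", "score": 10, "flag": "PASS"},
    ]
}


def score_transcript(annotations: Dict[str, str], profile: Dict[str, List[dict]]) -> int:
    """Sum the scores of all matching 'gt' rules (assumed summing behaviour)."""
    total = 0
    for field, rules in profile.items():
        try:
            value = float(annotations[field])
        except (KeyError, TypeError, ValueError):
            # Missing or non-numeric annotation (e.g. "."): no rule can match
            continue
        for rule in rules:
            if rule["type"] == "gt" and value > float(rule["value"]):
                total += rule["score"]
    return total


transcripts = {
    "ENST00000641515": {"LIST_S2_score": "0.79822"},
    "ENST00000335137": {"LIST_S2_score": "0.716128"},
}
scores = {tid: score_transcript(ann, PROFILE) for tid, ann in transcripts.items()}
best = max(scores, key=scores.get)
print(best, scores[best])  # ENST00000641515 20
```

ENST00000641515 matches both thresholds (10 + 10), while ENST00000335137 only passes the 0.50 threshold, which agrees with the PZTTranscript and PZTScore values in the output.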
antonylebechec commented 2 weeks ago

Include transcripts annotations, either in JSON format or in a structured format (like 'snpEff'), with the calculation tool.

Parameters in JSON file (e.g. 'config/param.transcripts.json'):

{
  "transcripts": {
    "transcripts_info_field_json": "transcripts_json",
    "transcripts_info_field_format": "transcripts_ann",
    "table": "transcripts",
    "struct": {...},
    ...
  }
}

Command:

howard calculation --input='tests/data/example.ann.transcripts.vcf.gz' --output='/tmp/example.calculation.transcripts.vcf' --param='config/param.transcripts.json' --calculations='TRANSCRIPTS_ANNOTATIONS'

Output VCF with 'transcripts_json' and 'transcripts_ann' INFO fields (partial output):

##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO'">
##INFO=<ID=transcripts_json,Number=.,Type=String,Description="Transcripts in JSON format">
##INFO=<ID=transcripts_ann,Number=.,Type=String,Description="Transcripts annotations: 'transcript | VARITY_R_score | transcript_1 | Annotation | FeatureID | Allele | HGVSc | Aloft_pred | HGVSp | TranscriptBioType | Distance | genename | LIST_S2_score | AAposAAlength | GeneID | Ensembl_geneid | Rank | GeneName_1 | ERRORSWARNINGSINFO | FeatureType | LIST_S2_pred | CDSposCDSlength | cDNAposcDNAlength | AnnotationImpact'">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    69101   .       A       G       100     PASS    ANN=G|missense_variant|...;genename=OR4F5;Ensembl_transcriptid=ENST00000641515,ENST00000335137;LIST_S2_score=0.79822,0.716128;transcripts_json={"ENST00000335137":{"VARITY_R_score":"0.27627227","transcript_1":"ENST00000335137","Annotation":null,"FeatureID":null,"Allele":null,"HGVSc":null,"Aloft_pred":".","HGVSp":null,"TranscriptBioType":null,"Distance":null,"genename":"OR4F5","LIST_S2_score":"0.716128","AAposAAlength":null,"GeneID":null,"Ensembl_geneid":"ENSG00000186092","Rank":null,"GeneName_1":"OR4F5","ERRORSWARNINGSINFO":null,"FeatureType":null,"LIST_S2_pred":"T","CDSposCDSlength":null,"cDNAposcDNAlength":null,"AnnotationImpact":null},"ENST00000641515":{"VARITY_R_score":".","transcript_1":"ENST00000641515","Annotation":null,"FeatureID":null,"Allele":null,"HGVSc":null,"Aloft_pred":".","HGVSp":null,"TranscriptBioType":null,"Distance":null,"genename":"OR4F5","LIST_S2_score":"0.79822","AAposAAlength":null,"GeneID":null,"Ensembl_geneid":"ENSG00000186092","Rank":null,"GeneName_1":"OR4F5","ERRORSWARNINGSINFO":null,"FeatureType":null,"LIST_S2_pred":"T","CDSposCDSlength":null,"cDNAposcDNAlength":null,"AnnotationImpact":null},"NM_001005484.1":{"VARITY_R_score":null,"transcript_1":"NM_001005484.1","Annotation":"missense_variant","FeatureID":"NM_001005484.1","Allele":"G","HGVSc":"c.11A>G","Aloft_pred":null,"HGVSp":"p.Glu4Gly","TranscriptBioType":"protein_coding","Distance":null,"genename":"OR4F5","LIST_S2_score":null,"AAposAAlength":"4/305","GeneID":"OR4F5","Ensembl_geneid":null,"Rank":"1/1","GeneName_1":"OR4F5","ERRORSWARNINGSINFO":null,"FeatureType":"transcript","LIST_S2_pred":null,"CDSposCDSlength":"11/918","cDNAposcDNAlength":"11/918","AnnotationImpact":"MODERATE"}};transcripts_ann=ENST00000335137|0.27627227|ENST00000335137|||||.||||OR4F5|0.716128|||ENSG00000186092||OR4F5|||T|||,ENST00000641515|.|ENST00000641515|||||.||||OR4F5|0.79822|||ENSG00000186092||OR4F5|||T|||,NM_001005484.1||NM_001005484.1|missense_variant|NM_001005484.1|G|c.11A>G||p.Glu4Gly|protein_coding||OR4F5||4/305|OR4F5||1/1|OR4F5||transcript||11/918|11/918|MODERATE
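
For downstream use, the structured 'transcripts_ann' value can be read back much like a snpEff 'ANN' field. The sketch below is a minimal reader, not part of HOWARD: `parse_transcripts_ann` is a hypothetical name, the field order is copied from the header Description above, and it assumes that no field value contains "," or "|".

```python
from typing import Dict, List, Optional

# Field order copied from the 'transcripts_ann' header Description above
FIELDS = [
    "transcript", "VARITY_R_score", "transcript_1", "Annotation", "FeatureID",
    "Allele", "HGVSc", "Aloft_pred", "HGVSp", "TranscriptBioType", "Distance",
    "genename", "LIST_S2_score", "AAposAAlength", "GeneID", "Ensembl_geneid",
    "Rank", "GeneName_1", "ERRORSWARNINGSINFO", "FeatureType", "LIST_S2_pred",
    "CDSposCDSlength", "cDNAposcDNAlength", "AnnotationImpact",
]


def parse_transcripts_ann(value: str) -> List[Dict[str, Optional[str]]]:
    """Split a transcripts_ann value into one dict per transcript."""
    records = []
    for entry in value.split(","):
        parts = entry.split("|")
        # Empty strings stand for missing values
        records.append({name: (part or None) for name, part in zip(FIELDS, parts)})
    return records


# First two entries of the transcripts_ann value shown in the record above
value = (
    "ENST00000335137|0.27627227|ENST00000335137|||||.||||OR4F5|0.716128|||"
    "ENSG00000186092||OR4F5|||T|||,"
    "ENST00000641515|.|ENST00000641515|||||.||||OR4F5|0.79822|||"
    "ENSG00000186092||OR4F5|||T|||"
)
records = parse_transcripts_ann(value)
for record in records:
    print(record["transcript"], record["LIST_S2_score"], record["LIST_S2_pred"])
```

The 'transcripts_json' field carries the same information keyed by transcript ID, so once its value is isolated from the INFO column it can be loaded with Python's standard `json.loads`.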
antonylebechec commented 2 weeks ago

To also take variants' annotations into account when prioritizing transcripts, the INFO column of the VCF is included in the transcripts view. Prioritization profiles for transcripts can therefore now use annotations from variants.

Here is an example of a profile combining an annotation from transcripts ('LIST_S2_score') and an annotation from variants ('CLNSIG'):

{
  "transcripts": {
    "LIST_S2_score": [
      {
        "type": "gt",
        "value": "0.75",
        "score": 10,
        "flag": "PASS",
        "comment": ["Very Good LIST Score"]
      },
      {
        "type": "gt",
        "value": "0.50",
        "score": 10,
        "flag": "PASS",
        "comment": ["Good LIST Score"]
      }
    ],
    "CLNSIG": [
      {
        "type": "eq",
        "value": "pathogenic",
        "score": 100,
        "flag": "PASS",
        "comment": ["Pathogenic"]
      }
    ]
  }
}
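
To make the combination concrete, the sketch below merges variant-level INFO values into a transcript row and then applies the profile above. It is a hedged illustration only: the helper names, the summing of matching scores, and the input record (a transcript with a LIST_S2_score on a variant annotated CLNSIG=pathogenic) are all hypothetical and do not come from HOWARD's code or test data.

```python
import operator
from typing import Dict, List

COMPARATORS = {"gt": operator.gt, "eq": operator.eq}

# Rules copied from the profile above (comments omitted)
PROFILE: Dict[str, List[dict]] = {
    "LIST_S2_score": [
        {"type": "gt", "value": "0.75", "score": 10},
        {"type": "gt", "value": "0.50", "score": 10},
    ],
    "CLNSIG": [{"type": "eq", "value": "pathogenic", "score": 100}],
}


def coerce(raw):
    """Return a float when the value is numeric, otherwise the value unchanged."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return raw


def score_row(row: Dict[str, str], profile: Dict[str, List[dict]]) -> int:
    """Sum the scores of matching rules (assumed behaviour, for illustration)."""
    total = 0
    for field, rules in profile.items():
        if row.get(field) is None:
            continue
        left = coerce(row[field])
        for rule in rules:
            right = coerce(rule["value"])
            # Only compare like with like (number vs number, text vs text)
            if type(left) is type(right) and COMPARATORS[rule["type"]](left, right):
                total += rule["score"]
    return total


# Hypothetical inputs: variant-level INFO plus one transcript's annotations
variant_info = {"CLNSIG": "pathogenic"}
transcript = {"transcript": "ENST00000641515", "LIST_S2_score": "0.79822"}
row = {**variant_info, **transcript}  # transcript row enriched with variant INFO

print(score_row(transcript, PROFILE))  # 20: transcript annotation only
print(score_row(row, PROFILE))         # 120: plus the CLNSIG rule
```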
antonylebechec commented 2 weeks ago

TODO: