genome / analysis-workflows

Open workflow definitions for genomic analysis from MGI at WUSM.
MIT License

misannotation in somatic pipeline? #748

Closed chrisamiller closed 5 years ago

chrisamiller commented 5 years ago

Hi Dave and Feiyu,

Just wanted to note that the latest CLE somatic results do not seem to be reporting the most severe consequence in the variants.annotated.tsv. I only noticed this being an issue for indels.

This doesn't affect vaccine/immunotherapy pipelines b/c all vcf annotations are processed.

Anyone or any process that uses the variants.annotated.tsv would be affected.

Two examples, an in-frame indel and a frameshift indel, are below. In the tsv summary they are reported as a downstream gene variant and an intronic variant, respectively.

Case directory:

H_MT-8043-005.cle_results_somatic -> /gscmnt/gc13015/cle/IDT_somatic_exome_assay/CI-398/H_MT-8043-005

########

$ zgrep FAM173A annotated_filtered.vcf.gz | cut -f 1-5

chr16 721358 . CGGCTCG C

Note VEP consequence: inframe_deletion|MODERATE|FAM173A|ENSG00000103254|Transcript|ENST00000569529.5|protein_coding

$ zgrep 721358 variants.annotated.tsv

chr16 721358 . CGGCTCG C mutect-varscan-pindel CGGCTCG/CGGCTCG 453,0 0 453 CGGCTCG/C 1402,354 0.20159 1756 downstream_gene_variant CCDC7Transcript ENST00000293889.10

########

$ zgrep OR51F2 annotated_filtered.vcf.gz | cut -f 1-5

chr11 4821935 . AGTTCTATG A

Note VEP consequence: frameshift_variant|HIGH|OR51F2|ENSG00000176925|Transcript|ENST00000641672.1|protein_coding|

$ zgrep 4821935 variants.annotated.tsv

chr11 4821935 . AGTTCTATG A Intersection AGTTCTATG/AGTTCTATG 203,0 0 203 AGTTCTATG/A 220,40 0.15385 260 intron_variant MMP26 Transcript ENST00000380390.5 ENST00000380390.5:c.-145+54597_-145+54604del HGNC:14249

########

-Mike M.
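
For anyone who wants to check this across a whole case rather than by eye, here is a rough Python sketch of the comparison described above: take every annotation block in a variant's CSQ value, rank the consequences, and see whether the most severe one is what the tsv reports. The severity list is an abbreviated subset in the order Ensembl documents, the leading CSQ field order is an assumption (the real order is in the VCF header), and `GENE_X` / `ENSG_X` are placeholders because the second block's gene is not known from this thread.

```python
# Abbreviated severity order, most severe first. This is only a subset of
# the consequence table in the Ensembl VEP docs, not the full list.
SEVERITY = [
    "frameshift_variant",
    "inframe_deletion",
    "missense_variant",
    "synonymous_variant",
    "intron_variant",
    "upstream_gene_variant",
    "downstream_gene_variant",
]
RANK = {term: i for i, term in enumerate(SEVERITY)}

# Assumed leading CSQ field order; read the real one from the VCF header.
CSQ_FIELDS = ["Allele", "Consequence", "IMPACT", "SYMBOL", "Gene",
              "Feature_type", "Feature", "BIOTYPE"]


def parse_csq(csq, fields=CSQ_FIELDS):
    """Split a CSQ value into one dict per annotation block."""
    return [dict(zip(fields, block.split("|"))) for block in csq.split(",")]


def block_rank(block):
    """Rank a block by its most severe term; unknown terms sort last."""
    terms = block["Consequence"].split("&")
    return min(RANK.get(term, len(SEVERITY)) for term in terms)


def most_severe(blocks):
    """Return the annotation block with the most severe consequence."""
    return min(blocks, key=block_rank)


# First block is from the chr16 example; the second block's gene fields
# are placeholders (GENE_X / ENSG_X), not real values.
csq = (
    "-|inframe_deletion|MODERATE|FAM173A|ENSG00000103254|Transcript|"
    "ENST00000569529.5|protein_coding,"
    "-|downstream_gene_variant|MODIFIER|GENE_X|ENSG_X|Transcript|"
    "ENST00000293889.10|protein_coding"
)
blocks = parse_csq(csq)
worst = most_severe(blocks)
reported_in_tsv = "downstream_gene_variant"

print("most severe:", worst["Consequence"], worst["Feature"])
print("tsv matches most severe:", worst["Consequence"] == reported_in_tsv)
```

Running something like this per variant would flag every row where the tsv consequence differs from the most severe one in the VCF.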

chrisamiller commented 5 years ago

CLE pipeline results are here: /gscmnt/gc13015/cle/IDT_somatic_exome_assay/CI-398/H_MT-8043-005/

The run is certainly relatively recent (it includes somalier inputs, etc.). The CWL that was run is here: /gscmnt/gc13015/cle/IDT_somatic_exome_assay/git/analysis-workflows/definitions/pipelines/gathered_cle_somatic_exome.cwl. It matches the current master branch.

VEP fields in input.yaml seem sane at first glance:

vep_to_table_fields:
- Consequence
- SYMBOL
- Feature_type
- Feature
- HGVSc
- HGVSp
- cDNA_position
- CDS_position
- Protein_position
- Amino_acids
- Codons
- HGNC_ID
- Existing_variation
- gnomADe_AF
vep_cache_dir: /gscmnt/gc2560/core/cwl/inputs/VEP_cache
vep_ensembl_assembly: GRCh38
vep_ensembl_version: 95
vep_ensembl_species: homo_sapiens
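
To make the role of vep_to_table_fields concrete, here is a small Python sketch of the mapping I am assuming: the table step takes a single annotation block and emits one column per listed field, in order, leaving a field empty when the block has no value for it. The function name is hypothetical and this is not the pipeline's actual code. The values come from the chr11 row in the tsv, with only a few fields filled in.

```python
# Field list copied from the input.yaml above.
VEP_TO_TABLE_FIELDS = [
    "Consequence", "SYMBOL", "Feature_type", "Feature", "HGVSc", "HGVSp",
    "cDNA_position", "CDS_position", "Protein_position", "Amino_acids",
    "Codons", "HGNC_ID", "Existing_variation", "gnomADe_AF",
]


def block_to_columns(block, fields=VEP_TO_TABLE_FIELDS):
    """Hypothetical helper: one value per requested field, '' if absent."""
    return [block.get(name, "") for name in fields]


# A single annotation block, partially filled from the chr11 tsv row.
block = {
    "Consequence": "intron_variant",
    "SYMBOL": "MMP26",
    "Feature_type": "Transcript",
    "Feature": "ENST00000380390.5",
    "HGNC_ID": "HGNC:14249",
}
columns = block_to_columns(block)
print("\t".join(columns))
```

If the table really is built from one block per variant, then the field list only controls which columns appear. Which transcript's values fill them is decided earlier, when the block is chosen.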

chrisamiller commented 5 years ago

@jhundal, I don't believe anything is wrong here. VEP is being run with the --flag_pick option, which means that it flags one annotation per variant, chosen according to these criteria: https://useast.ensembl.org/info/docs/tools/vep/script/vep_other.html#pick

That means that for this case it does, in fact, choose a downstream variant as the most reliable annotation, but I don't think that's a failure.
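
Here is a toy Python sketch of why an ordered pick can land on the less severe consequence. It assumes an abbreviated criteria order (canonical status, then biotype, then consequence rank). The real list is longer, so check the page linked above for the order that applies to your VEP version. The `canonical` flags and the biotype of the second transcript are hypothetical values chosen to illustrate the effect; I have not looked them up for these two transcripts.

```python
# Abbreviated consequence ranks (lower is more severe); a subset only.
SEVERITY = {
    "frameshift_variant": 0,
    "inframe_deletion": 1,
    "intron_variant": 2,
    "downstream_gene_variant": 3,
}


def pick_key(block):
    """Sort key for an assumed, abbreviated pick order.

    Earlier tuple elements win outright, so consequence rank only
    breaks ties between blocks that agree on canonical and biotype.
    """
    return (
        0 if block["canonical"] else 1,
        0 if block["biotype"] == "protein_coding" else 1,
        SEVERITY[block["consequence"]],
    )


def pick(blocks):
    """Return the block that an ordered pick would flag."""
    return min(blocks, key=pick_key)


blocks = [
    {"feature": "ENST00000569529.5", "consequence": "inframe_deletion",
     "biotype": "protein_coding", "canonical": False},  # hypothetical flag
    {"feature": "ENST00000293889.10", "consequence": "downstream_gene_variant",
     "biotype": "protein_coding", "canonical": True},   # hypothetical values
]

picked = pick(blocks)
most_severe = min(blocks, key=lambda b: SEVERITY[b["consequence"]])

print("picked:     ", picked["feature"], picked["consequence"])
print("most severe:", most_severe["feature"], most_severe["consequence"])
```

With these made-up flags the canonical transcript wins before consequence rank is ever compared, so the picked block and the most severe block differ. That is the behavior seen in the tsv.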