Local frequency file handling of indels

damiansm commented 7 years ago

Current code is expecting indels to be handled in the usual Jannovar way e.g. 1 12345 . TG 0.01 rather than 1 12345 A ATG 0.01

and 1 12344 AT . rather than 1 12345 AAT A

More likely that people will supply data in the latter format so I have fixed and committed code to push to handle this scenario

pnrobinson commented 7 years ago

Don't understand -- Jannovar is starting from VCF format (A -> ATG etc), so this shouldnt be a problem?

damiansm commented 7 years ago

No - the variant gets converted somewhere along the line from the VCF format of A -> ATG to . -> TG. I am blaming Jannovar but could be wrong. It is not a big issue as long as we tell the users what format to use and code accordingly!

I think A->ATG is more standard in the community right e.g. when you download files from other sources.

visze commented 7 years ago

No this is not the cause of Jannovar.

I think the Exomiser variant model carries the VariantContext from htsjdk with it. So I think all information you need is stored in the Exomiser model. I think it is the output writer that converts it like that.

julesjacobsen commented 7 years ago

It's definitely a Jannovar thing - https://github.com/exomiser/Exomiser/blob/development/exomiser-core/src/main/java/org/monarchinitiative/exomiser/core/genome/VariantFactory.java#L229-L232

Input VCF:

##fileformat=VCFv4.1
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  manuel
1   123256213   .   CA  CC  100.15  PASS    GENE=RBM8A  GT:DP   1/1:33
1   123256213   .   CA  C   100.15  PASS    GENE=RBM8A  GT:DP   1/1:33
1   123256214   .   A   AC  100.15  PASS    GENE=RBM8A  GT:DP   1/1:33
1   145508800   rs12345678  T   C   123.15  PASS    GENE=RBM8A  GT:DP   1/1:33

Test code in VariantFactoryTest

List<VariantEvaluation> variants = instance.streamVariantEvaluations(vcfPath).collect(toList());
    variants.forEach(varEv -> {
        VariantContext vc = varEv.getVariantContext();
        System.out.printf("%s %s %s %s%n",
                varEv.getPosition(), varEv.getRef(), varEv.getAlt(),
                vc.toString());
    });

Output (columns 1-3 are derived from a Jannovar TranscriptAnnotations object):

123256214 A C [VC Unknown @ 1:123256213-123256214 Q100.15 of type=MNP alleles=[CA*, CC] attr={GENE=RBM8A} GT=[[manuel CC/CC DP 33]]
123256214 A - [VC Unknown @ 1:123256213-123256214 Q100.15 of type=INDEL alleles=[CA*, C] attr={GENE=RBM8A} GT=[[manuel C/C DP 33]]
123256214 - C [VC Unknown @ 1:123256214 Q100.15 of type=INDEL alleles=[A*, AC] attr={GENE=RBM8A} GT=[[manuel AC/AC DP 33]]
145508800 T C [VC Unknown @ 1:145508800 Q123.15 of type=SNP alleles=[T*, C] attr={GENE=RBM8A} GT=[[manuel C/C DP 33]]

It looks like its caused by this line and this line

This is semi-sensible to do as we're not 100% sure whats coming into exomiser and it looks like jannovar normalises the variants in order to annotate them. Ideally here Jannovar wouldn't normalise to an empty string. Is there a way of getting Jannovar to normalise to the VCF spec?

visze commented 7 years ago

Yep. Sorry I was wrong. I always use the jannovar-htsjdk' module and not thejannovar-corealone. Withjannovar-htsjdk` it is not problematic because I always used the VariantContext further.

So yes: The real core model does not know the alt for an deletion or the ref for an insertion. The GenomeVariant does the conversion.

But it should be not problematic at this position. You know the alt_allele_id and also the VariantContext at that position. So you shouldn't use the VariantAnnotations to get ref and alt and simply use the VariantContext.

julesjacobsen commented 7 years ago

I think I'm done with this now. Last relevant commit 78766ee8d6761b3b71782dda2f2733c8a64cd42a.

Exomiser splits all alternate alleles into their own VariantEvaluation and trims each one before getting the annotations from Jannovar see here

This constrains Jannovar to the same genomic position in cases where it would otherwise annotate a position further upstream due to the differences in how it represents variants compared to the VCF spec.

So the local frequency file defined as tab-delimited lines like so. It requires variants to be decomposed into single allele variants which have been left-aligned and trimmed from the left, then the right using 1-based positions. This is 'standard' practice for VCF apparently. e.g. ANNOVAR and MacArthur lab

//chr   pos ref alt freq(%)
1 12345   A   T   23.0  (an A->T SNP on chr1 at position 12345 with frequency of 23.0%)
//note in the usual VCF format these would be on a single line
1 12345   A   TG   0.01  (an A->TG insertion on chr1 at position 12345 with frequency of 0.01%)
1 12345   AT   G   0.02  (an AT->G deletion on chr1 at position 12345 with frequency of 0.02%)
22 12345   A   T   23.0  (an A->T SNP on chr22 at position 12345 with frequency of 23.0%)
//non-autosomes
X 12345   AT   G   0.02  (an AT->G deletion on chrX at position 12345 with frequency of 0.02%)
Y 12345   AT   G   0.02  (an AT->G deletion on chrY at position 12345 with frequency of 0.02%)
M 12345   AT   G   0.02  (an AT->G deletion on chrM at position 12345 with frequency of 0.02%)

exomiser / Exomiser

Local frequency file handling of indels #207