PacificBiosciences / trgt

Tandem repeat genotyping and visualization from PacBio HiFi data

VCF Spec #23

Open themkdemiiir opened 9 months ago

themkdemiiir commented 9 months ago

Hi,

I noticed that TRGT currently uses version 4.3 of the VCF specification for tandem repeats. However, version 4.3 does not provide guidelines on handling tandem repeats, whereas version 4.4 does. Do you plan to follow the guidelines provided in version 4.4? Additionally, would you consider splitting complex tandem repeats into separate records? They are challenging to annotate when several repeats are combined in a single structure such as (AT)nTCG(GC)n.

Thank you.

themkdemiiir commented 9 months ago

Could you also share an example TRGT VCF file with me? Thanks.

egor-dolzhenko commented 9 months ago

Thanks for the questions. Note that the <CNV:TR> variants introduced in the 4.4 specification are designed for situations "when the exact [TR] sequence is not known". TRGT outputs full-length TR sequences and so does not currently use this variant representation.

And I agree that splitting complex repeat regions into constituent simple TRs can be helpful. We are planning to create some helper tools to decompose / annotate repeats after the VCFs have been generated. (Splitting complex tandem repeats into multiple VCF records can significantly complicate analyses involving multiple samples and also analyses of regions containing large clusters of simple TRs.) What kind of annotation are you interested in?

themkdemiiir commented 9 months ago

I had initially planned to use VEP to annotate the consequences of TRs on transcripts. However, because the TRs are reported with a combined structure such as (AT)nTCG(GC)n, VEP could not annotate them accurately. Handling each TR as a separate VCF line would therefore be much easier. I would appreciate having an option to split them accordingly.

egor-dolzhenko commented 9 months ago

Thanks for clarifying. I have very little experience with VEP, but I will try to learn how it works and see what we could do to make TRGT more compatible with this tool. In particular, we will definitely consider providing some kind of option to split complex repeats into constituent simple repeats either during VCF generation or after.
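To make the idea of splitting concrete, here is a minimal sketch of how a post-processing step might decompose one allele sequence with the (AT)nTCG(GC)n structure into its constituent segments. This is not TRGT code: the `Segment` type, the `decompose` function, and the `STRUCTURE` encoding are all hypothetical names made up for illustration, and the exact-match regular expression assumes perfect repeats, which real alleles with interruptions would not satisfy.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    unit: str        # repeat motif or fixed interruption sequence
    is_repeat: bool  # True for (motif)n, False for a fixed sequence
    start: int       # 0-based offset within the allele sequence
    end: int         # exclusive end offset
    copies: int      # number of copies of `unit` in this segment


def decompose(allele: str,
              structure: List[Tuple[str, bool]]) -> Optional[List[Segment]]:
    """Split `allele` into segments following `structure` (hypothetical helper).

    Returns None when the allele does not match the structure exactly.
    Units are assumed to be non-empty.
    """
    # One capture group per segment: repeats may occur zero or more times,
    # fixed sequences must occur exactly once.
    pattern = "".join(
        f"((?:{re.escape(unit)})*)" if is_repeat else f"({re.escape(unit)})"
        for unit, is_repeat in structure
    )
    match = re.fullmatch(pattern, allele)
    if match is None:
        return None
    segments = []
    for index, (unit, is_repeat) in enumerate(structure, start=1):
        start, end = match.span(index)
        segments.append(
            Segment(unit, is_repeat, start, end, (end - start) // len(unit))
        )
    return segments


# (AT)nTCG(GC)n, encoded as (unit, is_repeat) pairs; an assumed encoding.
STRUCTURE = [("AT", True), ("TCG", False), ("GC", True)]

segments = decompose("ATATATTCGGCGC", STRUCTURE)
for segment in segments:
    print(segment)
```

Each resulting segment carries its offset within the allele, so in principle it could be written out as its own record. The downside mentioned earlier still applies: the same locus then spans several records per sample.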

Also, if you are interested in annotating variants with VEP, it might be better to use general-purpose variant calling tools that VEP was designed for. We are working on our own TR-specific annotation engine that would annotate unusual expansions and composition changes within the repeat sequence.

Another idea is to apply variant normalization to TRGT VCFs. The resulting normalized VCFs might be more amenable to analysis with VEP.
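As a rough illustration of what normalization does to a full-length repeat allele, the sketch below trims bases shared by REF and ALT so that only the minimal difference remains. It is a simplification under stated assumptions: `trim_alleles` is a hypothetical helper, it handles a single ALT allele, and it only trims the alleles themselves. Complete normalization also left-aligns against the reference genome, which is what tools such as `bcftools norm` are for. The example record is invented, not taken from a TRGT VCF.

```python
from typing import Tuple


def trim_alleles(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Trim bases shared by REF and ALT, keeping at least one base in each.

    Hypothetical helper: covers only the allele-trimming part of variant
    normalization and does not consult the reference genome.
    """
    # Trim the shared suffix first so that indels end up left-placed.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim the shared prefix, advancing the position for each base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Invented example: the ALT allele carries one extra AT copy.
expansion = trim_alleles(100, "CATATTCGGC", "CATATATTCGGC")

# Invented example: a single-base change inside the repeat region.
substitution = trim_alleles(100, "CATTCG", "CATACG")

print(expansion)
print(substitution)
```

After trimming, the expansion is reduced to a short insertion and the substitution to a single-base change, which are the kinds of records that general-purpose annotators expect.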