Open minw2828 opened 1 month ago
Additionally, could @aysegokce provide a reference on the format (such as chr8:141415620-141415787) being an older representation of the tandem duplications in VNTRs, please?
I want to have a look at it as I am not familiar with variable number tandem repeats (VNTRs).
Thank you!
Hello @minw2828,
Thank you for sending the updated VCF format. We will check and update the output accordingly. For the VNTR regions, the specification offers three options, and we are currently using precise_alt2 in our VCFs. This representation works well with the tools we tried. Is there any downstream analysis requiring the other representation?
This is a representation that we used in older versions. We used it when the VNTR region was duplicated (tandem duplication). This was confusing for some downstream tools; therefore, in the current version (v1.1), we represent them as <DUP>, and we added the INSIDE_VNTR=TRUE field to the INFO column.
Best, Ayse
Hi @aysegokce,
precise_alt2 was mentioned once on page 36 of VCFv4.5, but the ALT field there was not <chr>:<start>-<end> either. It was:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample
chr1 117 precise_alt2 AG A . . GT:PS 0|1:100
Where did precise_alt2 come from?
I just want to know more about the history around this.
When I tried importing the output of Severus v0.1.1 into the AWS HealthOmics variant store, HealthOmics rejected the following VCF record because its ALT field has the form <chr>:<start>-<end>:
chr4 40402017 SEVERUS_INS11564 N chr4:40402017-40402312 60 PASS PRECISE;SVTYPE=INS;SVLEN=103;CHR2=chr4;DETAILED_TYPE=.;INSLEN=0;MAPQ=60;SUPPREAD=2;HVAF=0.00|0.00|0.18;CLUSTERID=severus_0;PctSeqSimilarity=0;PctSizeSimilarity=0.5754;PctRecOverlap=0;SizeDiff=76;StartDistance=160;EndDistance=160;GTMatch=1;TruScore=19;MatchId=81588.1.0 GT:GQ:VAF:DR:DV 0/0:242:0.09:21:2
I initially thought the above ALT field represented an insertion whose sequence matches chr4:40402017-40402312, simply because this explanation looked intuitive to me, so HealthOmics should support it. But it seems that <chr>:<start>-<end> caused confusion in the past and is therefore no longer used. Do I understand it right?
When I ran Severus v1.1 on the same sample, the previous record with ALT=chr4:40402017-40402312 was no longer called.
I have not tested Severus v1.1 on more samples to see whether ALT=<chr>:<start>-<end> still appears.
Perhaps it is fine for HealthOmics not to support ALT=<chr>:<start>-<end> if that representation is fading out of the SV world? It would be good to hear your thoughts on this.
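For anyone who wants to check whether a VCF still contains such records before importing it, here is a minimal Python sketch. It is not part of Severus or HealthOmics; the function name find_region_alts and the regular expression are my own assumptions about what a <chr>:<start>-<end> ALT looks like, and it assumes tab-separated, uncompressed VCF lines.

```python
import re

# Assumed pattern for region-style ALT values such as chr4:40402017-40402312.
# Breakend ALTs (with [ or ]) and symbolic alleles (with < >) are excluded.
REGION_ALT = re.compile(r"[^\s:<>\[\]]+:\d+-\d+")


def find_region_alts(vcf_lines):
    """Return (CHROM, POS, ID, ALT allele) for ALT alleles shaped like chr:start-end."""
    hits = []
    for line in vcf_lines:
        # Skip blank lines and header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        chrom, pos, rec_id, _ref, alt = fields[:5]
        # ALT may hold several comma-separated alleles.
        for allele in alt.split(","):
            if REGION_ALT.fullmatch(allele):
                hits.append((chrom, int(pos), rec_id, allele))
    return hits
```

Passing the lines of an open VCF file handle to find_region_alts would list every record that is likely to be rejected for this reason.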
Many thanks, Min
Hello @minw2828,
The precise_alt2 representation is the standard insertion representation in the VCF format, which we are using in the current version. The previous representation with <chr>:<start>-<end> was not in the standard VCF format, so we fixed this issue by converting all those entries to <DUP>. To specify that an entry is a duplication of a VNTR, we added the INSIDE_VNTR=TRUE field to the INFO column.
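For older VCFs that still carry the previous representation, the conversion described above could be approximated with a sketch like the following. This is not the actual Severus code: the function name to_dup_record and the regular expression are hypothetical, and the sketch only rewrites ALT and appends the INFO flag. It does not adjust SVTYPE, SVLEN, or END, and it does not add the matching header lines, all of which a complete conversion would probably need.

```python
import re

# Assumed pattern for the old region-style ALT, e.g. chr4:40402017-40402312.
REGION_ALT = re.compile(r"[^\s:<>\[\]]+:\d+-\d+")


def to_dup_record(line):
    """Rewrite one tab-separated VCF record with a region-style ALT to <DUP>.

    Records with any other ALT are returned unchanged (minus the newline).
    """
    stripped = line.rstrip("\n")
    fields = stripped.split("\t")
    # Need at least the eight fixed VCF columns to reach INFO.
    if len(fields) < 8 or not REGION_ALT.fullmatch(fields[4]):
        return stripped
    fields[4] = "<DUP>"
    info = fields[7]
    # Append the flag, or replace a missing-value placeholder.
    if info in ("", "."):
        fields[7] = "INSIDE_VNTR=TRUE"
    else:
        fields[7] = info + ";INSIDE_VNTR=TRUE"
    return "\t".join(fields)
```

Header lines (starting with #) pass through unchanged because they have no fifth column matching the pattern.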
Best, Ayse
Hello,
Thank you for creating this tool.
This issue is related to closed issue #19, which I cannot reopen.
The latest VCF specification (version 4.5), released on 28 Jun 2024, has section 5.7 devoted to STRs and VNTRs: https://github.com/samtools/hts-specs/blob/master/VCFv4.5.pdf
Would you consider aligning the Severus output format with the latest VCF specification in a future release? This would help streamline downstream tools.
Thank you!