KolmogorovLab / Severus

A tool for somatic structural variant calling using long reads

VNTRs #23

Open minw2828 opened 1 month ago

minw2828 commented 1 month ago

Hello,

Thank you for creating this tool.

This issue is related to a closed issue #19, which I cannot reopen.

The latest VCF specification (version 4.5), released on 28 Jun 2024, devotes section 5.7 to STRs and VNTRs: https://github.com/samtools/hts-specs/blob/master/VCFv4.5.pdf

Would you consider aligning the Severus output format with the latest VCF specification in a future release? This would help streamline downstream tools.

Thank you!

minw2828 commented 1 month ago

Additionally, could @aysegokce provide a reference on the format (such as chr8:141415620-141415787) being an older representation of the tandem duplications in VNTRs, please?

I want to have a look at it as I am not familiar with variable number tandem repeats (VNTRs).

Thank you!

aysegokce commented 1 month ago

Hello @minw2828,

Thank you for sending the updated VCF format. We will check and update the output accordingly. For the VNTR regions, the specification offers three options, and we are currently using precise_alt2 in our VCFs. This representation works well with the tools we tried. Is there any downstream analysis requiring the other representation?

The chr8:141415620-141415787 format is a representation that we used in older versions, when the VNTR region was duplicated (tandem duplication). It was confusing for some downstream tools; therefore, in the current version (v1.1), we represent these calls as <DUP> and have added the INSIDE_VNTR=TRUE field to the INFO column.

Best, Ayse

minw2828 commented 1 month ago

Hi @aysegokce,

precise_alt2 was mentioned once on VCFv4.5 page 36, but the ALT field was not <chr>:<start>-<end> either. It was:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample
chr1 117 precise_alt2 AG A . . GT:PS 0|1:100

Where did the precise_alt2 come from? I just want to know more about the history around this.

When I tried importing the outputs of Severus v0.1.1 into the AWS HealthOmics variant store, HealthOmics rejected the following VCF record because its ALT field has the form <chr>:<start>-<end>:

chr4 40402017 SEVERUS_INS11564 N chr4:40402017-40402312 60 PASS PRECISE;SVTYPE=INS;SVLEN=103;CHR2=chr4;DETAILED_TYPE=.;INSLEN=0;MAPQ=60;SUPPREAD=2;HVAF=0.00|0.00|0.18;CLUSTERID=severus_0;PctSeqSimilarity=0;PctSizeSimilarity=0.5754;PctRecOverlap=0;SizeDiff=76;StartDistance=160;EndDistance=160;GTMatch=1;TruScore=19;MatchId=81588.1.0 GT:GQ:VAF:DR:DV 0/0:242:0.09:21:2

I initially thought the above ALT field represented an insertion whose sequence matches chr4:40402017-40402312, simply because that reading looked intuitive to me, and so I expected HealthOmics to support it. But it seems that <chr>:<start>-<end> caused confusion in the past and is therefore no longer used. Do I understand it right?

When I ran Severus v1.1 on the same sample, the previous ALT=chr4:40402017-40402312 record was no longer called. I have not yet tested Severus v1.1 on more samples to see whether ALT=<chr>:<start>-<end> still appears. Perhaps it is fine for HealthOmics not to support ALT=<chr>:<start>-<end> if that form is fading out of the SV world? It would be good to hear your thoughts on this.
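To check more samples for this, a minimal sketch along the following lines could flag any record whose ALT still has the <chr>:<start>-<end> form. It assumes an uncompressed, tab-delimited VCF read as text lines; the function name find_legacy_alts and the regular expression are illustrative assumptions, not part of Severus or of the VCF specification.

```python
import re

# Assumed pattern for the legacy ALT form <chr>:<start>-<end>,
# e.g. chr4:40402017-40402312. Symbolic alleles (<DUP>) and breakend
# notation (N[chr3:200[) are excluded by disallowing <, >, [ and ].
LEGACY_ALT = re.compile(r"^[^\s:<>\[\]]+:\d+-\d+$")


def find_legacy_alts(vcf_lines):
    """Yield (chrom, pos, id, alt) for records whose ALT looks like chr:start-end."""
    for line in vcf_lines:
        # Skip blank lines, meta-information lines, and the column header.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a well-formed record; ignore in this sketch
        chrom, pos, rec_id, _ref, alt = fields[:5]
        # ALT may hold several comma-separated alleles.
        for allele in alt.split(","):
            if LEGACY_ALT.match(allele):
                yield chrom, int(pos), rec_id, allele
```

For example, `with open("sample.vcf") as fh: hits = list(find_legacy_alts(fh))` would collect the offending records from a hypothetical file named sample.vcf.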

Many thanks, Min

aysegokce commented 1 month ago

Hello @minw2828,

The precise_alt2 representation is the standard insertion representation in the VCF format, and it is what we use in the current version. The previous <chr>:<start>-<end> representation was not standard VCF, so we fixed this by converting all those entries to <DUP>. To specify that such an entry is a duplication of a VNTR, we added the INSIDE_VNTR=TRUE field to the INFO column.
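For VCFs already produced by an older version, a post-hoc rewrite of a single record line might look roughly like the sketch below. This is an illustration under stated assumptions, not the Severus implementation: the helper name to_dup_record and the regular expression are hypothetical, and the sketch only swaps ALT and appends the INFO field. SVTYPE and the other INFO values are left untouched and would likely need to be reconciled separately, along with the header definitions.

```python
import re

# Assumed pattern for the legacy ALT form <chr>:<start>-<end>.
LEGACY_ALT = re.compile(r"^[^\s:<>\[\]]+:\d+-\d+$")


def to_dup_record(line):
    """Return the record with a legacy chr:start-end ALT replaced by <DUP>.

    Records that do not match (or have fewer than 8 columns) are returned
    unchanged, apart from stripping the trailing newline.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8 or not LEGACY_ALT.match(fields[4]):
        return line.rstrip("\n")
    fields[4] = "<DUP>"
    info = fields[7]
    # Append INSIDE_VNTR=TRUE unless the record already carries the key.
    if "INSIDE_VNTR=" not in info:
        if info in ("", "."):
            fields[7] = "INSIDE_VNTR=TRUE"
        else:
            fields[7] = info + ";INSIDE_VNTR=TRUE"
    return "\t".join(fields)
```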

Best, Ayse