KolmogorovLab / Severus

A tool for somatic structural variant calling using long reads
Other
102 stars 4 forks source link

The reference sequence is masked with N #21

Open DexinYang1998 opened 4 months ago

DexinYang1998 commented 4 months ago

Hi,

Thanks for developing such a wonderful tool for long-read analysis. Recently, I performed Severus to call somatic mutations between paired tumor-normal samples using the command --target-bam COLO829_merge_haplotagged.bam --control-bam ../colo829bl/COLO829BL_merge_haplotagged.bam --out-dir severus_out -t 80 --phasing-vcf /data/dyang11/COLO829/phasing2/HiC_ONT_haplotype.phased.VCF --vntr-bed /data/dyang11/index/repeats/chm13_VNTR.bed --use-supplementary-tag --vaf-thr 0.05 --min-support 5 --output-read-ids.

The command indeed works but the reference sequence of the output vcf file is N-masked. For example:

chr1    72963755    severus_DEL230  N   <DEL>   60.0    PASS    PRECISE;SVTYPE=DEL;SVLEN=7696;END=72971451;STRANDS=+-;MAPQ=60.0;PHASESETID=50486685;HP=1    GT:GQ:VAF:hVAF:DR:DV    0/0:242:0.17:0.00,1.00,0.00:24:5
chr1    207054623   severus_DEL464  N   <DEL>   60.0    PASS    IMPRECISE;SVTYPE=DEL;SVLEN=33618;END=207088241;STRANDS=+-;MAPQ=60.0;PHASESETID=0;HP=2   GT:GQ:VAF:hVAF:DR:DV    0/1:629:0.29:0.00,0.00,1.00:79:33
chr1    223647755   severus_DUP499  N   <DUP>   60.0    PASS    PRECISE;SVTYPE=DUP;SVLEN=153463;END=223801218;STRANDS=-+;DETAILED_TYPE=tandem_duplication;MAPQ=60.0;PHASESETID=0;HP=2   GT:GQ:VAF:hVAF:DR:DV    0/1:603:0.31:0.00,0.00,1.00:129:59
chr1    223783960   severus_DEL500  N   <DEL>   60.0    PASS    PRECISE;SVTYPE=DEL;SVLEN=3187;END=223787147;STRANDS=+-;MAPQ=60.0;PHASESETID=0;HP=2  GT:GQ:VAF:hVAF:DR:DV    0/1:680:0.25:0.00,0.00,1.00:182:62

Can you help me figure it out? Thank you very much!

aysegokce commented 4 months ago

Hello, In severus, we only provide a sequence for insertions (in the ALT column). For deletion and duplications, we provide the reference position instead.

Best Ayse

minw2828 commented 2 months ago

Hello,

When I ran quay/biocontainers/severus:0.1.1--pyhdfd78af_0, I also observed an insertion with a REF field as N, although the real underlying reference sequence is A (manually checked).

chr4 40402017 SEVERUS_INS11564 N chr4:40402017-40402312 60 PASS PRECISE;SVTYPE=INS;SVLEN=103;CHR2=chr4;DETAILED_TYPE=.;INSLEN=0;MAPQ=60;SUPPREAD=2;HVAF=0.00|0.00|0.18;CLUSTERID=severus_0;PctSeqSimilarity=0;PctSizeSimilarity=0.5754;PctRecOverlap=0;SizeDiff=76;StartDistance=160;EndDistance=160;GTMatch=1;TruScore=19;MatchId=81588.1.0 GT:GQ:VAF:DR:DV 0/0:242:0.09:21:2

I am having trouble running version 1.1 at the moment due to a spike in memory usage.

I just want to flag this. Hope it will be resolved in the future release.

Ideally, as illustrated in the VCF specification (version 4.5) section 5.7 page 38, the reference sequence is written as it is. Ref: https://github.com/samtools/hts-specs/blob/master/VCFv4.5.pdf

Many thanks, Min

aysegokce commented 2 months ago

Hello @minw2828,

We are currently not providing a reference sequence, so we keep the REF column as N as it is widely used in SV vcfs. If you are using truvari with severus output, I suggest running it with --pctseq 0. Alternatively, we developed minda for benchmarking and vcf merging for cancer genomes.

For the memory usage issue with v1.1, can you please share the command line and the severus log file?

Best Ayse