fritzsedlazeck / SURVIVOR

Toolset for SV simulation, comparison and filtering
MIT License
337 stars 46 forks source link

Incorrect zero-based VCF file #202

Open ethering opened 7 months ago

ethering commented 7 months ago

Hi, I'm using SURVIVOR (v1.0.7) for the first time and trying out some test data. I've generated a params file using SURVIVOR simSV test_params.param and then simulated the SVs onto a reference genome. SURVIVOR simSV ref.fasta test_params.param 0.1 0 simulated

Upon viewing the generated simulated.vcf file, I see that the VCF file is using zero-based coordinates when VCF uses 1-based coordinates. Here are the first 16 characters of my reference sequence:

>SJAP_Chr1
CACCAAAAACCCTAAG

Here are the first 16 characters of my simulated sequence:

>SJAP_Chr1
AACAAAAAAGCCTAAA

and here are the lines in the VCF file that relate to the variants (the VCF header states 'source=Sniffles').

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
SJAP_Chr1       0       SNP0SURVIVOR    C       A       PRECISE;SVMETHOD=SURVIVOR_sim;SVLEN=1   GT:GL:GQ:FT:RC:DR:DV:RR:RV      1/1
SJAP_Chr1       3       SNP1SURVIVOR    C       A       PRECISE;SVMETHOD=SURVIVOR_sim;SVLEN=1   GT:GL:GQ:FT:RC:DR:DV:RR:RV      1/1
SJAP_Chr1       9       SNP2SURVIVOR    C       G       PRECISE;SVMETHOD=SURVIVOR_sim;SVLEN=1   GT:GL:GQ:FT:RC:DR:DV:RR:RV      1/1
SJAP_Chr1       15      SNP3SURVIVOR    G       A       PRECISE;SVMETHOD=SURVIVOR_sim;SVLEN=1   GT:GL:GQ:FT:RC:DR:DV:RR:RV

As you can see, the first variant is at position zero and then the subsequent co-ordinates are also zero-based. The actual 'POS' values should be 1, 4, 10, and 16.

Based on my initial use of SURVIVOR, I have two other comments:

  1. The 'simulated.bed' file is not in BED format, so perhaps it would be better to give it a different file extension
  2. I can't find any command line help for the simSV method. Typing SURVIVOR simSV --help gives the stdout of Parameter file generated and produces a file called --help
waltergallegog commented 2 weeks ago

Hi @ethering have you been able to overcome this issue?

I'm observing the same behavior, and this issue is breaking some downstream analysis (liftover).

A simple check with

bcftools norm --check-ref e --fasta-ref <fasta_file> <vcf_file> -o <vcf_norm_file>

Gives errors like the following:

Reference allele mismatch at chr7 .. REF_SEQ:'GCTCTTCTCTTCTT' vs VCF:'CTCTTCTCTTCTTA
waltergallegog commented 2 weeks ago

Besides the issue with POS I would like to add that the END information field seems 1-based to me, which is correct according to the vcf specification, but inconsistent with the 0-based POS. This brakes the following relation from the vcf specification:

For precise variants, END is POS + length of REF allele − 1, and the for imprecise variants the corresponding best estimate.

A simple example of the issue showing a diff between Reference and produced fasta, and the corresponding vcf:

Screenshot_20240710_135945

(The lines in the fasta file are of length = 100)

The deletion should have:

POS = 307 (1 based)
LEN = 14
END = 320 (1 based)

307 + 14 -1 = 320
ethering commented 2 weeks ago

Hi @ethering have you been able to overcome this issue?

I'm observing the same behavior, and this issue is breaking some downstream analysis (liftover).

A simple check with

bcftools norm --check-ref e --fasta-ref <fasta_file> <vcf_file> -o <vcf_norm_file>

Gives errors like the following:

Reference allele mismatch at chr7 .. REF_SEQ:'GCTCTTCTCTTCTT' vs VCF:'CTCTTCTCTTCTTA

No. I modified my workflow so that it didn't require the VCF file. One alternative would be to create a script that increments the POS position by one.