fritzsedlazeck / SURVIVOR

Toolset for SV simulation, comparison and filtering
MIT License

insertions.fa contains no insertion sequences #205

Open ethering opened 6 months ago

ethering commented 6 months ago

Hi, I'm running SURVIVOR v1.0.7 and I'm generating a simulated genome sequence with SVs in order to map my own reads to it and call SVs. First I'm generating a parameters file:

$ SURVIVOR simSV test_params.param

Output (I've increased INDEL_number to ensure insertions are generated):

PARAMETER FILE: DO JUST MODIFY THE VALUES AND KEEP THE SPACES!
DUPLICATION_minimum_length: 100
DUPLICATION_maximum_length: 10000
DUPLICATION_number: 3
INDEL_minimum_length: 20
INDEL_maximum_length: 500
INDEL_number: 10
TRANSLOCATION_minimum_length: 1000
TRANSLOCATION_maximum_length: 3000
TRANSLOCATION_number: 2
INVERSION_minimum_length: 600
INVERSION_maximum_length: 800
INVERSION_number: 4
INV_del_minimum_length: 600
INV_del_maximum_length: 800
INV_del_number: 2
INV_dup_minimum_length: 600
INV_dup_maximum_length: 800
INV_dup_number: 2

Then I ran simSV with option 3=1 to generate a simulated reference sequence containing the SVs:

$ SURVIVOR simSV reference.fasta test_params.param 0.1 1 simulated
# Chrs passed size threshold:4
generate SV
apply mut ref!
apply: Mt 21146 4
apply: Mt 42091 4
apply: Mt 43332 4
apply: Chr3 45180 1
apply: Chr3 344508 2
apply: Chr2 809100 2
apply: Chr3 869844 1
apply: Chr1 1336924 4
apply: Chr2 1360145 2
apply: Chr3 2220985 2
apply: Chr1 1233970 3
apply: Chr2 3842835 4
apply: Chr2 4354418 1
apply: Chr1 4596703 4
apply: Chr2 860780 3
apply: Chr1 4982876 1
Post SV simulation Genome checking:
generate SNP
write genome
write SV
Done: SV+SNP simulated

Sometimes when I run SURVIVOR simSV to generate the SVs, simulated.insertions.fa is completely empty; other times it is not empty but contains only the FASTA header lines of the insertions:

$ cat simulated.insertions.fa 
>Chr3_45180

>Chr3_869844

>Chr2_4354418

>Chr1_4982876

I've run SURVIVOR simSV a number of times, using around 5 different param files (with different SV min/max sizes), and this behaviour is consistent. However, when I run simSV with option 3=0, my insertions.fa file contains the insertion sequences.

Perhaps I've misunderstood something here, but intuitively I would expect that with option 3=1 (simulate genome), insertions.fa would contain the actual insertions in the simulated genome, whereas with option 3=0 (simulate reads), insertions.fa would be empty, as the insertions are generated by SURVIVOR simreads, which doesn't require the insertions.fa file.
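
The symptom described above (headers present, sequences missing) can be checked by listing the FASTA records that carry no sequence. The following is a minimal Python sketch, assuming a plain FASTA layout; the function name `empty_fasta_records` is hypothetical and not part of SURVIVOR:

```python
def empty_fasta_records(lines):
    """Return the headers of FASTA records that have no sequence characters."""
    empty = []
    header = None   # header of the record currently being read
    seq_len = 0     # number of sequence characters seen for that record
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # A new record starts: close the previous one first.
            if header is not None and seq_len == 0:
                empty.append(header)
            header = line[1:]
            seq_len = 0
        elif header is not None:
            seq_len += len(line)  # blank lines add zero
    # Close the final record.
    if header is not None and seq_len == 0:
        empty.append(header)
    return empty


# Header-only content, copied from the report above.
example = """>Chr3_45180

>Chr3_869844

>Chr2_4354418

>Chr1_4982876
"""
print(empty_fasta_records(example.splitlines()))
```

To check a real output file, pass an open file handle instead of the example lines, e.g. `empty_fasta_records(open("simulated.insertions.fa"))`. A completely empty file yields an empty list, so it is worth checking the file size separately.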

fritzsedlazeck commented 6 months ago

Hi Graham, sorry for this. Do you see the insertions in the VCF file? Thanks, Fritz

ethering commented 6 months ago

Hi Fritz, yes, they're at the end of the VCF file. Here are the VCF entries:

Chr3    45180   INS1487952SURVIVOR  N   <INS>   .   LowQual PRECISE;SVTYPE=INS;SVMETHOD=SURVIVOR_sim;CHR2=Chr3;END=45341;SVLEN=161  GT:GL:GQ:FT:RC:DR:DV:RR:RV  1/1
Chr3    869844  INS1487955SURVIVOR  N   <INS>   .   LowQual PRECISE;SVTYPE=INS;SVMETHOD=SURVIVOR_sim;CHR2=Chr3;END=870223;SVLEN=379 GT:GL:GQ:FT:RC:DR:DV:RR:RV  1/1
Chr2    4354418 INS1487961SURVIVOR  N   <INS>   .   LowQual PRECISE;SVTYPE=INS;SVMETHOD=SURVIVOR_sim;CHR2=Chr2;END=4354910;SVLEN=492    GT:GL:GQ:FT:RC:DR:DV:RR:RV  1/1
Chr1    4982876 INS1487964SURVIVOR  N   <INS>   .   LowQual PRECISE;SVTYPE=INS;SVMETHOD=SURVIVOR_sim;CHR2=Chr1;END=4983124;SVLEN=248    GT:GL:GQ:FT:RC:DR:DV:RR:RV  1/1

Also, when I map real reads to the simulated reference with Minimap2 and use Sniffles to call SVs, the insertions are also reported in eval_simulated_right.vcf.
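
The INS records quoted above can be pulled out of the VCF and compared with the FASTA headers programmatically. This is a rough Python sketch, assuming the column layout shown in the quoted records (INFO in the eighth column, with `SVTYPE`, `END` and `SVLEN` keys) and assuming FASTA headers of the form `CHROM_POS` as in the `cat` output earlier; the function name `ins_records` is hypothetical:

```python
def ins_records(vcf_lines):
    """Yield (chrom, pos, end, svlen) for each record with SVTYPE=INS."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.split()  # whitespace split: works for tabs or spaces
        chrom, pos, info = fields[0], int(fields[1]), fields[7]
        # Turn "KEY=VALUE;FLAG;KEY=VALUE" into a dict, ignoring bare flags.
        tags = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        if tags.get("SVTYPE") == "INS":
            yield chrom, pos, int(tags["END"]), int(tags["SVLEN"])


# Two of the records quoted above.
example = (
    "Chr3\t45180\tINS1487952SURVIVOR\tN\t<INS>\t.\tLowQual\t"
    "PRECISE;SVTYPE=INS;SVMETHOD=SURVIVOR_sim;CHR2=Chr3;END=45341;SVLEN=161\t"
    "GT:GL:GQ:FT:RC:DR:DV:RR:RV\t1/1\n"
    "Chr3\t869844\tINS1487955SURVIVOR\tN\t<INS>\t.\tLowQual\t"
    "PRECISE;SVTYPE=INS;SVMETHOD=SURVIVOR_sim;CHR2=Chr3;END=870223;SVLEN=379\t"
    "GT:GL:GQ:FT:RC:DR:DV:RR:RV\t1/1\n"
)

for chrom, pos, end, svlen in ins_records(example.splitlines()):
    # Expected FASTA header, SVLEN, and whether END - POS matches SVLEN.
    print(f"{chrom}_{pos}", svlen, end - pos == svlen)
```

In the four records quoted above, END minus POS equals SVLEN, so the VCF alone gives the position and length of each simulated insertion even when the FASTA file holds no sequence.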

fritzsedlazeck commented 6 months ago

OK, that might be the best workaround for now. Sorry about this. Lately I have been more focused on the VCF file than the FASTA file. Cheers, Fritz