fritzsedlazeck / SURVIVOR

Toolset for SV simulation, comparison and filtering
MIT License

Terminate program as it could not find a non overlapping region #33

Closed YiweiNiu closed 6 years ago

YiweiNiu commented 6 years ago

Hi,

I ran simSV with SURVIVOR simSV $REFERENCE parameter_file 0 1 test but got the following error:

# Chrs passed size threshold:24
generate SV
Terminate program as it could not find a non overlapping region

The reference I used was hg38, and the parameter file looked like this:

PARAMETER FILE: DO JUST MODIFY THE VALUES AND KEEP THE SPACES!
DUPLICATION_minimum_length: 50
DUPLICATION_maximum_length: 1000000
DUPLICATION_number: 2500
INDEL_minimum_length: 50
INDEL_maximum_length: 1000000
INDEL_number: 2500
TRANSLOCATION_minimum_length: 50
TRANSLOCATION_maximum_length: 1000000
TRANSLOCATION_number: 2500
INVERSION_minimum_length: 50
INVERSION_maximum_length: 1000000
INVERSION_number: 2500
INV_del_minimum_length: 50
INV_del_maximum_length: 1000000
INV_del_number: 2500
INV_dup_minimum_length: 50
INV_dup_maximum_length: 1000000
INV_dup_number: 2500

Do you know how this happened? Thanks in advance!

fritzsedlazeck commented 6 years ago

Hi, so what SURVIVOR does is find non-overlapping regions in which to simulate these variants. I implemented it so that it tries multiple times to find a new location before it gives up and reports this error.

Now, the human genome is huge, but I have never tried to simulate that many SVs. Could you reduce the number of SVs and run it again, just to see whether SURVIVOR handles your FASTA file correctly?

Thanks Fritz
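
For illustration, the retry behaviour described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not SURVIVOR's actual code: the function name place_intervals, the max_attempts limit, and the single linear genome are all hypothetical.

```python
import random


def place_intervals(genome_length, lengths, max_attempts=100, seed=0):
    """Place each interval at a random start, retrying when it overlaps
    an interval placed earlier, and give up after max_attempts tries.

    Hypothetical helper: the retry limit and the single linear "genome"
    are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    placed = []  # half-open (start, end) intervals accepted so far
    for length in lengths:
        if length > genome_length:
            raise RuntimeError("interval is longer than the genome")
        for _ in range(max_attempts):
            start = rng.randrange(0, genome_length - length + 1)
            end = start + length
            # Accept only if the candidate touches no earlier interval.
            if all(end <= s or start >= e for s, e in placed):
                placed.append((start, end))
                break
        else:
            # Every attempt collided with an existing interval.
            raise RuntimeError("could not find a non overlapping region")
    return placed
```

The more of the genome the earlier intervals already cover, the more likely every attempt collides, which is when a scheme like this gives up.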

YiweiNiu commented 6 years ago

Thank you for your quick reply!

When I reduced the number of each SV type to 1000, I got the following error. Setting the number to 500 still gave the same error.

terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr
Aborted (core dumped)

When I reduced this number to 100, it ran successfully. But I have some questions about the numbers:

# the parameter file
$cat parameter_file1
PARAMETER FILE: DO JUST MODIFY THE VALUES AND KEEP THE SPACES!
DUPLICATION_minimum_length: 50
DUPLICATION_maximum_length: 1000000
DUPLICATION_number: 1001
INDEL_minimum_length: 50
INDEL_maximum_length: 1000000
INDEL_number: 100
TRANSLOCATION_minimum_length: 50
TRANSLOCATION_maximum_length: 1000000
TRANSLOCATION_number: 100
INVERSION_minimum_length: 50
INVERSION_maximum_length: 1000000
INVERSION_number: 100
INV_del_minimum_length: 50
INV_del_maximum_length: 1000000
INV_del_number: 100
INV_dup_minimum_length: 50
INV_dup_maximum_length: 1000000
INV_dup_number: 100

# command to run simSV
$path2SURVIVOR simSV ../refer/Homo_sapiens_assembly38.regular.fasta parameter_file1 0 1 test1

# SV numbers
$cut -f 5 test.bed |sort|uniq -c
     60 DEL
     40 INS
    100 INV
    200 TRA

I expected 100 SVs of each type designated in the parameter file. I guess INVERSION, INV_dup, and INV_del all count as INV, and 100 INDEL means num(INS) + num(DEL) = 100? But TRA (TRANSLOCATION) was 200.

I have two follow-up questions:

Any help would be greatly appreciated.

fritzsedlazeck commented 6 years ago

Hi,

  1. which reference build are you using?

  2. Yes, indels are randomly chosen to be INS or DEL. INV is clear. TRA is a bit tricky, because a TRA is encoded only by its breakpoints. So for each translocation (a swap of regions in this case) you will get two breakpoints reported, each linking two chromosomes.

  3. SURVIVOR picks a length at random within the min-max interval you specify (e.g. 50 bp to 1 Mb). Next it chooses a random location on the genome and alters the genome accordingly.

  4. Yes, all the SVs generated are homozygous right now. Do you mean heterozygous? You could run a second simulation and combine the two; then they would be heterozygous. I am planning to extend this, but I don't know when that will happen.
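
As a rough illustration of point 3, the total sequence requested by a parameter file can be estimated with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes lengths are drawn uniformly between the minimum and maximum (the thread only says "at random"), treats every SV type as occupying its full length on the reference, and uses an approximate hg38 size of 3.1 Gbp. The function name expected_footprint is hypothetical.

```python
GENOME_BP = 3_100_000_000  # approximate size of hg38 (assumption)


def expected_footprint(n_types, n_per_type, min_len, max_len):
    """Expected total bp requested, assuming uniformly drawn lengths."""
    mean_len = (min_len + max_len) / 2
    return n_types * n_per_type * mean_len


# Six SV types with lengths of 50 bp to 1 Mb, as in the parameter files above.
for n in (2500, 1000, 500, 100):
    bp = expected_footprint(6, n, 50, 1_000_000)
    print(f"{n:>5} per type: ~{bp / 1e9:.2f} Gbp, "
          f"{bp / GENOME_BP:.0%} of the genome")
```

Under these assumptions, 2500 SVs per type would ask for roughly 7.5 Gbp, more than the genome itself, so non-overlapping placement could not succeed, while 100 per type asks for about 0.3 Gbp.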

YiweiNiu commented 6 years ago

Thank you for your reply!

I used hg38 as the reference, downloaded from the GATK bundle. I only included chromosomes 1-22, X, Y and Mt.

About simulating heterozygous SVs, could you please explain more? How do I combine two simulations? I'm quite new to this. Should I simulate reads from both the unmodified genome and the modified genome, then combine the reads?

fritzsedlazeck commented 6 years ago

You could run one simulation and combine the reads simulated from that genome with reads simulated from the reference genome. Or you could simulate two genomes and combine the reads from both. The principle is the same either way. I often use mason or dwgsim for simulating short reads; for long reads I use SURVIVOR. I hope that helps. I will look into the simulation next week and hopefully can improve it.
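
The combining step itself can be as simple as concatenating the read files. Here is a minimal Python sketch, assuming uncompressed FASTQ files that each end with a newline. The helper combine_reads and the file names in the usage example are hypothetical, and the read simulators mentioned above are not invoked here.

```python
import shutil


def combine_reads(inputs, output):
    """Concatenate several read files (e.g. FASTQ) into one output file.

    Assumes uncompressed inputs that each end with a newline.
    """
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Usage (hypothetical file names): reads simulated from the modified
# genome plus reads simulated from the unmodified reference.
# combine_reads(["modified_reads.fq", "reference_reads.fq"], "het_reads.fq")
```

If each read set is simulated at about half the target coverage, the combined set reaches the full coverage with the SVs present in roughly half of the reads. For paired-end data, the first-mate and second-mate files would need to be combined separately and in the same input order.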

YiweiNiu commented 6 years ago

Thank you very much! It's very helpful 👍