davidebolo1993 / VISOR

VarIant SimulatOR for short, long and linked reads
GNU Lesser General Public License v3.0
41 stars 11 forks source link

Unable to generate 1bp deletion #9

Closed cfriedline closed 4 years ago

cfriedline commented 4 years ago

We discovered this recently and, at your request @davidebolo1993, are opening this issue.

Given the sequence below

>test
ACGTAGCCAATTAACGAA

and the HACk bed:

test    5       6       deletion        None    0

The following haplotype is produced:

>test
ACGTCCAATTAACGAA

We expected just the A would be removed. However, AG is removed instead, following ACGT.

Thanks for looking into it!

davidebolo1993 commented 4 years ago

Hi @cfriedline,

Generating 1 bp insertions/deletions is not the main purpose of VISOR, as it is a structural variant simulator (variants longer than 50 bps). However, I agree that the HACk module should be able to provide correct results for such a case. I'll look into this as soon as possible and let you know.

Best,

Davide

davidebolo1993 commented 4 years ago

Hi @cfriedline,

I've somehow fixed this. In principle, the previous behaviour was correct as HACk deleted from start to end (start and end are included). This is what most people expect from a SV simulator, I guess. I've briefly reworked the code and now you should be able to get a 1bp deletion using the BED file provided above. What I came up whit, however, prevents the creation of 2 bp-deletions, which, as I said, should be still fine for VISOR, as it is mainly used to generate variants larger than 50 bps. Let me know if this solved your issue.

Best,

Davide

cfriedline commented 4 years ago

Hi @davidebolo1993. You're so fast - I didn't even get to respond back to you!

From our perspective, if we can do 1bp insertions and substitute 1 base in SNP mode, then it seemed coherent to us that being able to make a 1bp deletion would also be possible. Our use case might not align with the majority of VISOR users, but we are doing benchmarking/validation which is critical to assessing and deploying some clinical assays. Running through a range of deletions, starting at 1, is part of that spec. Your change has now made it impossible to do a 2bp deletion, which is also on our road map, so it seems like we've maybe swapped one problem for another.

A change I had in mind would be to have a flag that made the end coordinate inclusive or exclusive (--include-end/--no-include-end). There are likely downstream effects to other parts of HACk, but passing --no-include-end with the bed file above would yield the desired results. Just an idea.

Thanks for jumping on this so quickly.

davidebolo1993 commented 4 years ago

Hi @cfriedline,

yes, that indeed would be possible. I'm on vacation right now, but I'll add this as soon as I'm back. For the time being, you can generate 1 bp deletions/insertions and insertions of any length. I'll let you know when I'm done with this.

Best,

Davide

davidebolo1993 commented 4 years ago

Hi @cfriedline,

I found some time to go through this. For deletions, the info field (5th entry of the BED file) can now be either None or 1bp. When None, HACk acts normally and deletes from start to end (both included); when 1bp, if start and end are 1 bp-close, it deletes only the nucleotide in start. In practice, for your example sequence

>test
ACGTAGCCAATTAACGAA

the BED file

test    5       6       deletion        None    0

deletes 2 bps (AG, as for the normal HACk behaviour), while the BED file:

test    5       6       deletion        1bp    0

deletes just the A. For larger deletions (end-start > 1), the normal HACk behaviour is used. This should solve your issue.

Best,

Davide