adamewing / bamsurgeon

tools for adding mutations to existing .bam files, used for testing mutation callers
MIT License

sizes of successful CNVs - smaller than expected #166

Open RichardCorbett opened 4 years ago

RichardCorbett commented 4 years ago

Hi there, I have been using bamsurgeon to simulate germline copy number changes. I am inserting deletions and duplications ranging in size from 100 bp to 10 Mb.

One of my target sizes is 5 kb. When I checked how many of my target variants were successfully integrated, I found that most of the successful ones were smaller than the requested event, usually ending up as a deletion of 1 kb to 3 kb.

This led me to do a test across a range of sizes to see if there was a size "hump" I needed to get over.

I created lists of ~100 homozygous deletions at each target size (1000, 2000, ..., 10000 bp) and checked the size distribution of the successfully integrated variants.

[image: size distribution of successfully integrated deletions at each target size]

Here are some example lines of the variants I am attempting to integrate for the 8Kb tests.

1 5048109 5056109 DEL 1
1 21392935 21400935 DEL 1
1 72189128 72197128 DEL 1
1 77025708 77033708 DEL 1
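As a quick check on an input file like this, the requested size of each event can be computed directly from the coordinates. This is a minimal sketch, assuming whitespace-separated columns of chromosome, start, end and mutation type followed by one trailing value, as in the example lines; `parse_variant_line` is a hypothetical helper and not part of bamsurgeon.

```python
def parse_variant_line(line):
    """Split one variant line into fields and compute the requested event size."""
    chrom, start, end, svtype = line.split()[:4]
    start, end = int(start), int(end)
    return {
        "chrom": chrom,
        "start": start,
        "end": end,
        "type": svtype,
        "size": end - start,  # requested size in bp
    }


# Example lines copied from the 8Kb test input.
example_lines = [
    "1 5048109 5056109 DEL 1",
    "1 21392935 21400935 DEL 1",
    "1 72189128 72197128 DEL 1",
    "1 77025708 77033708 DEL 1",
]

variants = [parse_variant_line(line) for line in example_lines]
for v in variants:
    print(v["chrom"], v["start"], v["end"], v["type"], v["size"])
```

Each of the four example events works out to 8000 bp, which matches the intended 8Kb target size.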

It looks like there is a limit in this range capping the sizes of target events around 2800bp. Is there a way to get around this?

thanks, Richard

adamewing commented 4 years ago

Hi Richard, Sorry to hear you're having trouble - that does look strange. Could you try exchanging "DEL" for "BIGDEL" in the mutation input file and let me know how you go?

RichardCorbett commented 4 years ago

Thanks @adamewing, I'm trying the same test, but this time using only "BIGDEL" events. For the sets with events of sizes 1Kb-4Kb I get an error at the beginning of the run after I get a warning for each of my events:

WARNING 2020-09-02 08:29:33,818 Y 40543929 40547929 BIGDEL 1 is under 5kbp, "BIG" mutation types will yield unpredictable results, converting to DEL
WARNING 2020-09-02 08:29:33,819 Y 40901475 40905475 BIGDEL 1 is under 5kbp, "BIG" mutation types will yield unpredictable results, converting to DEL
WARNING 2020-09-02 08:29:33,819 Y 49701495 49705495 BIGDEL 1 is under 5kbp, "BIG" mutation types will yield unpredictable results, converting to DEL
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.6/site-packages/bamsurgeon-1.2-py3.6.egg/EGG-INFO/scripts/addsv.py", line 550, in makemut
IndexError: list index out of range
"""
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/addsv.py", line 4, in <module>
    __import__('pkg_resources').run_script('bamsurgeon==1.2', 'addsv.py')
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 654, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1441, in run_script
    exec(script_code, namespace, namespace)
  File "/usr/local/lib/python3.6/site-packages/bamsurgeon-1.2-py3.6.egg/EGG-INFO/scripts/addsv.py", line 1358, in <module>
  File "/usr/local/lib/python3.6/site-packages/bamsurgeon-1.2-py3.6.egg/EGG-INFO/scripts/addsv.py", line 1121, in main
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
IndexError: list index out of range

For the set of 5k events it does seem to fire up ok, but appears to fail when trying to insert the variants:

INFO 2020-09-02 08:38:00,539 X_28368363_28368363_BIGDEL removing addsv.tmp/X_28368363_28368363_BIGDEL.wgsimtmp.1df84ab2-e7d5-4acc-9afe-8dfa0a68d67b.2.fq
INFO 2020-09-02 08:38:00,549 X_28368363_28368363_BIGDEL temporary bam: addsv.tmp/X_28368363_28368363_BIGDEL.0cba3dc8-7c31-4548-b461-35bc6f806757.muts.bam
INFO 2020-09-02 08:38:01,144 5_8744659_8744659_BIGDEL best contig length: 2184
INFO 2020-09-02 08:38:01,144 5_8744659_8744659_BIGDEL best transloc contig length: 8676
INFO 2020-09-02 08:38:01,207 5_8744659_8744659_BIGDEL alignment result: ['SUMMARY', '8841', '349', '2128', '2221', '4000']
INFO 2020-09-02 08:38:01,209 5_8744659_8744659_BIGDEL trimmed contig length: 1779
INFO 2020-09-02 08:38:01,209 5_8744659_8744659_BIGDEL start: 8742659, end: 8746659, tgtstart: 2221, tgtend: 4000, refstart: 8744880, refend: 8746659
INFO 2020-09-02 08:38:01,286 5_8744659_8744659_BIGDEL alignment result: ['SUMMARY', '29838', '2574', '8547', '2027', '8000']
INFO 2020-09-02 08:38:01,291 5_8744659_8744659_BIGDEL trimmed contig length: 5973
INFO 2020-09-02 08:38:01,291 5_8744659_8744659_BIGDEL trn_start: 8745659, trn_end: 8753659, trn_tgtstart: 2027, trn_tgtend:8000 , trn_refstart: 8747686, trn_refend: 8753659
WARNING 2020-09-02 08:38:01,292 5_8744659_8744659_BIGDEL best contig too short to make mutation!
Traceback (most recent call last):
  File "/usr/local/bin/addsv.py", line 4, in <module>
    __import__('pkg_resources').run_script('bamsurgeon==1.2', 'addsv.py')
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 654, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1441, in run_script
    exec(script_code, namespace, namespace)
  File "/usr/local/lib/python3.6/site-packages/bamsurgeon-1.2-py3.6.egg/EGG-INFO/scripts/addsv.py", line 1358, in <module>
  File "/usr/local/lib/python3.6/site-packages/bamsurgeon-1.2-py3.6.egg/EGG-INFO/scripts/addsv.py", line 1156, in main
  File "/usr/local/lib/python3.6/site-packages/bamsurgeon-1.2-py3.6.egg/EGG-INFO/scripts/addsv.py", line 460, in fetch_read_names
  File "pysam/libcalignmentfile.pyx", line 1081, in pysam.libcalignmentfile.AlignmentFile.fetch
  File "pysam/libchtslib.pyx", line 690, in pysam.libchtslib.HTSFile.parse_region
ValueError: invalid coordinates: start (37330220) > stop (37329220)
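
The final ValueError is raised by pysam when a region's start is greater than its stop. A minimal sketch of a guard that would flag such a region before it reaches a fetch call is shown here; `is_valid_region` is a hypothetical helper, and this does not reflect how addsv.py actually derives its coordinates.

```python
def is_valid_region(start, end):
    """Return True when start and end form a non-negative, non-inverted region."""
    return 0 <= start <= end


# Coordinates taken from the error message: start (37330220) > stop (37329220).
bad_start, bad_end = 37330220, 37329220

if not is_valid_region(bad_start, bad_end):
    print(f"skipping inverted region: start {bad_start} > stop {bad_end}")
```

A check like this only reports the symptom; it does not explain why the region ended up inverted in the first place.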

For the events that are 6Kb-9Kb, they are still running. Hopefully at the end of the day I'll see if the BIGDEL parameter helped with those.

RichardCorbett commented 4 years ago

For the events that were 6kb-9kb, using BIGDEL seems to have fixed the issue.

[image]

adamewing commented 4 years ago

OK, thanks for the analysis. The sizes come from the truth VCF, right? The intended behaviour is for addsv to switch to the "bigdel" method automatically when the input target is > 5kbp. Still unclear why you're hitting a limit, will investigate.
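
The size-based switch described here can be illustrated with a short sketch. It assumes a 5000 bp threshold, taken from the "under 5kbp" warning, and assumes that events of exactly 5000 bp count as "big"; the exact boundary handling in addsv.py is not confirmed in this thread, and `deletion_type_for` is a hypothetical helper.

```python
BIG_THRESHOLD_BP = 5000  # from the "under 5kbp" warning; boundary handling is an assumption


def deletion_type_for(start, end, threshold=BIG_THRESHOLD_BP):
    """Return "BIGDEL" for events at or above the threshold, otherwise "DEL"."""
    return "BIGDEL" if (end - start) >= threshold else "DEL"


# A 4000 bp event from the warnings earlier in the thread.
print(deletion_type_for(40543929, 40547929))
# An 8000 bp event from the example input.
print(deletion_type_for(5048109, 5056109))
```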

RichardCorbett commented 4 years ago

Yes, the SV sizes I am reporting come from the SVLEN tag in the VCF files that were created.
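
For reference, pulling SVLEN out of a VCF record can be sketched in plain Python as follows. The record below is made up for illustration and is not taken from a real bamsurgeon output file; `svlen_from_vcf_line` is a hypothetical helper. The absolute value is used because the sign convention for deletions in the output is an assumption here.

```python
def svlen_from_vcf_line(line):
    """Return the absolute SVLEN from a tab-separated VCF record, or None if absent."""
    fields = line.rstrip("\n").split("\t")
    info = fields[7]  # INFO is the eighth VCF column
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "SVLEN":
            # SVLEN may be comma-separated (one value per ALT); take the first.
            return abs(int(value.split(",")[0]))
    return None


# Hypothetical record for illustration only.
record = "1\t5048109\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=5050909;SVLEN=-2800"

target_size = 8000  # requested size of the event in bp
observed = svlen_from_vcf_line(record)
print(f"target {target_size} bp, observed {observed} bp")
```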