etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org
Other
502 stars 163 forks source link

Setting arbitrary bin size for WGS data #775

Open ashenflower opened 1 year ago

ashenflower commented 1 year ago

Hello,

I need to compare different CNV callers before the segmentation step, and I would like for all of them to produce results on bins having the same size. Is it possible to manually set a value for the bin size when using the -m wgs option?

I couldn't find it in the documentation.

Thanks in advance!

tetedange13 commented 1 year ago

Hi @ashenflower ,

Maybe you are looking for --target-avg-size and --antitarget-avg-size parameters of batch subcommand ? (see cnvkit.py batch --help)

Hope this helps ! Have a nice day, Felix.

ashenflower commented 1 year ago

Hello @tetedange13 , thank you very much for your reply! That's what I was looking for (I had seen that parameter before but I had misunderstood its usage).

But another question raised in my mind after setting the --target-avg-size equal to 1000 and running the batch pipeline. If I am correct, the .cnr file in output should report the log2 ratio for each bin, and I wanted to apply the CN calling directly on this file. However, it seems that not all the bins are listed in the file (for e.g. for some chromosomes bins are not starting from 0). Is that correct? Why is that so?

Also, in my genomic.target.bed file, several sequences are missing, even if they are listed between the 'Accessible regions' in the log file, and I don't undersand why.

Thank you so much for your help!

ashenflower commented 1 year ago

This is the log file I get by applying CNVkit on an old reference:

WGS protocol: recommend '--annotate' option (e.g. refFlat.txt) to help locate genes in output files.
NC_000001.10: Scanning for accessible regions
        Accessible region NC_000001.10:10000-177417 (size 167417)
        Accessible region NC_000001.10:227417-267719 (size 40302)
        Accessible region NC_000001.10:317719-471368 (size 153649)
        Accessible region NC_000001.10:521368-2634220 (size 2112852)
        Accessible region NC_000001.10:2684220-3845268 (size 1161048)
        Accessible region NC_000001.10:3995268-13052998 (size 9057730)
        Accessible region NC_000001.10:13102998-13219912 (size 116914)
        Accessible region NC_000001.10:13319912-13557162 (size 237250)
        Accessible region NC_000001.10:13607162-17125658 (size 3518496)
        Accessible region NC_000001.10:17175658-29878082 (size 12702424)
        Accessible region NC_000001.10:30028082-103863906 (size 73835824)
        Accessible region NC_000001.10:103913906-120697156 (size 16783250)
        Accessible region NC_000001.10:120747156-120936695 (size 189539)
        Accessible region NC_000001.10:121086695-121485434 (size 398739)
        Accessible region NC_000001.10:142535434-142731022 (size 195588)
        Accessible region NC_000001.10:142781022-142967761 (size 186739)
        Accessible region NC_000001.10:143117761-143292816 (size 175055)
        Accessible region NC_000001.10:143342816-143544525 (size 201709)
        Accessible region NC_000001.10:143644525-143771002 (size 126477)
        Accessible region NC_000001.10:143871002-144095783 (size 224781)
        Accessible region NC_000001.10:144145783-144224481 (size 78698)
        Accessible region NC_000001.10:144274481-144401744 (size 127263)
        Accessible region NC_000001.10:144451744-144622413 (size 170669)
        Accessible region NC_000001.10:144672413-144710724 (size 38311)
        Accessible region NC_000001.10:144810724-145833118 (size 1022394)
        Accessible region NC_000001.10:145883118-146164650 (size 281532)
        Accessible region NC_000001.10:146214650-146253299 (size 38649)
        Accessible region NC_000001.10:146303299-148026038 (size 1722739)
        Accessible region NC_000001.10:148176038-148361358 (size 185320)
        Accessible region NC_000001.10:148511358-148684147 (size 172789)
        Accessible region NC_000001.10:148734147-148954460 (size 220313)
        Accessible region NC_000001.10:149004460-149459645 (size 455185)
        Accessible region NC_000001.10:149509645-205922707 (size 56413062)
        Accessible region NC_000001.10:206072707-206332221 (size 259514)
        Accessible region NC_000001.10:206482221-223747846 (size 17265625)
        Accessible region NC_000001.10:223797846-235192211 (size 11394365)
        Accessible region NC_000001.10:235242211-248908210 (size 13665999)
        Accessible region NC_000001.10:249058210-249240621 (size 182411)
NT_113878.1: Scanning for accessible regions
        Accessible region NT_113878.1:0-106433 (size 106433)
NT_167207.1: Scanning for accessible regions
        Accessible region NT_167207.1:0-547496 (size 547496)
NC_000002.11: Scanning for accessible regions
        Accessible region NC_000002.11:10000-3529312 (size 3519312)
        Accessible region NC_000002.11:3579312-5018788 (size 1439476)
        Accessible region NC_000002.11:5118788-16279724 (size 11160936)
        Accessible region NC_000002.11:16329724-21153113 (size 4823389)
        Accessible region NC_000002.11:21178113-31705550 (size 10527437)
        Accessible region NC_000002.11:31705551-31725939 (size 20388)
        Accessible region NC_000002.11:31726790-31816827 (size 90037)
        Accessible region NC_000002.11:31816828-31816854 (size 26)
        Accessible region NC_000002.11:31816855-31816858 (size 3)
        Accessible region NC_000002.11:31816859-33092197 (size 1275338)
        Accessible region NC_000002.11:33093197-33141692 (size 48495)
        Accessible region NC_000002.11:33142692-87668206 (size 54525514)
        Accessible region NC_000002.11:87718206-89630436 (size 1912230)
        Accessible region NC_000002.11:89830436-90321525 (size 491089)
        Accessible region NC_000002.11:90371525-90545103 (size 173578)
        Accessible region NC_000002.11:91595103-92326171 (size 731068)
        Accessible region NC_000002.11:95326171-110109337 (size 14783166)
        Accessible region NC_000002.11:110251337-149690582 (size 39439245)
        Accessible region NC_000002.11:149790582-234003741 (size 84213159)
        Accessible region NC_000002.11:234053741-239801978 (size 5748237)
        Accessible region NC_000002.11:239831978-240784132 (size 952154)
        Accessible region NC_000002.11:240809132-243102476 (size 2293344)
        Accessible region NC_000002.11:243152476-243189373 (size 36897)
NC_000003.11: Scanning for accessible regions
        Accessible region NC_000003.11:60000-17137943 (size 17077943)
NT_113878.1: Joining over small gaps
NT_167207.1: Joining over small gaps
Wrote GCF_000001405.25_GRCh37.p13_genomic.bed with 2 regions

So it seems to me that all sequences NC_000001.10, NT_113878.1, NT_167207.1, NC_000002.11 and NC_000003.11 have large accessible regions, but in the end only NT_113878.1 and NT_167207.1 got in my target.bed file.