arq5x / bedtools

A powerful toolset for genome arithmetic.
http://code.google.com/p/bedtools/
GNU General Public License v2.0
140 stars 85 forks source link

Enhancement: subtractBed: Add switch to allow -a file to be read in memory and -b from disk #92

Open ghuls opened 10 years ago

ghuls commented 10 years ago

Most of the time that I want to use subtractBed, the b file is very big (e.g.: 2 GiB: 66 million lines) as it contains all known snp positions.

This command is then very slow (file a contains only 10000 positions) and sometimes is unable to run as there is not enough memory to load the b file in memory:

subtractBed -a sample_snps.bed -b allknownsnps.bed

This is the error I get when I run it on a machine with 32GiB of RAM:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

When running the command with strace:

strace -o subtractBed.strace subtractBed -a sample_snps.bed -b allknownsnps.bed

This is the end of the log:

read(3, "17\t67979430\t67979431\tG\tA\tPASS\nch"..., 8191) = 8191
brk(0x6a65bf000)                        = 0x6a65bf000
read(3, "7985837\tG\tC\tMinDP\nchr17\t67985847"..., 8191) = 8191
brk(0x6a65e7000)                        = 0x6a65e7000
read(3, "\tA\tG\tPASS\nchr17\t67992208\t6799220"..., 8191) = 8191
read(3, "\nchr17\t67999172\t67999173\tT\tG\tPAS"..., 8191) = 8191
brk(0x6a660b000)                        = 0x6a65e7000
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
write(2, "terminate called after throwing "..., 48) = 48
write(2, "std::bad_alloc", 14)          = 14
write(2, "'\n", 2)                      = 2
write(2, "  what():  ", 11)             = 11
write(2, "std::bad_alloc", 14)          = 14
write(2, "\n", 1)                       = 1
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
gettid()                                = 26915
tgkill(26915, 26915, SIGABRT)           = 0
--- SIGABRT (Aborted) @ 0 (0) ---
+++ killed by SIGABRT (core dumped) +++
# Line number lose to where the PC runs out of memory: 58469306
$ grep -n -m1 '^chr17'$'\t''67999172' allknownsnps.bed
58469306:chr17  67999172    67999173    T   G   PASS

# Total number of lines in the file: 66007044
$ wc -l allknownsnps.bed
66007044 

It would be nice to use subtractBed like this:

subtractBed -loada -a sample_snps.bed -b allknownsnps.bed

Where -loada loads file a in memory and b from disk. So if file b is read line by line, it just needs to remove all entries from file a (that is loaded in memory) that are found in file b.

I know it is possible to mimic this behaviour with, the following, but it would be much easier if it was implemented in subtractBed directly:

intersectBed -wb -a allknownsnps.bed -b sample_snps.bed | subtractBed -a sample_snps.bed -b stdin
ghuls commented 10 years ago

This is probably technically more correct (no -wb) as this allows to still use additional options for subtractBed:

intersectBed -a allknownsnps.bed -b sample_snps.bed | subtractBed -a sample_snps.bed -b stdin