Most of the time when I want to use subtractBed, the b file is very big (e.g. 2 GiB, 66 million lines) because it contains all known SNP positions.
The command is then very slow, even though file a contains only 10000 positions, and sometimes it cannot run at all because there is not enough memory to load the b file:
subtractBed -a sample_snps.bed -b allknownsnps.bed
This is the error I get when I run it on a machine with 32GiB of RAM:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
When running the command with strace:
strace -o subtractBed.strace subtractBed -a sample_snps.bed -b allknownsnps.bed
# Line number close to where the PC runs out of memory: 58469306
$ grep -n -m1 '^chr17'$'\t''67999172' allknownsnps.bed
58469306:chr17 67999172 67999173 T G PASS
# Total number of lines in the file: 66007044
$ wc -l allknownsnps.bed
66007044
It would be nice to use subtractBed like this:
subtractBed -loada -a sample_snps.bed -b allknownsnps.bed
Here -loada would load file a into memory and read file b from disk.
Since file b would then be read line by line, subtractBed would only need to remove from file a (held in memory) every entry that is also found in file b.
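To illustrate the idea, here is a rough sketch in shell/awk of the behaviour I have in mind. It is not how subtractBed works internally, and subtract_small_a is a made-up helper name. It assumes both files are tab-separated and that every record is a single-base SNP, so that "overlap" can be reduced to an exact match on chromosome, start and end; real interval subtraction would need more than this.

```shell
# Hypothetical helper (not part of BEDTools): keep the small file a in
# memory and stream the large file b once, line by line.
# Assumption: single-base, tab-separated records, so two records overlap
# exactly when their first three columns are identical.
subtract_small_a() {
    # $1 = small a file (held in memory), $2 = large b file (streamed)
    awk -F'\t' '
        FILENAME == ARGV[1] {          # first file: remember every record
            n++
            line[n] = $0
            key[n] = $1 FS $2 FS $3
            present[key[n]] = 1
            next
        }
        { delete present[$1 FS $2 FS $3] }   # second file: drop matches
        END {                          # print surviving a records in input order
            for (i = 1; i <= n; i++)
                if (key[i] in present) print line[i]
        }
    ' "$1" "$2"
}
```

It would be called as subtract_small_a sample_snps.bed allknownsnps.bed. Memory use depends only on the size of file a, because each line of file b is discarded as soon as it has been checked.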
I know it is possible to mimic this behaviour with the following command, but it would be much easier if it were implemented in subtractBed directly:
intersectBed -wb -a allknownsnps.bed -b sample_snps.bed | subtractBed -a sample_snps.bed -b stdin