GMOD / tabix-js

Read Tabix-indexed files, either with .tbi or .csi indexes, in node or the browser
MIT License

browser can run out of memory for large query ranges #118

Closed by jrobinso 3 years ago

jrobinso commented 3 years ago

Hi all, sorry I don't have time for a proper pull request but I thought I would alert you to this. I had a report of CSI indexes "not working" in igv.js. The user was querying over a range spanning entire chromosomes. I traced the problem to reg2bins, specifically this line

for (let i = b; i <= e; i += 1) bins.push(i)

The problem is that the range b-e can be quite large, so pushing every bin number builds a huge array. I fixed it in igv.js by pushing the range, rather than each individual bin

bins.push([b, e])

and modifying the use of "reg2bins" accordingly


const overlappingBins = this.reg2bins(min, max);
for (const binRange of overlappingBins) {
    for (let bin = binRange[0]; bin <= binRange[1]; bin++) {
        // ... look up the chunks for this bin, as before
    }
}

I don't use GMOD/tabix-js (yet), although I might in the future to reduce code duplication, but I did look at your implementation to see if it also suffered from this problem (it does). So this is just a heads up, do with it as you please. Sorry again I don't have time for a proper PR but I am slammed as always.
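For readers who want to see the idea end to end, here is a minimal sketch of a range-returning version of the binning function. It is modeled on the reg2bins routine in the CSI specification, not on the actual igv.js or tabix-js code. The name reg2binRanges is hypothetical, the defaults minShift = 14 and depth = 5 are assumed (the standard tabix-style scheme), and coordinates are assumed to be zero-based and half-open.

```javascript
// Hypothetical helper: returns one [firstBin, lastBin] pair per level
// instead of pushing every individual bin number into an array.
// Assumes a zero-based, half-open query region [beg, end).
function reg2binRanges(beg, end, minShift = 14, depth = 5) {
  const last = end - 1 // inclusive last base of the query
  const ranges = []
  let shift = minShift + depth * 3 // log2 of the bin size at the top level
  let offset = 0 // number of the first bin at the current level
  for (let level = 0; level <= depth; level += 1) {
    // Math.floor with division avoids 32-bit overflow from the >> operator
    const b = offset + Math.floor(beg / 2 ** shift)
    const e = offset + Math.floor(last / 2 ** shift)
    ranges.push([b, e])
    offset += 2 ** (level * 3) // each level has 8x as many bins as the one above
    shift -= 3
  }
  return ranges
}

// A whole-chromosome query still yields only depth + 1 pairs.
const ranges = reg2binRanges(0, 2 ** 28)
console.log(ranges.length) // 6
```

The memory cost is then fixed at depth + 1 small arrays regardless of query size; the caller walks each range lazily, as in the loop above.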

cmdcolin commented 3 years ago

@jrobinso do you find that using this results in any slowness?

For example, the list of blocks that can be returned is discontinuous like

[ 0, 1, 9, 73, 585, 4681 ]

Is it helpful to just fetch every block from [0,4681]?
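For context, the numbers in that list are the first bin number at each level of the binning hierarchy, which is why they are discontinuous: a small query near the start of a sequence touches exactly one bin per level. The sketch below reproduces them, assuming the standard six-level scheme (depth = 5) where each level has eight times as many bins as the one above. The function name firstBinPerLevel is hypothetical and exists only for illustration.

```javascript
// Hypothetical illustration: the first bin number at each level of the
// hierarchy, assuming each level holds 8x as many bins as its parent.
function firstBinPerLevel(depth = 5) {
  const offsets = []
  let offset = 0
  for (let level = 0; level <= depth; level += 1) {
    offsets.push(offset)
    offset += 8 ** level // skip past all bins on this level
  }
  return offsets
}

console.log(firstBinPerLevel()) // [ 0, 1, 9, 73, 585, 4681 ]
```

So a range such as [0, 4681] would mix bins from different levels, whereas one range per level keeps each level's bins separate.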

jrobinso commented 3 years ago

@cmdcolin I think you are referring to the "bins" (which really should be called bin numbers), not the physical blocks on disk. I'm not looking at the code right now, but bins get mapped to physical blocks, and then there is generally an optimization step where adjacent or nearby blocks are merged. In the end there is usually a single block or a small number of blocks.
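The merge step mentioned above can be sketched roughly as follows. This is an illustration only, not the igv.js or tabix-js implementation: the function name mergeNearbyChunks, the { start, end } chunk shape, and the maxGap threshold are all assumptions, and plain byte offsets are used here in place of the virtual file offsets a real index stores.

```javascript
// Hypothetical sketch: merge chunks whose byte ranges touch or lie within
// maxGap bytes of each other, so fewer reads are issued.
// Chunks are assumed to be { start, end } plain byte offsets.
function mergeNearbyChunks(chunks, maxGap = 65536) {
  // Sort a copy by start offset so neighbors can be compared in one pass
  const sorted = [...chunks].sort((a, b) => a.start - b.start)
  const merged = []
  for (const chunk of sorted) {
    const last = merged[merged.length - 1]
    if (last && chunk.start - last.end <= maxGap) {
      // Close enough to the previous chunk: extend it instead of adding a new one
      last.end = Math.max(last.end, chunk.end)
    } else {
      merged.push({ start: chunk.start, end: chunk.end })
    }
  }
  return merged
}

const merged = mergeNearbyChunks(
  [
    { start: 150, end: 300 },
    { start: 0, end: 100 },
    { start: 1000000, end: 1000100 },
  ],
  1000,
)
console.log(merged.length) // 2
```

Many bins typically collapse into very few reads this way, which is why iterating over large bin ranges does not translate into a large number of fetches.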

cmdcolin commented 3 years ago

ah I gotcha

my thought was that you suggested returning only a single bin range but it looks like your code would return a list of ranges

great to know!