4dn-dcic / pairix

1D/2D indexing and querying on bgzipped text file with a pair of genomic coordinates
MIT License

Fetching small regions can be very slow #59

Closed. mimakaev closed this issue 4 years ago

mimakaev commented 5 years ago

Currently, even with the smallest .pairs files (~100 MB), it takes 0.3 seconds to fetch a small region (two lines) from a file with 23 chromosomes, and 0.6 seconds to fetch it from a file with the full hg38 set of 455 chromosomes. It takes up to a second for larger files (e.g. Bonev et al.), even when the result is just a few lines or is empty.

For comparison, if the file only has one chromosome, a fetch takes 0.007 seconds, which is 100 times faster!

Basically, there is a flat cost for fetching a region of any size, and sometimes it is pretty steep.

This can interfere with analyses that use pairix to iterate over pairs of chromosomes, especially if someone forgot to trim the extra contigs from hg38 and started working with the whole file with 455 chromosomes. Trying to fetch all chromosome pairs would add 455 × 455 × 0.6 s ≈ 34 hours of overhead. In that time, for example, one can sort, ingest, and process the largest existing pairs file several times.

One possible solution would be to provide an interactive library that pre-loads the pairix index into memory, parses it as needed, and then does very quick fetches. Working with such a library interactively would speed up the development of several tools, such as "cooler cload pairix". A sketch of the intended usage is below.
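Roughly, something like this (just a sketch; the file name and chromosome list are placeholders, and it assumes a pypairix-like API with open() and querys2D()):

import itertools
import pypairix  # assumed Python binding exposing open() and querys2D()

# Placeholder inputs: a pairix-indexed .pairs file and its chromosome names.
pairs_path = "mylib.pairs.gz"
chroms = ["chr1", "chr2", "chr3"]  # in reality, all 455 hg38 contigs

# Parse the index once...
px = pypairix.open(pairs_path)

# ...so that each per-chromosome-pair fetch avoids the ~0.3-0.6 s flat cost.
for c1, c2 in itertools.combinations_with_replacement(chroms, 2):
    # Whole-chromosome-pair query string; if bare names are not accepted,
    # explicit ranges like "chr1:1-250000000|chr2" could be used instead.
    for record in px.querys2D("{}|{}".format(c1, c2)):
        pass  # feed the records into e.g. cooler cload pairix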

SooLee commented 5 years ago

@mimakaev Good idea. We'll look into it very soon.

SooLee commented 5 years ago

@mimakaev Actually it would be helpful if you could provide a bit more detail. (I know it's been a while.) Here's what we currently have (already implemented as of 0.3.6 / 0.3.7).

When you measured the speed, did you use either of these usages? Or did you simply repeat single queries with pairix <pairsfile> <query>? If the latter, it would be difficult to keep the index in memory and transfer it between commands. If the former, we could improve the speed in another way - probably reducing the time for each query at the cost of more time for loading the index, so it would hurt the performance of a single-query pairix command but may help with a large number of batch queries.

mimakaev commented 5 years ago

I actually just used the command line tool. I did not test pypairix - will give it a try.

SooLee commented 5 years ago

@mimakaev You can also use pairix -L <pairsfile> <coordsfile1> <coordsfile2> ... for batch runs (note the positions of the files). The index will be loaded only once.

e.g.

$ pairix -L samples/test_4dn.pairs.gz samples/test.regions  samples/test.regions2
SRR1658581.5372838  chr1    39498   chr15   101996781   -   -
SRR1658581.23133649 chr1    14617   chr19   6676491 -   +
SRR1658581.22713561 chr1    14477   chrX    107423076   -   +
SRR1658581.42667491 chrX    108353091   chrY    2961555 -   -
SRR1658581.55524746 chr19   105058  chr19   105558  +   -
$ cat samples/test.regions
chr1:1-50000|*
*|chr1:1-50000
chr2:1-20000|*
*|chr2:1-20000
$ cat samples/test.regions2
chrX:100000000-110000000|chrY
chr19:1-300000|chr19
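For reference, the same kind of batch run can be done from Python, where the index is likewise loaded only once per open(). A minimal sketch, assuming the pypairix module's open() and querys2D() and reusing the sample files above (wildcard regions like those in samples/test.regions may need extra handling):

import pypairix  # Python binding; the index is parsed once per open()

# Open the indexed sample pairs file once.
px = pypairix.open("samples/test_4dn.pairs.gz")

# Read 2D query strings (same format as the coordinate files used with -L)
# and run them against the already-loaded index.
with open("samples/test.regions2") as f:
    queries = [line.strip() for line in f if line.strip()]

for q in queries:
    # querys2D() is assumed to yield one list of fields per matching pair line.
    for record in px.querys2D(q):
        print("\t".join(record))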