Closed mimakaev closed 4 years ago
@mimakaev Good idea. We'll look into it very soon.
@mimakaev Actually it would be helpful if you could provide a bit more details. (I know it's been a while.) Here's what we currently have (already implemented as of 0.3.6 / 0.3.7).
pypairix
: pypairix.open
pre-loads the index into the memory and each query uses this pre-loaded index.pairix
: multiple queries are possible with a single command, in which case the index will be loaded only once and used for each of the queries.
e.g. pairix <pairsfile> 'chr21|chr22:1-20000000' 'chr21|chr22:20000000-40000000' 'chr21|chr22:40000000-80000000'
When you measured the speed, did you use either of these usages? Or did you simply repeat single queries with pairix <pairsfile> <query>
? If the latter, it would be difficult to store an index into a memory and transfer it between commands. If the former, we could improve the speed in another way - probably reduce time for each query but more time for loading index, so it would hurt the performance for a single-query pairix
command but may help with a large number of batch queries.
I actually just used the command line tool. I did not test pypairix - will give it a try.
@mimakaev You can also use pairix -L <pairsfile> <coordsfile1> <coordsfile2> ...
for batch runs (note the positions of the files). The index will be loaded only once.
e.g.
$ pairix -L samples/test_4dn.pairs.gz samples/test.regions samples/test.regions2
SRR1658581.5372838 chr1 39498 chr15 101996781 - -
SRR1658581.23133649 chr1 14617 chr19 6676491 - +
SRR1658581.22713561 chr1 14477 chrX 107423076 - +
SRR1658581.42667491 chrX 108353091 chrY 2961555 - -
SRR1658581.55524746 chr19 105058 chr19 105558 + -
$ cat samples/test.regions
chr1:1-50000|*
*|chr1:1-50000
chr2:1-20000|*
*|chr2:1-20000
$ cat samples/test.regions2
chrX:100000000-110000000|chrY
chr19:1-300000|chr19
Currently, even with a smallest .pairs file (~100MB), it takes 0.3 seconds to fetch a little region (two lines) from the file with 23 chromosomes, and 0.6 seconds to fetch it from a file with full hg38 set of 455 chromosomes. It takes up to a second for larger files (e.g. Bonev et al), even when the result is just a several lines or is empty.
For comparison, if the file only has one chromosome, a fetch takes 0.007 seconds, which is 100 times faster!
Basically, there is a flat price of fetching a region of any size, and sometimes it is pretty steep.
This can interfere with analyses that use pairix to iterate over pairs of chromosomes. Especially if someone forgot to trim down all the contigs from hg38, and started working with the whole file with 455 chromosomes. Trying to fetch all chromosome pairs would take 455 455 0.6 = 34 hours of overhead. In that time, for example, one can sort, ingest, and process a largest existing pairs file several times.
One possible solution would be to provide an interactive library that pre-loads the pairix index into memory, parses it as needed, and then does a very quick fetch. Working with this library interactively would speed up development of several algorithms, such as "cooler cload pairix".