databio / AIList

Augmented interval list (AIList): a novel data structure for efficient genomic interval search
http://ailist.databio.org
GNU General Public License v2.0
21 stars 5 forks source link

Augmented Interval List: a novel data structure for efficient genomic interval search

Updates

Augmented interval list (AIList) is a new data structure for enumerating intersections between a query interval q and an interval set R. It is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few flattened components (sublists), and then augmenting each sublist with the running maximum interval end. It nearly minimizes the search space with ignorable additional cost, and thus optimizes the interval search. A preliminary implementation of AIList showed that AIList is several times faster than interval tree, NCList and BEDTools, see ailist_paper. The complete documentation can be found at ailist_doc.

Since the first release of our ailist_paper, a new representative implementation of the ITree is developed by Dr Heng Li (Li's ITree or LITree), which showed significant improvement over other implementations of the ITree and it achieves similar efficiency of the original implementation of AIList on a few datasets. Here we re-implemented AIList by following the superior treatments of LITree on sorting and data loading (for which we are grateful to Dr Li) and updated the results for many more datasets. We also analyzed the results with emphasis on the understanding of the underline data structure differences among AIList, NCList, and ITrees.

Test results

Nine datasets in the following table are used for the test. The first 7 datasets are used in our original paper and the last two are used by LITree. They are all available here. A flat interval list (null interval list) can be treated as a simple list, so only a sinlge binary search is needed for a query. Dataset 1 is flat (1 total list with no sublist) and datasets 2 and 7 are near flat (only one sublist).

Dataset# Name(.bed) size(x1000) non-flatness
0 chainRn4 2,351 6
1 fBrain 199 1
2 exons 439 2
3 chainOrnAna1 1,957 6
4 chainVicPac2 7,684 8
5 chainXenTro3Link 50,981 7
6 chainMonDom5Link 128,187 7
7 ex-anno 1,194 2
8 ex-rna 9,945 7

The re-implemented AIList has the same size of core data members as LITree and the time for the construction and result output of AIList is almost the same as that of the LITree, so the difference is on the query speed, which is the purpose of this comparison.

The command for AIList:

time ailist dataset1.bed dataset2.bed > /dev/null

For cgranges (LITree):

time bedcov-cr dataset1.bed dataset2.bed -c >/dev/null

The first BED file is loaded into RAM and constructed as a 'database', intervals in the second file are used as queries against the 'database' as shown here for columns 2 and 3; columns 4 and 5 reverse the order of the two datasets.

Datasets AIList LITree AIList,r LITree,r
0, 1 0.441s 0.440s 0.774s 0.882s
0, 2 0.538s 0.595s 0.805s 0.940s
0, 3 2.124s 3.352s 2.598s 5.777s
0, 4 7.512s 14.102s 6.573s 11.430
0, 5 26.499s 43.498s 39.562s 1m10.207s
0, 6 50.551s 1m14.538s 1m20.744s 2m24.714s
0, 7 0.955s 1.089s 0.939s 1.052s
0, 8 3.382s 4.006s 9.430s 9.867s
7, 8 6.868s 6.967s 3.198s 3.848s
3, 6 2m20.725s 5m36.132s 1m17.778s 2m7.386s

Understanding the underline data structures of AIList and LITree

Use ailist in python

A Python wrapper of the AIList c code is included in folder src_py and the search is very fast (close to AIList c code). To install after cloning this repo:

cd AIList/src_py
python setup.py install

A small python program ailist_test.py is included, which can be used this way:

python ailist_test.py test1.bed test2.bed