ga4gh / ga4gh-server

Reference implementation of the APIs defined in ga4gh-schemas. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

LRU cache for file handles #441

Closed jeromekelleher closed 9 years ago

jeromekelleher commented 9 years ago

Currently, the reference server opens an htslib-backed file handle for each FASTA, VCF and BAM file that it serves on startup. When an incoming request is processed, the queries are run on the relevant files using pysam. This has the advantage of simplicity, and means that we don't need to open the file each time a request comes in (which would be prohibitively expensive). However, this approach scales poorly with increasing numbers of VCF and BAM files: (1) server startup becomes very slow; (2) we use a lot of memory keeping open files that we don't need; (3) we start hitting OS limits on the number of open file descriptors.

A good compromise between keeping every file open all the time and opening a file each time we receive a query is to maintain a least recently used cache of data file handles. We would have a configurable limit on the number of concurrently open data files, and each time we receive a request we would ask a central cache to open a file for us. If the file is already open, the cache returns its handle; if not, it opens the file for us, possibly closing another file first.

Possible implementation

    def getFileHandle(self, url, openMethod):
        # The cache lookup and eviction logic would go here; this stub simply opens the file.
        return openMethod(url)

This is only a suggested implementation, and may well change considerably as the actual coding is done.
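
For example, something along these lines (just a sketch to illustrate the idea - the class name, method names and default limit are all made up and will almost certainly change):

    import itertools

    class FileHandleCache(object):
        # Illustrative only: keep at most maxFiles handles open, and close the
        # least recently used handle when a new file has to be opened.
        def __init__(self, maxFiles=64):
            self._maxFiles = maxFiles
            self._counter = itertools.count()
            self._handles = {}   # url -> open file handle
            self._lastUsed = {}  # url -> last access order

        def getFileHandle(self, url, openMethod):
            if url not in self._handles:
                if len(self._handles) >= self._maxFiles:
                    # Evict the least recently used handle and close it.
                    lruUrl = min(self._lastUsed, key=self._lastUsed.get)
                    self._handles.pop(lruUrl).close()
                    del self._lastUsed[lruUrl]
                self._handles[url] = openMethod(url)
            self._lastUsed[url] = next(self._counter)
            return self._handles[url]

A variant set or read group set would then get its handle via something like fileHandleCache.getFileHandle(dataUrl, pysam.VariantFile) or fileHandleCache.getFileHandle(dataUrl, pysam.AlignmentFile), rather than opening the file at startup.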

Complications

dcolligan commented 9 years ago

Python 3 has a functools.lru_cache decorator that has been backported in various unofficial libraries: http://stackoverflow.com/questions/11815873/memoization-library-for-python-2-7
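
For example, something like this (sketch only, assuming the functools32 backport; on Python 3.2+ the decorator is in functools itself):

    # Sketch only: wrap the file-opening call in the backported decorator.
    try:
        from functools import lru_cache      # Python 3.2+
    except ImportError:
        from functools32 import lru_cache    # unofficial backport for Python 2.7

    @lru_cache(maxsize=64)  # the limit would come from serverconfig.py
    def getFileHandle(url, openMethod):
        # The file is opened only on a cache miss; repeated calls with the
        # same arguments return the handle that is already open.
        return openMethod(url)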

jeromekelleher commented 9 years ago

Very nice --- let's use that unless there's a good reason not to.

dcolligan commented 9 years ago

The size of the cache should also be a configurable variable in serverconfig.py

eelsther commented 9 years ago

Hi, I am Esther from LShift, working on this issue at the moment. The only drawback that I currently see in using the functools32 lru_cache function is that it is not obvious that the file is closed when the handle is removed from the cache - there is no way to add some kind of hook that runs when an item is deleted from the cache... Another thing is testing! Do you know how I could test the performance improvement? I guess I would need to have more data files in the data directory. Do you have any hints on where I could find more data files?

Thanks

Esther

jeromekelleher commented 9 years ago

@eelsther, great to hear you're working on this. To be honest, I'd be surprised if the functools32 cache ended up fitting our requirements exactly, so I wouldn't try too hard to make it fit. It should be quite easy to make the priority queue (or whatever we need) using something like the heapq module. So, whatever is easiest, I would say go for it.
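
Something like this rough sketch, maybe (illustrative only - stale heap entries are just skipped lazily when we evict):

    import heapq
    import itertools

    class FileHandleCache(object):
        # Rough heapq-based LRU sketch; names and details are illustrative.
        def __init__(self, maxFiles=64):
            self._maxFiles = maxFiles
            self._counter = itertools.count()
            self._heap = []       # (access order, url), may contain stale entries
            self._lastUsed = {}   # url -> most recent access order
            self._handles = {}    # url -> open file handle

        def getFileHandle(self, url, openMethod):
            if url not in self._handles:
                while len(self._handles) >= self._maxFiles:
                    accessOrder, lruUrl = heapq.heappop(self._heap)
                    # Only evict if this heap entry is still the current one.
                    if self._lastUsed.get(lruUrl) == accessOrder:
                        self._handles.pop(lruUrl).close()
                        del self._lastUsed[lruUrl]
                self._handles[url] = openMethod(url)
            accessOrder = next(self._counter)
            self._lastUsed[url] = accessOrder
            heapq.heappush(self._heap, (accessOrder, url))
            return self._handles[url]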

In terms of performance testing, I would just make a few thousand copies of the files we have for testing. Some data files (VCF and BAM) are kept in tests/data/dataset1. A good scenario would be having (say) a hundred datasets, each containing a hundred variantsets and readgroupsets. Once these have unique names, it makes no difference if they all contain identical data.

bioinformed commented 9 years ago

The other factor here is that we may need several LRU/LFU caches:

  1. Variant files, primarily to limit the number of open file handles.
  2. Variant index data. Indexes are read into memory and can consume a few hundred KB to a few MB each. Depending on OS file caching, re-reading an index can be much more expensive than re-opening a variant file.
  3. [Future] Variant file header information. Parsing variant file headers can be quite expensive, particularly if they contain extensive metadata. It is possible to cache the header and, when re-opening a variant file, skip it and seek straight to the start of the data blocks. I'm working on a "re-open" mode for pysam that will make this easy.

-Kevin

eelsther commented 9 years ago

Thanks for your comments. I have finally decided to use an ordered dictionary for the cache which works pretty well. The server startup is of course much faster now! I haven't pushed my changes yet, still testing...
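
Roughly, the approach looks something like this (a simplified sketch, not the exact code I have locally):

    import collections

    class FileHandleCache(object):
        # Simplified sketch; the real names and details may differ.
        def __init__(self, maxFiles=64):
            self._maxFiles = maxFiles
            self._handles = collections.OrderedDict()  # url -> handle, oldest first

        def getFileHandle(self, url, openMethod):
            if url in self._handles:
                # Cache hit: pop and re-insert to mark the handle as most
                # recently used (Python 2.7's OrderedDict has no move_to_end).
                handle = self._handles.pop(url)
            else:
                if len(self._handles) >= self._maxFiles:
                    # Evict and close the least recently used handle.
                    _, oldest = self._handles.popitem(last=False)
                    oldest.close()
                handle = openMethod(url)
            self._handles[url] = handle
            return handle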

I wanted to know if you have any FASTA files for references that would help me test the cache. Once I have the files I can easily figure out a query, but it would be great if you could suggest any queries you have in mind.

Thanks!

pgrosu commented 9 years ago

@eelsther An OrderedDict might become very slow - below is a link for a quick explanation:

http://stackoverflow.com/questions/18422995/why-is-ordereddict-10x-slower-than-dict-and-list

~p

jeromekelleher commented 9 years ago

Great news @eelsther! The simplest thing to do with FASTA files is probably just to generate some random ones using the script at https://github.com/ga4gh/server/blob/develop/scripts/generate_fasta.py

eelsther commented 9 years ago

Thank you @jeromekelleher, I'll have a look at the script.

@pgrosu: thanks for your comment, I wasn't aware of that. I don't think it's really surprising: the fact that it is ordered can indeed cause some slowness (and, as stated on Stack Overflow, so can the fact that it is written in Python rather than C), but at least retrieving the least recently used entry is simple and effective. I also tried a heapq structure coupled with a dictionary, but I don't think it is faster than an ordered dictionary and it adds more complexity to the code. I also looked at deque, which might be interesting, but I haven't tested it yet.

pgrosu commented 9 years ago

@eelsther I agree, and maybe a priority queue might also help, as @jeromekelleher suggested earlier :) You could even add a secondary lookup structure to drive the time complexity down to O(1), while performing the cleanup in the main structure as a background service - again, many options :)

eelsther commented 9 years ago

I finally decided to use a deque for the cache, which is indeed faster than an OrderedDict. Insertion and deletion at the ends are both O(1); updating an existing entry (moving it to the most recently used position) is O(n) in the worst case. I also use a dictionary to memoize the function that opens the file handles. When the dictionary reaches the configurable maximum size, the least recently used file is removed from the deque.
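
In outline, it works something like this (a simplified sketch, not the exact code):

    import collections

    class FileHandleCache(object):
        # Simplified sketch of the deque + dictionary approach.
        def __init__(self, maxSize=64):
            self._maxSize = maxSize
            self._order = collections.deque()  # urls, least recently used on the left
            self._handles = {}                 # url -> open file handle

        def getFileHandle(self, url, openMethod):
            if url in self._handles:
                # O(n) in the worst case: move the url to the most recently used end.
                self._order.remove(url)
            else:
                if len(self._handles) >= self._maxSize:
                    # Evict the least recently used file and close its handle.
                    lruUrl = self._order.popleft()
                    self._handles.pop(lruUrl).close()
                self._handles[url] = openMethod(url)
            self._order.append(url)
            return self._handles[url]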

pgrosu commented 9 years ago

@eelsther That's fantastic news - thank you for helping us with this!

~p