ga4gh / ga4gh-server

Reference implementation of the APIs defined in ga4gh-schemas. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

LRU cache for file handles #441

Closed jeromekelleher closed 9 years ago

jeromekelleher commented 9 years ago

Currently, the reference server opens an htslib-backed file handle for each FASTA, VCF and BAM file that it serves on startup. When an incoming request is processed, the queries are run on the relevant files using pysam. This has the advantage of simplicity, and means that we don't need to open the file each time a request comes in (which would be prohibitively expensive). However, this approach scales poorly with increasing numbers of VCF and BAM files: (1) server startup becomes very slow; (2) we use a lot of memory keeping open files that we don't need; (3) we start hitting OS limits on the number of open file descriptors.

A good compromise between keeping every file open all the time and opening a file each time we receive a query is to maintain a least recently used cache of data file handles. We would have a configurable limit on the number of concurrently open data files, and each time we receive a request we would ask a central cache to open a file for us. If the file is already open, the cache returns its handle; if not, it opens the file for us, possibly closing another file first.

Possible implementation

    def getFileHandle(self, url, openMethod):
        # The cache lookup and eviction logic would go here; this stub simply opens the file.
        return openMethod(url)

This is only a suggested implementation, and may well change considerably as the actual coding is done.
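
For example, something along these lines (just a sketch to illustrate the idea - the class name, method names and default limit are all made up and will almost certainly change):

    import itertools

    class FileHandleCache(object):
        # Illustrative only: keep at most maxFiles handles open, and close the
        # least recently used handle when a new file has to be opened.
        def __init__(self, maxFiles=64):
            self._maxFiles = maxFiles
            self._counter = itertools.count()
            self._handles = {}   # url -> open file handle
            self._lastUsed = {}  # url -> last access order

        def getFileHandle(self, url, openMethod):
            if url not in self._handles:
                if len(self._handles) >= self._maxFiles:
                    # Evict the least recently used handle and close it.
                    lruUrl = min(self._lastUsed, key=self._lastUsed.get)
                    self._handles.pop(lruUrl).close()
                    del self._lastUsed[lruUrl]
                self._handles[url] = openMethod(url)
            self._lastUsed[url] = next(self._counter)
            return self._handles[url]

A variant set or read group set would then get its handle via something like fileHandleCache.getFileHandle(dataUrl, pysam.VariantFile) or fileHandleCache.getFileHandle(dataUrl, pysam.AlignmentFile), rather than opening the file at startup.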

Complications

dcolligan commented 9 years ago

Python 3 has a functools.lru_cache decorator that has been backported in various unofficial libraries: http://stackoverflow.com/questions/11815873/memoization-library-for-python-2-7
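
For example, something like this (sketch only, assuming the functools32 backport; on Python 3.2+ the decorator is in functools itself):

    # Sketch only: wrap the file-opening call in the backported decorator.
    try:
        from functools import lru_cache      # Python 3.2+
    except ImportError:
        from functools32 import lru_cache    # unofficial backport for Python 2.7

    @lru_cache(maxsize=64)  # the limit would come from serverconfig.py
    def getFileHandle(url, openMethod):
        # The file is opened only on a cache miss; repeated calls with the
        # same arguments return the handle that is already open.
        return openMethod(url)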

jeromekelleher commented 9 years ago

Very nice --- let's use that unless there's a good reason not to.

dcolligan commented 9 years ago

The size of the cache should also be a configurable variable in serverconfig.py

eelsther commented 9 years ago

Hi, I am Esther from LShift, working on this issue at the moment. The only drawback that I currently see in using the functools32 lru_cache function is that it is not obvious that the file is closed when the handle is removed from the cache - there is no way to add some kind of hook that runs when an item is deleted from the cache... Another thing is testing! Do you know how I could test the performance improvement? I guess I would need to have more data files in the data directory. Do you have any hints on where I could find more data files?

Thanks

Esther

jeromekelleher commented 9 years ago

@eelsther, great to hear you're working on this. To be honest, I'd be surprised if the functools32 cache ended up fitting our requirements exactly, so I wouldn't try too hard to make it fit. It should be quite easy to make the priority queue (or whatever we need) using something like the heapq module. So, whatever is easiest, I would say go for it.
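
Something like this rough sketch, maybe (illustrative only - stale heap entries are just skipped lazily when we evict):

    import heapq
    import itertools

    class FileHandleCache(object):
        # Rough heapq-based LRU sketch; names and details are illustrative.
        def __init__(self, maxFiles=64):
            self._maxFiles = maxFiles
            self._counter = itertools.count()
            self._heap = []       # (access order, url), may contain stale entries
            self._lastUsed = {}   # url -> most recent access order
            self._handles = {}    # url -> open file handle

        def getFileHandle(self, url, openMethod):
            if url not in self._handles:
                while len(self._handles) >= self._maxFiles:
                    accessOrder, lruUrl = heapq.heappop(self._heap)
                    # Only evict if this heap entry is still the current one.
                    if self._lastUsed.get(lruUrl) == accessOrder:
                        self._handles.pop(lruUrl).close()
                        del self._lastUsed[lruUrl]
                self._handles[url] = openMethod(url)
            accessOrder = next(self._counter)
            self._lastUsed[url] = accessOrder
            heapq.heappush(self._heap, (accessOrder, url))
            return self._handles[url]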

In terms of performance testing, I would just make a few thousand copies of the files we have for testing. Some data files (VCF and BAM) are kept in tests/data/dataset1. A good scenario would be having (say) a hundred datasets, each containing a hundred variantsets and readgroupsets. Once these have unique names, it makes no difference if they all contain identical data.

bioinformed commented 9 years ago

The other factor here is that we may need several LRU/LFU caches:

  1. Variant files, primarily to limit the number of open file handles.
  2. Variant index data. Indexes are read into memory and can consume a few hundred KB to a few MB each. Depending on OS file caching, re-reading an index can be much more expensive than re-opening a variant file.
  3. [Future] Variant file header information. Parsing variant file headers can be quite expensive, particularly if they contain extensive metadata. It is possible to cache the header and, when re-opening a variant file, skip it and seek straight to the start of the data blocks. I'm working on a "re-open" mode for pysam that will make this easy.

-Kevin

eelsther commented 9 years ago

Thanks for your comments. I have finally decided to use an ordered dictionary for the cache which works pretty well. The server startup is of course much faster now! I haven't pushed my changes yet, still testing...
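
Roughly, the approach looks something like this (a simplified sketch, not the exact code I have locally):

    import collections

    class FileHandleCache(object):
        # Simplified sketch; the real names and details may differ.
        def __init__(self, maxFiles=64):
            self._maxFiles = maxFiles
            self._handles = collections.OrderedDict()  # url -> handle, oldest first

        def getFileHandle(self, url, openMethod):
            if url in self._handles:
                # Cache hit: pop and re-insert to mark the handle as most
                # recently used (Python 2.7's OrderedDict has no move_to_end).
                handle = self._handles.pop(url)
            else:
                if len(self._handles) >= self._maxFiles:
                    # Evict and close the least recently used handle.
                    _, oldest = self._handles.popitem(last=False)
                    oldest.close()
                handle = openMethod(url)
            self._handles[url] = handle
            return handle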

I wanted to know if you have any FASTA files for references that would help me test the cache. Once I have the files I can easily figure out a query, but it would be great if you could suggest any queries you have in mind.

Thanks!

pgrosu commented 9 years ago

@eelsther An OrderedDict might become very slow - below is a link for a quick explanation:

http://stackoverflow.com/questions/18422995/why-is-ordereddict-10x-slower-than-dict-and-list

~p

jeromekelleher commented 9 years ago

Great news @eelsther! The simplest thing to do with FASTA files is probably just to generate some random ones using the script at https://github.com/ga4gh/server/blob/develop/scripts/generate_fasta.py

eelsther commented 9 years ago

Thank you @jeromekelleher, I'll have a look at the script.

@pgrosu: thanks for your comment, I wasn't aware of that. I don't think it's really surprising: the fact that it is ordered can indeed cause some slowness (and, as stated on Stack Overflow, so can the fact that it is written in Python rather than C), but at least retrieving the least recently used entry is simple and effective. I also tried a heapq structure coupled with a dictionary, but I don't think it is faster than an ordered dictionary and it adds more complexity to the code. I also looked at deque, which might be interesting, but I haven't tested it yet.

pgrosu commented 9 years ago

@eelsther I agree, and maybe a priority queue might also help, as @jeromekelleher suggested earlier :) You could even add a secondary lookup structure to drive the time complexity down to O(1), while performing the cleanup in the main structure as a background service - again, many options :)

eelsther commented 9 years ago

I finally decided to use a deque for the cache, which is indeed faster than an OrderedDict. Insertion and deletion at the ends are both O(1); updating an existing entry (moving it to the most recently used position) is O(n) in the worst case. I also use a dictionary to memoize the function that opens the file handles. When the dictionary reaches the configurable maximum size, the least recently used file is removed from the deque.
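
In outline, it works something like this (a simplified sketch, not the exact code):

    import collections

    class FileHandleCache(object):
        # Simplified sketch of the deque + dictionary approach.
        def __init__(self, maxSize=64):
            self._maxSize = maxSize
            self._order = collections.deque()  # urls, least recently used on the left
            self._handles = {}                 # url -> open file handle

        def getFileHandle(self, url, openMethod):
            if url in self._handles:
                # O(n) in the worst case: move the url to the most recently used end.
                self._order.remove(url)
            else:
                if len(self._handles) >= self._maxSize:
                    # Evict the least recently used file and close its handle.
                    lruUrl = self._order.popleft()
                    self._handles.pop(lruUrl).close()
                self._handles[url] = openMethod(url)
            self._order.append(url)
            return self._handles[url]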

pgrosu commented 9 years ago

@eelsther That's fantastic news - thank you for helping us with this!

~p