ga4gh / ga4gh-server

Reference implementation of the APIs defined in ga4gh-schemas. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Update to Protobuf 3.2 #1623

Open ljdursi opened 7 years ago

ljdursi commented 7 years ago

We're trying to build federated analysis tools on top of the GA4GH reference server across three sites, and consistently find that getting call information is 2-3 orders of magnitude slower (and much more memory intensive) than simply operating on VCF files, which is a show-stopper for us: e.g., ~90s to locally extract the ~3k variants in server_benchmark.py vs. <1s using cyvcf2, also from Python.

In trying to track this down we extracted the code path into a script that could be profiled more directly (gist 1), and while there were some things that could be sped up, one of the big issues was simply the creation of the call objects. Just creating the almost completely empty calls for 100 variants across 2504 samples - not a big WGS analysis! - takes 30s with the pure-Python implementation of the protobuf library (gist 2).
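For a sense of scale, the effect is visible without any of the server machinery at all. A minimal sketch in the spirit of gist 2, using ListValue as a stand-in small message since the generated Call class isn't needed to see the effect:

```python
# Rough timing of bare protobuf message creation, in the spirit of gist 2.
# ListValue is just a stand-in for the nearly-empty Call messages.
import time
from google.protobuf import struct_pb2

n_variants, n_samples = 100, 2504

start = time.time()
calls = []
for _ in range(n_variants):
    for _ in range(n_samples):
        msg = struct_pb2.ListValue()   # one (almost empty) message per call
        calls.append(msg)
print("%d messages in %.1fs" % (len(calls), time.time() - start))
```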

One problem here is that the pure-Python implementation of the protobuf library is known to be extremely slow. As of about three weeks ago, starting with protobuf 3.2, the google protobuf PyPI packages have started distributing manylinux binary wheels, meaning that the C++ implementation is now shipped for Linux: https://github.com/google/protobuf/issues/2623. The alternative is to build it oneself, as we were doing with our docker setup for the servers, and as would still be necessary for MacOS and other OSes, although we're looking into conda builds. As shown in the second half of gist 2, this dramatically drops the time to create the messages - still slower than a native C++ version, but much closer. The last part of gist 1 shows that this roughly halves the time to extract the calls, and dramatically reduces memory usage for the call objects - including on the client, which is of interest for us. The memory usage looks to be on the order of 20x smaller, although comparing the memory usage of the Python objects vs. the C++-backed objects is a little tricky, so it may be less than that.
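For anyone following along, which backend is actually in use can be checked at runtime; note that the selection environment variable has to be set before google.protobuf is first imported:

```python
# Which protobuf backend is active? 'cpp' means the fast C++-backed
# implementation (now shipped in the manylinux wheels) is in use.
# PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION must be set before the first
# import of google.protobuf for it to take effect.
import os
os.environ.setdefault("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION", "cpp")

from google.protobuf.internal import api_implementation
print(api_implementation.Type())  # 'cpp' or 'python'
```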

Moving this into (slightly older versions of) the full server, using the C++ implementation gives a roughly 2x speedup when used on the server, and a dramatic memory drop when used on the client (gist 3). Another 2x is possible by avoiding the serialization to and from JSON strings and using native protobuf serialization instead. It's fairly trivial to add support for an Accept: content type of application/protobuf; we have this against the 0.3.5 version of the server here on the server side and here on the client. We could certainly submit a stalking-horse PR for this against the new release when it comes out, although it would now involve ga4gh-schemas too. Given https://github.com/ga4gh/ga4gh-schemas/issues/850, though, it might be worth giving more thought to a nicer general case covering other content types than what we have here.
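The server-side half of that change is small. A minimal sketch of the content negotiation, assuming a Flask-style handler and a populated response message (the names here are illustrative, not the actual patch):

```python
# Minimal sketch of Accept-header content negotiation for protobuf
# responses. `render_response` is hypothetical; the real change lives
# in the server's response-building path.
from flask import Response, request
from google.protobuf import json_format

def render_response(message):
    accept = request.headers.get("Accept", "")
    if "application/protobuf" in accept:
        # Native binary serialization: no JSON round trip.
        return Response(message.SerializeToString(),
                        mimetype="application/protobuf")
    # Existing behaviour: serialize to a JSON string.
    return Response(json_format.MessageToJson(message),
                    mimetype="application/json")
```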

We're using these now, and the 4x performance speedup (and corresponding drop in memory requirements) is helpful, but it still isn't enough. There are some other possible performance wins on the server side - using cyvcf2 instead of pysam, for instance - but I think the biggest issues after this are baked into the schemas and API at this point and so will need to be discussed there. I see that there's an issue for a calls endpoint (#1602) which would strip out the variants information, and that would help, but the biggest performance problem is that for a user wanting even modest-sized genotype matrices or related call information, millions of these Call messages must be created, serialized, unpacked, and accessed. There are small changes that can improve things - gist 2 points out that using a repeated int instead of a ListValue can make a surprisingly large improvement - but I think performance will ultimately require a structure-of-arrays rather than an array-of-structures approach for calls.
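To make the structure-of-arrays point concrete, a toy comparison under the same 100 x 2504 sizing, with ListValue again standing in for the per-call genotype field:

```python
# Array-of-structures vs structure-of-arrays for genotypes, as a toy
# comparison; ListValue stands in for the per-call genotype field.
import time
from google.protobuf import struct_pb2

n_variants, n_samples = 100, 2504

# Array-of-structures: one message per call, one ListValue per genotype.
t0 = time.time()
aos = []
for _ in range(n_variants * n_samples):
    gt = struct_pb2.ListValue()
    for allele in (0, 1):                # one diploid genotype
        gt.values.add().number_value = allele
    aos.append(gt)
print("per-call messages: %.1fs" % (time.time() - t0))

# Structure-of-arrays: one flat array of alleles for the whole matrix.
t0 = time.time()
soa = [0, 1] * (n_variants * n_samples)
print("flat array:        %.3fs" % (time.time() - t0))
```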

david4096 commented 7 years ago

+1 thanks, the extra analysis is great! I'd like to make this issue closeable by updating our internal requirements for protobuf to 3.2.

Please give me a moment to address some of your other points. For variants, we don't do much other than pysam + de/serialization. Offering an alternative to pysam sounds great and https://github.com/brentp/cyvcf2 seems like an excellent solution. I would be curious to see what code could be shared by the two datamodel classes.

I hope that we can encourage more of your involvement at the schemas level, because you make excellent points that should be discussed there! The genotype went from a repeated int to a ListValue because we didn't have a way of specifying "no call". We could optimize away the ListValue with a oneof over integers and a "no call made" enum. You can see the discussion that brought about the ListValue here: https://github.com/ga4gh/ga4gh-schemas/issues/727 .
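For context, what the ListValue encoding buys today is an in-band missing value; a small illustration of "no call" under the current representation:

```python
# How the current ListValue genotype can express "no call", which a plain
# `repeated int32` could not: NullValue marks the missing allele.
from google.protobuf import struct_pb2

gt = struct_pb2.ListValue()
gt.values.add().number_value = 0                    # called allele
gt.values.add().null_value = struct_pb2.NULL_VALUE  # no call
```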

If I can direct your attention to this issue, https://github.com/ga4gh/ga4gh-schemas/issues/832, it attempts to answer, for RNA, the problem of requesting a matrix as opposed to a list of expression values. I believe the idea could generalize to variants. Your input there would be valuable, and it would be worth creating another issue for an optimized, aggregating variants query. Bringing the calls out makes it easier to write the query you want, but it still incurs request overhead that a single query might avoid.

We'll close this issue when we've upgraded the server to protobuf 3.2. I believe by simply accepting updates on the 3.x release cycle we will be safe within this software. No?

@ljdursi if you can't find another issue within which to address each of your concerns on the schema level, please let me know or make one and we can move from there!

bioinformed commented 7 years ago

@david4096: I'm sorry to hear that pysam's VCF/BCF support isn't working well. Please let me know if you're having any specific problems and I'll let you know if they can be addressed.

ljdursi commented 7 years ago

Thanks, @david4096! Sounds good. As far as I know the 3.x release cycle for protobuf should be safe. There is one caveat - the Python implementation was actually more lenient than it really should have been, and some have reported issues moving to 3.2 and the C++ implementation for that reason. We haven't had issues, but I haven't run the current test suite, largely because we've been using an older version of the schema/server/client.

Sorry, I see now that I could usefully have broken this up into three issues. We're going to try prototyping something for calls / genotype matrices, maybe using https://github.com/ga4gh/ga4gh-schemas/issues/832 as a starting point, and we'll see if we can come up with a concrete working proposal as a basis for discussion. While doing that we'll probably be playing with ListValue vs. enum, etc. Similarly, we can mock up a protobuf serialization PR against the new release unless other content types are already being worked on.

@bioinformed - I've found it easier to get fast performance out of cyvcf2 for the use case mentioned here, largely because it provides numpy arrays for e.g. gt_types and gt_ref_depths. But that doesn't cover all cases, and I may simply not be aware of how to get the best performance out of pysam for VCF.
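Concretely, the cyvcf2 pattern is per-variant numpy arrays rather than per-sample objects; a short example (filename and region hypothetical):

```python
# The cyvcf2 access pattern referred to above: genotype information
# arrives as per-variant numpy arrays, one entry per sample.
from cyvcf2 import VCF

vcf = VCF("calls.vcf.gz")               # hypothetical indexed VCF
for variant in vcf("20:1-100000"):      # region query via the index
    gts = variant.gt_types              # numpy array: 0=hom-ref, 1=het, ...
    ref_depths = variant.gt_ref_depths  # numpy array of REF read depths
```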

david4096 commented 7 years ago

@ljdursi Another interesting thread for your perusal on some early approaches to remodeling the calls. https://github.com/ga4gh/ga4gh-schemas/pull/395

david4096 commented 7 years ago

We've taken similar anecdotal measurements for the binary protobuf. Note that the improvement from switching to binary-on-the-wire protobuf is about 2x, and the improvement from going from Python to C++ is about another 2x. Previous tests done by @andrewjesaitis showed that gRPC was somewhere near twice as fast as the existing serialization. I wonder whether the change in the underlying C++ bindings would realize another doubling in performance simply by using gRPC.

#URL                                client_apitype  srv_apitype  use_protobuf  nvar  time_per_var  kB_per_var
http://ga4gh.ccm.sickkids.ca/ga4gh  cpp             cpp          False          648  0.149440        65.3102213542
http://ga4gh.ccm.sickkids.ca/ga4gh  cpp             cpp          True           648  0.077935        65.3102213542
http://candig-ga4gh.genap.ca/ga4gh  cpp             python       False          605  0.313741        65.3114669421
http://1kgenomes.ga4gh.org/         cpp             python       False         1065  0.313287        80.9306778169
http://ga4gh.ccm.sickkids.ca/ga4gh  python          cpp          False          648  0.145300      1309.5181086
http://ga4gh.ccm.sickkids.ca/ga4gh  python          cpp          True           648  0.103760      1309.66385754
http://candig-ga4gh.genap.ca/ga4gh  python          python       False          605  0.310714      1309.55414514

To get the best performance you will want to pay the upfront ETL cost of placing your data into a distributed store. This lets you balance the costs of indexing, storage, access, and analysis. In concept, a high-performance system might index the variant data by position in a store, and then keep the binary protobuf in a column or field of each variant record. One can then meaningfully compare the trade-offs of storage versus re-serializing. When an API request is made, one simply pays the cost of deserializing the request, looking up within the index, concatenating the messages together, and the actual transfer.
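To make the lookup-and-concatenate step concrete, a small sketch assuming a hypothetical sqlite table of pre-serialized Variant blobs and a response message whose first field is `repeated Variant variants = 1;` (which is what makes plain concatenation below a valid serialized response):

```python
# Sketch of serving pre-serialized protobuf from a position-indexed store.
# Assumes a hypothetical table variants(reference, start, blob), where
# blob = Variant.SerializeToString(), and `db` is e.g. an
# sqlite3.Connection. A serialized repeated message field is just a
# sequence of length-delimited records, so the response body can be
# assembled by concatenation with no per-variant re-serialization.

def _varint(n):
    # Protobuf base-128 varint encoding.
    out = bytearray()
    while True:
        bits, n = n & 0x7F, n >> 7
        out.append(bits | (0x80 if n else 0))
        if not n:
            return bytes(out)

TAG_FIELD1_LEN = b"\x0a"  # field number 1, wire type 2 (length-delimited)

def assemble_response(db, reference, start, end):
    rows = db.execute(
        "SELECT blob FROM variants"
        " WHERE reference = ? AND start >= ? AND start < ? ORDER BY start",
        (reference, start, end))
    return b"".join(TAG_FIELD1_LEN + _varint(len(blob)) + blob
                    for (blob,) in rows)
```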

We can of course improve our software's performance, but at the schema level we can do a few things beyond simplifying the messages. First, by reducing the number of requests needed to answer the same question, we can reduce the amount of round-trip transfer. In practice we have shied away from merge/join-type API calls because they present an order-of-magnitude increase in serving complexity, but I look forward to your insights.

The other thing that we can do to optimize the schemas is to offer a binary-on-the-wire protocol (as an alternative?). I believe that this will get you a modest performance improvement (~2x to ~5x according to @ljdursi's analysis). The added complexity cost is low here, especially if one accepts the gRPC implementation.

Streaming in the gRPC style did not give us the performance improvements I had expected when looking at this in the past, but I believe that may have been because the protobuf bindings were the Python ones rather than the C++ ones. We might take another look and find that approach greatly improved with the C++-backed Python wheels. https://docs.google.com/document/d/1psUfafYyhgdAYjq1Qm_YAkGc3IDz___4F3nWeR1gkM8/edit?usp=sharing

And finally, I hope we get a chance to look at the federation use cases (schemas issue)!

bioinformed commented 7 years ago

@ljdursi: Accessing INFO and FORMAT values as numpy arrays is a perfectly reasonable feature request. This will push handling of missing values to client code, but that is the trade-off for speed. I'll look into adding this functionality to pysam, though I also understand that switching to cyvcf2 may be the easier route to take.