ga4gh / ga4gh-server

Reference implementation of the APIs defined in ga4gh-schemas. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0
96 stars 91 forks source link

Performance analysis of HtslibVariantSet required #240

Closed jeromekelleher closed 9 years ago

jeromekelleher commented 9 years ago

As part of our removal of wormtable detailed in #210, we need a performance comparison between the HtslibVariantSet and the WormtableVariantSet. This should use the server_benchmark.py script.

Before we do the comparison, we need to support searching by CallSetId in the HtslibVariantSet as detailed in #221.

shajoezhu commented 9 years ago

Performance comparison on BCF, VCF and Wormtable data format. It seems that wormtable is faster than both vcf and bcf. It doesn't appear to have significant difference between bcf and vcf

1000g_2013 BCF vs VCF vs Wormtable

BCF

$ python server_benchmark.py --profile=cpu 1000g_2013.bcf -c "*"
44.35
         83460438 function calls (70428622 primitive calls) in 130.463 seconds

   Ordered by: internal time
   List reduced from 72 to 18 due to restriction <0.25>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
4505400/250400   59.465    0.000    0.228    0.000 {method 'iteritems' of 'pysam.cbcf.VariantRecordSample' objects}
2254509/9   25.581    0.000   36.817    4.091 /home/joezhu/server/ga4gh/protocol.py:78(toJsonDict)
4506300/250400   17.777    0.000    1.173    0.000 /home/joezhu/server/ga4gh/datamodel/variants.py:512(_convertGaCall)
  2253600    8.434    0.000    8.712    0.000 pysam/cbcf.pyx:1549(__get__)
 13521600    3.789    0.000    3.789    0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:104(isEmbeddedType)
 13536048    2.402    0.000    2.402    0.000 {isinstance}
  2253600    2.311    0.000    2.311    0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:117(__init__)
 13532418    2.130    0.000    2.130    0.000 {getattr}
  2253600    1.768    0.000    4.021    0.000 pysam/cbcf.pyx:1642(__getitem__)
  2253600    1.744    0.000    1.997    0.000 pysam/cbcf.pyx:1591(__get__)
2255409/909    1.050    0.000   36.817    0.041 /home/joezhu/server/ga4gh/protocol.py:87(<genexpr>)
  2254518    0.793    0.000    1.862    0.000 /home/joezhu/.local/lib/python2.7/site-packages/avro/schema.py:669(<lambda>)
  2254518    0.790    0.000    1.069    0.000 /home/joezhu/.local/lib/python2.7/site-packages/avro/schema.py:137(get_prop)
  6760800    0.787    0.000    0.787    0.000 pysam/cbcf.pyx:376(is_gt_fmt)
  2253600    0.484    0.000    0.484    0.000 pysam/cbcf.pyx:1719(makeVariantRecordSample)
  2254869    0.401    0.000    0.401    0.000 pysam/cbcf.pyx:1423(__get__)
  2254500    0.291    0.000    0.291    0.000 {method 'append' of 'list' objects}
  2254590    0.279    0.000    0.279    0.000 {method 'get' of 'dict' objects}

VCF

$ python server_benchmark.py --profile=cpu 1000g_2013 -c "*"
43.04
         83460447 function calls (70428631 primitive calls) in 120.795 seconds

   Ordered by: internal time
   List reduced from 73 to 18 due to restriction <0.25>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
4505400/250400   56.447    0.000    0.091    0.000 {method 'iteritems' of 'pysam.cbcf.VariantRecordSample' objects}
2254509/9   22.592    0.000   33.869    3.763 /home/joezhu/server/ga4gh/protocol.py:78(toJsonDict)
4506300/250400   16.373    0.000    1.186    0.000 /home/joezhu/server/ga4gh/datamodel/variants.py:512(_convertGaCall)
  2253600    6.232    0.000    6.457    0.000 pysam/cbcf.pyx:1549(__get__)
 13521600    3.685    0.000    3.685    0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:104(isEmbeddedType)
 13536048    2.400    0.000    2.400    0.000 {isinstance}
  2253600    2.332    0.000    2.332    0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:117(__init__)
 13532418    2.261    0.000    2.261    0.000 {getattr}
  2253600    1.856    0.000    3.808    0.000 pysam/cbcf.pyx:1642(__getitem__)
  2253600    1.489    0.000    1.715    0.000 pysam/cbcf.pyx:1591(__get__)
2255409/909    1.075    0.000   33.868    0.037 /home/joezhu/server/ga4gh/protocol.py:87(<genexpr>)
  2254518    0.796    0.000    1.854    0.000 /home/joezhu/.local/lib/python2.7/site-packages/avro/schema.py:669(<lambda>)
  2254518    0.792    0.000    1.058    0.000 /home/joezhu/.local/lib/python2.7/site-packages/avro/schema.py:137(get_prop)
  6760800    0.689    0.000    0.689    0.000 pysam/cbcf.pyx:376(is_gt_fmt)
  2253600    0.453    0.000    0.453    0.000 pysam/cbcf.pyx:1719(makeVariantRecordSample)
  2254869    0.414    0.000    0.414    0.000 pysam/cbcf.pyx:1423(__get__)
  2254590    0.266    0.000    0.266    0.000 {method 'get' of 'dict' objects}
  2254500    0.264    0.000    0.264    0.000 {method 'append' of 'list' objects}

Wormtable

$ python server_benchmark.py --profile=cpu 1000g_2013.wt -c "*"
20.38
         67769403 function calls (63260403 primitive calls) in 49.761 seconds

   Ordered by: internal time
   List reduced from 56 to 14 due to restriction <0.25>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
2254509/9   18.154    0.000   29.583    3.287 /home/joezhu/server/ga4gh/protocol.py:78(toJsonDict)
      900    5.541    0.006   19.922    0.022 /home/joezhu/server/ga4gh/datamodel/variants.py:270(convertVariant)
  2253600    5.430    0.000    5.430    0.000 {zip}
 13521600    4.846    0.000    4.846    0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:104(isEmbeddedType)
  2253606    3.110    0.000    3.110    0.000 {map}
  2253600    2.848    0.000    7.239    0.000 /home/joezhu/server/ga4gh/datamodel/variants.py:32(convertVCFGenotype)
 13558821    2.092    0.000    2.092    0.000 {isinstance}
 13532418    1.682    0.000    1.682    0.000 {getattr}
  2253600    1.387    0.000    1.387    0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:117(__init__)
2255409/909    0.975    0.000   29.582    0.033 /home/joezhu/server/ga4gh/protocol.py:87(<genexpr>)
  2254866    0.917    0.000    0.917    0.000 {method 'split' of 'str' objects}
  2254518    0.787    0.000    1.059    0.000 /home/joezhu/.local/lib/python2.7/site-packages/avro/schema.py:137(get_prop)
  2254518    0.774    0.000    1.832    0.000 /home/joezhu/.local/lib/python2.7/site-packages/avro/schema.py:669(<lambda>)
  2253600    0.366    0.000    0.366    0.000 /home/joezhu/server/ga4gh/datamodel/variants.py:21(convertVCFPhaseset)

1000g_2011 VCF vs Wormtable

VCF

$ python server_benchmark.py --profile=cpu 1000g_2011 -c "*"
20.19
         44330190 function calls (38640774 primitive calls) in 54.795 seconds

   Ordered by: internal time
   List reduced from 73 to 18 due to restriction <0.25>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1963800/109200   17.092    0.000    0.078    0.000 {method 'iteritems' of 'pysam.cbcf.VariantRecordSample' objects}
1964700/109200    8.168    0.000    1.788    0.000 /home/joezhu/server/ga4gh/datamodel/variants.py:512(_convertGaCall)
 983709/9    7.868    0.000   13.460    1.496 /home/joezhu/server/ga4gh/protocol.py:78(toJsonDict)
  1967382    6.840    0.000    6.840    0.000 pysam/cbcf.pyx:296(bcf_array_to_object)
  5896800    2.389    0.000    2.389    0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:104(isEmbeddedType)
   982800    2.349    0.000    2.450    0.000 pysam/cbcf.pyx:1549(__get__)
  2948400    1.870    0.000    9.665    0.000 pysam/cbcf.pyx:1642(__getitem__)
  6902097    1.469    0.000    1.469    0.000 {isinstance}
   994479    1.298    0.000    1.756    0.000 /home/joezhu/server/ga4gh/datamodel/variants.py:427(_encodeValue)
  5907618    0.930    0.000    0.930    0.000 {getattr}
   982800    0.865    0.000    0.865    0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:117(__init__)
      900    0.657    0.001    0.658    0.001 pysam/cbcf.pyx:2032(__next__)
   982800    0.633    0.000    0.719    0.000 pysam/cbcf.pyx:1591(__get__)
984609/909    0.458    0.000   13.459    0.015 /home/joezhu/server/ga4gh/protocol.py:87(<genexpr>)
  4914000    0.424    0.000    0.424    0.000 pysam/cbcf.pyx:376(is_gt_fmt)
   983718    0.343    0.000    0.798    0.000 /home/joezhu/.local/lib/python2.7/site-packages/avro/schema.py:669(<lambda>)
   983718    0.335    0.000    0.456    0.000 /home/joezhu/.local/lib/python2.7/site-packages/avro/schema.py:137(get_prop)
   984600    0.249    0.000    0.249    0.000 pysam/cbcf.pyx:1423(__get__)

Wormtable

$ python server_benchmark.py --profile=cpu 1000g_2011.wt -c "*"
11.7
         31660092 function calls (29692692 primitive calls) in 25.994 seconds

   Ordered by: internal time
   List reduced from 56 to 14 due to restriction <0.25>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 983709/9    8.570    0.000   12.829    1.425 /home/joezhu/server/ga4gh/protocol.py:78(toJsonDict)
      900    4.111    0.005   12.369    0.014 /home/joezhu/server/ga4gh/datamodel/variants.py:270(convertVariant)
   982800    2.745    0.000    2.745    0.000 {zip}
   982806    1.398    0.000    1.398    0.000 {map}
   982800    1.346    0.000    3.358    0.000 /home/joezhu/server/ga4gh/datamodel/variants.py:32(convertVCFGenotype)
  6931878    1.246    0.000    1.246    0.000 {isinstance}
  5896800    1.194    0.000    1.194    0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:104(isEmbeddedType)
   994479    1.058    0.000    1.312    0.000 /home/joezhu/server/ga4gh/datamodel/variants.py:259(convertInfoField)
  5907618    0.796    0.000    0.796    0.000 {getattr}
      909    0.758    0.001   13.158    0.014 /home/joezhu/server/ga4gh/datamodel/variants.py:321(getVariants)
   982800    0.692    0.000    0.692    0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:117(__init__)
   984600    0.450    0.000    0.450    0.000 {method 'split' of 'str' objects}
984609/909    0.443    0.000   12.829    0.014 /home/joezhu/server/ga4gh/protocol.py:87(<genexpr>)
   983718    0.365    0.000    0.480    0.000 /home/joezhu/.local/lib/python2.7/site-packages/avro/schema.py:137(get_prop)
jeromekelleher commented 9 years ago

Thanks @shajoezhu, this seems like an excellent result to me. OK, we have a bit of a bottleneck in the HtslibVariantSet code, but we should be able to work around this. Otherwise, it looks like the overhead of parsing VCF and BCF is minimal and we should be able to get back to the wormtable level of performance without too much trouble.

Based on this, I'd vote to plough ahead with the wormtablectomy. Any thoughts?

bioinformed commented 9 years ago

I'm looking into the apparent performance bottleneck in sample iteration ( pysam.cbcf.VariantRecordSample.iteritems). I suspect there is some low-hanging fruit to optimize.

-Kevin

On Fri, Mar 20, 2015 at 5:47 AM, Jerome Kelleher notifications@github.com wrote:

Thanks @shajoezhu https://github.com/shajoezhu, this seems like an excellent result to me. OK, we have a bit of a bottleneck in the HtslibVariantSet code, but we should be able to work around this. Otherwise, it looks like the overhead of parsing VCF and BCF is minimal and we should be able to get back to the wormtable level of performance without too much trouble.

Based on this, I'd vote to plough ahead with the wormtablectomy. Any thoughts?

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/server/issues/240#issuecomment-83970330.

pgrosu commented 9 years ago

Hi Kevin,

Anything is possible :) So currently it's just a generator and will only run when accessing it as in lines 1696-1699 of cbcf.pyx:

  def iteritems(self):
    """D.iteritems() -> an iterator over the (key, value) items of D"""
    for key in self:
      yield (key, self[key])

So maybe pysam can be updated for that part to be pre-indexed as a sorted index of hashes, that are stored as a tree. As you can see the possibilities are several :)

Paul

jeromekelleher commented 9 years ago

That sounds great @bioinformed, thanks for looking into it. While you're thinking about this, would it be possible to have a look at the subset_samples code also? At the moment, (I think) you can only call this once for an opened file. For our use case though, we want to keep the file open for a long period of time and call subset_samples on it repeatedly (each time we have a request). Perhaps a context manager would be good here? I.e.,

with varFile.subset_samples(include_sample):
    for record in varFile.fetch(...):
        yield convertRecord(record)

I'm not sure if this is feasible, and this is just an idea for how it might be done. There's absolutely no urgency in this, as the sample subsetting isn't really hurting us. I just thought I'd bring this up while you were thinking about this part of the code.

bioinformed commented 9 years ago

The main reason why subset_samples works only prior to reading records is to ensure that all Python and C object lifetimes are synchronized. Because the header data structure (bcf_hdr_t) is re-written when subsetting, it can lead to situations where Python objects contain references to C data structures that have been freed or reused and thus crash the interpreter (or worse) . I'll look into what it will take to safely allow subset_samples with each fetch or an inexpensive re-open process that avoids reparsing headers and re-reading indicies.

On Fri, Mar 20, 2015 at 12:56 PM, Jerome Kelleher notifications@github.com wrote:

That sounds great @bioinformed https://github.com/bioinformed, thanks for looking into it. While you're thinking about this, would it be possible to have a look at the subset_samples code also? At the moment, (I think) you can only call this once for an opened file. For our use case though, we want to keep the file open for a long period of time and call subset_samples on it repeatedly (each time we have a request). Perhaps a context manager would be good here? I.e.,

with varFile.subset_samples(include_sample): for record in varFile.fetch(...): yield convertRecord(record)

I'm not sure if this is feasible, and this is just an idea for how it might be done. There's absolutely no urgency in this, as the sample subsetting isn't really hurting us. I just thought I'd bring this up while you were thinking about this part of the code.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/server/issues/240#issuecomment-84068807.

jeromekelleher commented 9 years ago

That makes sense - thanks @bioinformed.