Closed jeromekelleher closed 9 years ago
$ python server_benchmark.py --profile=cpu 1000g_2013.bcf -c "*"
44.35
83460438 function calls (70428622 primitive calls) in 130.463 seconds
Ordered by: internal time
List reduced from 72 to 18 due to restriction <0.25>
ncalls tottime percall cumtime percall filename:lineno(function)
4505400/250400 59.465 0.000 0.228 0.000 {method 'iteritems' of 'pysam.cbcf.VariantRecordSample' objects}
2254509/9 25.581 0.000 36.817 4.091 /home/joezhu/server/ga4gh/protocol.py:78(toJsonDict)
4506300/250400 17.777 0.000 1.173 0.000 /home/joezhu/server/ga4gh/datamodel/variants.py:512(_convertGaCall)
2253600 8.434 0.000 8.712 0.000 pysam/cbcf.pyx:1549(__get__)
13521600 3.789 0.000 3.789 0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:104(isEmbeddedType)
13536048 2.402 0.000 2.402 0.000 {isinstance}
2253600 2.311 0.000 2.311 0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:117(__init__)
13532418 2.130 0.000 2.130 0.000 {getattr}
2253600 1.768 0.000 4.021 0.000 pysam/cbcf.pyx:1642(__getitem__)
2253600 1.744 0.000 1.997 0.000 pysam/cbcf.pyx:1591(__get__)
2255409/909 1.050 0.000 36.817 0.041 /home/joezhu/server/ga4gh/protocol.py:87(<genexpr>)
2254518 0.793 0.000 1.862 0.000 /home/joezhu/.local/lib/python2.7/site-packages/avro/schema.py:669(<lambda>)
2254518 0.790 0.000 1.069 0.000 /home/joezhu/.local/lib/python2.7/site-packages/avro/schema.py:137(get_prop)
6760800 0.787 0.000 0.787 0.000 pysam/cbcf.pyx:376(is_gt_fmt)
2253600 0.484 0.000 0.484 0.000 pysam/cbcf.pyx:1719(makeVariantRecordSample)
2254869 0.401 0.000 0.401 0.000 pysam/cbcf.pyx:1423(__get__)
2254500 0.291 0.000 0.291 0.000 {method 'append' of 'list' objects}
2254590 0.279 0.000 0.279 0.000 {method 'get' of 'dict' objects}
$ python server_benchmark.py --profile=cpu 1000g_2013 -c "*"
43.04
83460447 function calls (70428631 primitive calls) in 120.795 seconds
Ordered by: internal time
List reduced from 73 to 18 due to restriction <0.25>
ncalls tottime percall cumtime percall filename:lineno(function)
4505400/250400 56.447 0.000 0.091 0.000 {method 'iteritems' of 'pysam.cbcf.VariantRecordSample' objects}
2254509/9 22.592 0.000 33.869 3.763 /home/joezhu/server/ga4gh/protocol.py:78(toJsonDict)
4506300/250400 16.373 0.000 1.186 0.000 /home/joezhu/server/ga4gh/datamodel/variants.py:512(_convertGaCall)
2253600 6.232 0.000 6.457 0.000 pysam/cbcf.pyx:1549(__get__)
13521600 3.685 0.000 3.685 0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:104(isEmbeddedType)
13536048 2.400 0.000 2.400 0.000 {isinstance}
2253600 2.332 0.000 2.332 0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:117(__init__)
13532418 2.261 0.000 2.261 0.000 {getattr}
2253600 1.856 0.000 3.808 0.000 pysam/cbcf.pyx:1642(__getitem__)
2253600 1.489 0.000 1.715 0.000 pysam/cbcf.pyx:1591(__get__)
2255409/909 1.075 0.000 33.868 0.037 /home/joezhu/server/ga4gh/protocol.py:87(<genexpr>)
2254518 0.796 0.000 1.854 0.000 /home/joezhu/.local/lib/python2.7/site-packages/avro/schema.py:669(<lambda>)
2254518 0.792 0.000 1.058 0.000 /home/joezhu/.local/lib/python2.7/site-packages/avro/schema.py:137(get_prop)
6760800 0.689 0.000 0.689 0.000 pysam/cbcf.pyx:376(is_gt_fmt)
2253600 0.453 0.000 0.453 0.000 pysam/cbcf.pyx:1719(makeVariantRecordSample)
2254869 0.414 0.000 0.414 0.000 pysam/cbcf.pyx:1423(__get__)
2254590 0.266 0.000 0.266 0.000 {method 'get' of 'dict' objects}
2254500 0.264 0.000 0.264 0.000 {method 'append' of 'list' objects}
$ python server_benchmark.py --profile=cpu 1000g_2013.wt -c "*"
20.38
67769403 function calls (63260403 primitive calls) in 49.761 seconds
Ordered by: internal time
List reduced from 56 to 14 due to restriction <0.25>
ncalls tottime percall cumtime percall filename:lineno(function)
2254509/9 18.154 0.000 29.583 3.287 /home/joezhu/server/ga4gh/protocol.py:78(toJsonDict)
900 5.541 0.006 19.922 0.022 /home/joezhu/server/ga4gh/datamodel/variants.py:270(convertVariant)
2253600 5.430 0.000 5.430 0.000 {zip}
13521600 4.846 0.000 4.846 0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:104(isEmbeddedType)
2253606 3.110 0.000 3.110 0.000 {map}
2253600 2.848 0.000 7.239 0.000 /home/joezhu/server/ga4gh/datamodel/variants.py:32(convertVCFGenotype)
13558821 2.092 0.000 2.092 0.000 {isinstance}
13532418 1.682 0.000 1.682 0.000 {getattr}
2253600 1.387 0.000 1.387 0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:117(__init__)
2255409/909 0.975 0.000 29.582 0.033 /home/joezhu/server/ga4gh/protocol.py:87(<genexpr>)
2254866 0.917 0.000 0.917 0.000 {method 'split' of 'str' objects}
2254518 0.787 0.000 1.059 0.000 /home/joezhu/.local/lib/python2.7/site-packages/avro/schema.py:137(get_prop)
2254518 0.774 0.000 1.832 0.000 /home/joezhu/.local/lib/python2.7/site-packages/avro/schema.py:669(<lambda>)
2253600 0.366 0.000 0.366 0.000 /home/joezhu/server/ga4gh/datamodel/variants.py:21(convertVCFPhaseset)
$ python server_benchmark.py --profile=cpu 1000g_2011 -c "*"
20.19
44330190 function calls (38640774 primitive calls) in 54.795 seconds
Ordered by: internal time
List reduced from 73 to 18 due to restriction <0.25>
ncalls tottime percall cumtime percall filename:lineno(function)
1963800/109200 17.092 0.000 0.078 0.000 {method 'iteritems' of 'pysam.cbcf.VariantRecordSample' objects}
1964700/109200 8.168 0.000 1.788 0.000 /home/joezhu/server/ga4gh/datamodel/variants.py:512(_convertGaCall)
983709/9 7.868 0.000 13.460 1.496 /home/joezhu/server/ga4gh/protocol.py:78(toJsonDict)
1967382 6.840 0.000 6.840 0.000 pysam/cbcf.pyx:296(bcf_array_to_object)
5896800 2.389 0.000 2.389 0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:104(isEmbeddedType)
982800 2.349 0.000 2.450 0.000 pysam/cbcf.pyx:1549(__get__)
2948400 1.870 0.000 9.665 0.000 pysam/cbcf.pyx:1642(__getitem__)
6902097 1.469 0.000 1.469 0.000 {isinstance}
994479 1.298 0.000 1.756 0.000 /home/joezhu/server/ga4gh/datamodel/variants.py:427(_encodeValue)
5907618 0.930 0.000 0.930 0.000 {getattr}
982800 0.865 0.000 0.865 0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:117(__init__)
900 0.657 0.001 0.658 0.001 pysam/cbcf.pyx:2032(__next__)
982800 0.633 0.000 0.719 0.000 pysam/cbcf.pyx:1591(__get__)
984609/909 0.458 0.000 13.459 0.015 /home/joezhu/server/ga4gh/protocol.py:87(<genexpr>)
4914000 0.424 0.000 0.424 0.000 pysam/cbcf.pyx:376(is_gt_fmt)
983718 0.343 0.000 0.798 0.000 /home/joezhu/.local/lib/python2.7/site-packages/avro/schema.py:669(<lambda>)
983718 0.335 0.000 0.456 0.000 /home/joezhu/.local/lib/python2.7/site-packages/avro/schema.py:137(get_prop)
984600 0.249 0.000 0.249 0.000 pysam/cbcf.pyx:1423(__get__)
$ python server_benchmark.py --profile=cpu 1000g_2011.wt -c "*"
11.7
31660092 function calls (29692692 primitive calls) in 25.994 seconds
Ordered by: internal time
List reduced from 56 to 14 due to restriction <0.25>
ncalls tottime percall cumtime percall filename:lineno(function)
983709/9 8.570 0.000 12.829 1.425 /home/joezhu/server/ga4gh/protocol.py:78(toJsonDict)
900 4.111 0.005 12.369 0.014 /home/joezhu/server/ga4gh/datamodel/variants.py:270(convertVariant)
982800 2.745 0.000 2.745 0.000 {zip}
982806 1.398 0.000 1.398 0.000 {map}
982800 1.346 0.000 3.358 0.000 /home/joezhu/server/ga4gh/datamodel/variants.py:32(convertVCFGenotype)
6931878 1.246 0.000 1.246 0.000 {isinstance}
5896800 1.194 0.000 1.194 0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:104(isEmbeddedType)
994479 1.058 0.000 1.312 0.000 /home/joezhu/server/ga4gh/datamodel/variants.py:259(convertInfoField)
5907618 0.796 0.000 0.796 0.000 {getattr}
909 0.758 0.001 13.158 0.014 /home/joezhu/server/ga4gh/datamodel/variants.py:321(getVariants)
982800 0.692 0.000 0.692 0.000 /home/joezhu/server/ga4gh/_protocol_definitions.py:117(__init__)
984600 0.450 0.000 0.450 0.000 {method 'split' of 'str' objects}
984609/909 0.443 0.000 12.829 0.014 /home/joezhu/server/ga4gh/protocol.py:87(<genexpr>)
983718 0.365 0.000 0.480 0.000 /home/joezhu/.local/lib/python2.7/site-packages/avro/schema.py:137(get_prop)
Thanks @shajoezhu, this seems like an excellent result to me. OK, we have a bit of a bottleneck in the HtslibVariantSet code, but we should be able to work around this. Otherwise, it looks like the overhead of parsing VCF and BCF is minimal and we should be able to get back to the wormtable level of performance without too much trouble.
Based on this, I'd vote to plough ahead with the wormtablectomy. Any thoughts?
I'm looking into the apparent performance bottleneck in sample iteration ( pysam.cbcf.VariantRecordSample.iteritems). I suspect there is some low-hanging fruit to optimize.
-Kevin
On Fri, Mar 20, 2015 at 5:47 AM, Jerome Kelleher notifications@github.com wrote:
Thanks @shajoezhu https://github.com/shajoezhu, this seems like an excellent result to me. OK, we have a bit of a bottleneck in the HtslibVariantSet code, but we should be able to work around this. Otherwise, it looks like the overhead of parsing VCF and BCF is minimal and we should be able to get back to the wormtable level of performance without too much trouble.
Based on this, I'd vote to plough ahead with the wormtablectomy. Any thoughts?
— Reply to this email directly or view it on GitHub https://github.com/ga4gh/server/issues/240#issuecomment-83970330.
Hi Kevin,
Anything is possible :) So currently it's just a generator and will only run when accessing it as in lines 1696-1699 of cbcf.pyx:
def iteritems(self):
"""D.iteritems() -> an iterator over the (key, value) items of D"""
for key in self:
yield (key, self[key])
So maybe pysam can be updated for that part to be pre-indexed as a sorted index of hashes, that are stored as a tree. As you can see the possibilities are several :)
Paul
That sounds great @bioinformed, thanks for looking into it. While you're thinking about this, would it be possible to have a look at the subset_samples
code also? At the moment, (I think) you can only call this once for an opened file. For our use case though, we want to keep the file open for a long period of time and call subset_samples on it repeatedly (each time we have a request). Perhaps a context manager would be good here? I.e.,
with varFile.subset_samples(include_sample):
for record in varFile.fetch(...):
yield convertRecord(record)
I'm not sure if this is feasible, and this is just an idea for how it might be done. There's absolutely no urgency in this, as the sample subsetting isn't really hurting us. I just thought I'd bring this up while you were thinking about this part of the code.
The main reason why subset_samples works only prior to reading records is to ensure that all Python and C object lifetimes are synchronized. Because the header data structure (bcf_hdr_t) is re-written when subsetting, it can lead to situations where Python objects contain references to C data structures that have been freed or reused and thus crash the interpreter (or worse) . I'll look into what it will take to safely allow subset_samples with each fetch or an inexpensive re-open process that avoids reparsing headers and re-reading indicies.
On Fri, Mar 20, 2015 at 12:56 PM, Jerome Kelleher notifications@github.com wrote:
That sounds great @bioinformed https://github.com/bioinformed, thanks for looking into it. While you're thinking about this, would it be possible to have a look at the subset_samples code also? At the moment, (I think) you can only call this once for an opened file. For our use case though, we want to keep the file open for a long period of time and call subset_samples on it repeatedly (each time we have a request). Perhaps a context manager would be good here? I.e.,
with varFile.subset_samples(include_sample): for record in varFile.fetch(...): yield convertRecord(record)
I'm not sure if this is feasible, and this is just an idea for how it might be done. There's absolutely no urgency in this, as the sample subsetting isn't really hurting us. I just thought I'd bring this up while you were thinking about this part of the code.
— Reply to this email directly or view it on GitHub https://github.com/ga4gh/server/issues/240#issuecomment-84068807.
That makes sense - thanks @bioinformed.
As part of our removal of wormtable detailed in #210, we need a performance comparison between the HtslibVariantSet and the WormtableVariantSet. This should use the
server_benchmark.py
script.Before we do the comparison, we need to support searching by CallSetId in the HtslibVariantSet as detailed in #221.