fvutils / pyvsc

Python packages providing a library for Verification Stimulus and Coverage
https://fvutils.github.io/pyvsc
Apache License 2.0

Performance impact of using wildcard bins #160

Closed: walido78 closed this issue 2 years ago

walido78 commented 2 years ago

Hello, as requested, I am opening an issue about the performance impact of using wildcard bins. With this covergroup:

@vsc.covergroup
class internal_coverage(object):
    def __init__(self, uut):
        # Coverpoint with one wildcard bin per individual carry bit
        self.caryallcount = ccc.coverpoint(uut, lambda: uut.adder.c.value, bins={
            str(63-i): vsc.wildcard_bin("0b" + "x" * (63-i) + "1" + "x" * i) for i in range(64)})
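
(For reference, each dict entry above is a 64-character wildcard mask with a single fixed '1'; the tiny standalone snippet below, which needs no vsc import and is purely illustrative, prints a few of the generated masks:)

# Purely illustrative: print a few of the 64-bit wildcard masks that the
# comprehension above generates (the '1' sits at bit position i from the LSB,
# and the bin is named str(63-i) as in the covergroup).
for i in (0, 1, 63):
    mask = "0b" + "x" * (63 - i) + "1" + "x" * i
    print("bin '%s': %s" % (63 - i, mask))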

After profiling I get this performance:

   33135764 function calls (33059289 primitive calls) in 26.629 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
65543/44550    0.078    0.000   25.433    0.001 scheduler.py:330(react)
    44548    0.485    0.000   25.367    0.001 scheduler.py:355(_event_loop)
71689/65543    0.271    0.000   24.475    0.000 scheduler.py:744(schedule)
71689/65543    0.068    0.000   23.003    0.000 decorators.py:137(_advance)
71689/65543    0.036    0.000   22.936    0.000 outcomes.py:35(send)
71689/65543    0.038    0.000   22.902    0.000 {method 'send' of 'coroutine' objects}
     7425    0.051    0.000   21.536    0.003 sampler.py:23(sampler)
    22272    0.078    0.000   21.115    0.001 coverage.py:114(sample)
44544/22272    0.156    0.000   20.965    0.001 covergroup_model.py:64(sample)
   103936    0.467    0.000   20.736    0.000 coverpoint_model.py:171(sample)
   950272    1.264    0.000   15.080    0.000 coverpoint_bin_single_wildcard_model.py:27(sample)
  1395712    0.560    0.000   14.080    0.000 coverpoint_model.py:211(get_val)
   697856    0.329    0.000   13.520    0.000 expr_ref_model.py:33(val)
   699904    0.846    0.000    9.573    0.000 handle.py:718(value)
   475136    0.987    0.000    8.995    0.000 coverage_points.py:18(<lambda>)
   699904    2.089    0.000    6.191    0.000 binary.py:97(__init__)
  1175040    0.491    0.000    4.856    0.000 binary.py:291(integer)
  1175040    0.802    0.000    4.365    0.000 binary.py:199(_convert_from_unsigned)
   950272    0.283    0.000    4.189    0.000 binary.py:456(__int__)
   699904    0.487    0.000    4.101    0.000 binary.py:144(assign)
  1175040    3.010    0.000    3.563    0.000 binary.py:37(resolve)
   311808    0.290    0.000    3.502    0.000 coverpoint_bin_single_bag_model.py:75(sample)
   699904    2.972    0.000    3.404    0.000 binary.py:396(binstr)
   699904    2.445    0.000    2.445    0.000 {method 'get_signal_val_binstr' of 'cocotb.simulator.gpi_sim_hdl' objects}
    14848    0.128    0.000    1.854    0.000 coverpoint_bin_collection_model.py:85(sample)
   133632    0.107    0.000    1.539    0.000 coverpoint_bin_single_range_model.py:45(sample)
    74240    0.136    0.000    1.446    0.000 coverage_points.py:55(<lambda>)
  1182729    0.796    0.000    1.349    0.000 handle.py:292(__getattr__)
   118920    0.727    0.000    1.175    0.000 stagemanager.py:55(set_stage_name)
    67849    0.171    0.000    1.006    0.000 scheduler.py:524(_resume_coro_upon)
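
(Aside: these listings are standard cProfile/pstats output sorted by cumulative time. A minimal, generic way to produce the same kind of view, with a hypothetical workload standing in for the actual cocotb testbench run, is:)

import cProfile
import pstats

def workload():
    # Hypothetical stand-in for the real cocotb testbench run.
    return sum(i * i for i in range(1_000_000))

pr = cProfile.Profile()
pr.enable()
workload()
pr.disable()

# Print the top entries sorted by cumulative time, like the listings above.
pstats.Stats(pr).sort_stats("cumulative").print_stats(20)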

And after replacing the previous covergroup with one that behaves the same way but spreads the bins across individual coverpoints, one per carry bit, instead of keeping them all in a single coverpoint:

@vsc.covergroup
class internal_coverage(object):
    _inst=None

    def __init__(self,uut):     
        self.points1 = ccc.coverpoint(uut,lambda:uut.adder.c[1], bins={"1":vsc.bin(1)}) 
        self.points2 = ccc.coverpoint(uut,lambda:uut.adder.c[2], bins={"1":vsc.bin(1)})  
        self.points3 = ccc.coverpoint(uut,lambda:uut.adder.c[3], bins={"1":vsc.bin(1)})  
        self.points4 = ccc.coverpoint(uut,lambda:uut.adder.c[4], bins={"1":vsc.bin(1)})  
        # ... and so on, up to self.points64

I get this performance:

         25440488 function calls (25364013 primitive calls) in 18.014 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
65543/44550    0.080    0.000   17.785    0.000 scheduler.py:330(react)
    44548    0.472    0.000   17.719    0.000 scheduler.py:355(_event_loop)
71689/65543    0.261    0.000   16.850    0.000 scheduler.py:744(schedule)
71689/65543    0.066    0.000   15.407    0.000 decorators.py:137(_advance)
71689/65543    0.038    0.000   15.341    0.000 outcomes.py:35(send)
71689/65543    0.038    0.000   15.308    0.000 {method 'send' of 'coroutine' objects}
     7425    0.051    0.000   13.952    0.002 sampler.py:23(sampler)
    22272    0.070    0.000   13.453    0.001 coverage.py:114(sample)
44544/22272    0.616    0.000   13.311    0.001 covergroup_model.py:64(sample)
  1024512    0.815    0.000   12.620    0.000 coverpoint_model.py:171(sample)
  1247232    1.469    0.000   10.050    0.000 coverpoint_bin_single_bag_model.py:75(sample)
  1380864    0.522    0.000    7.275    0.000 coverpoint_model.py:211(get_val)
   690432    0.412    0.000    6.753    0.000 expr_ref_model.py:33(val)
   224768    0.292    0.000    2.565    0.000 handle.py:718(value)
   935424    0.461    0.000    2.353    0.000 handle.py:724(__int__)
    14848    0.120    0.000    1.923    0.000 coverpoint_bin_collection_model.py:85(sample)
   935424    0.359    0.000    1.892    0.000 handle.py:801(value)
   224768    0.716    0.000    1.618    0.000 binary.py:97(__init__)
   133632    0.110    0.000    1.615    0.000 coverpoint_bin_single_range_model.py:45(sample)
   935424    1.533    0.000    1.533    0.000 {method 'get_signal_val_long' of 'cocotb.simulator.gpi_sim_hdl' objects}
    74240    0.140    0.000    1.514    0.000 coverage_points.py:109(<lambda>)
  1167881    0.736    0.000    1.234    0.000 handle.py:292(__getattr__)
   118920    0.755    0.000    1.210    0.000 stagemanager.py:55(set_stage_name)
     4609    0.006    0.000    1.070    0.000 decorators.py:257(_advance)
     4609    0.021    0.000    1.055    0.000 calc1_tb.py:105(test_cmds)
   224768    0.113    0.000    1.025    0.000 binary.py:291(integer)
    67849    0.172    0.000    0.997    0.000 scheduler.py:524(_resume_coro_upon)
   224768    0.164    0.000    0.911    0.000 binary.py:199(_convert_from_unsigned)
   224768    0.161    0.000    0.901    0.000 binary.py:144(assign)
   224768    0.627    0.000    0.748    0.000 binary.py:37(resolve)
     4610    0.004    0.000    0.687    0.000 __init__.py:133(fork)
mballance commented 2 years ago

Hi @walido78, this is interesting. I can certainly see the difference from your profiles. I set up a small testcase (see below). The strange thing is that this testcase shows the opposite: 64 wildcard bins in a single coverpoint are more efficient than 64 individual coverpoints. https://github.com/fvutils/pyvsc/blob/79e1b675fdf61f2c57f65d1a847f5de28c917936/ve/unit/test_coverage_wildcard_bins.py#L134-L233
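
(For readers who don't want to follow the link, here is a rough, hypothetical sketch of that kind of comparison in plain PyVSC; it is not the linked testcase itself, and the names, bit widths, and timing loop are made up for illustration:)

# Hypothetical comparison sketch: one coverpoint with per-bit wildcard bins
# vs. one single-bin coverpoint per bit, sampled directly (no cocotb).
import time
import vsc

@vsc.covergroup
class wildcard_cg(object):
    def __init__(self):
        self.with_sample(dict(val=vsc.bit_t(4)))
        # Single coverpoint, one wildcard bin per bit of 'val'.
        self.cp_val = vsc.coverpoint(self.val, bins={
            str(i): vsc.wildcard_bin("0b" + "x" * (3 - i) + "1" + "x" * i)
            for i in range(4)})

@vsc.covergroup
class per_bit_cg(object):
    def __init__(self):
        self.with_sample(dict(b0=vsc.bit_t(1), b1=vsc.bit_t(1),
                              b2=vsc.bit_t(1), b3=vsc.bit_t(1)))
        # One coverpoint (with a single bin) per bit.
        self.cp_b0 = vsc.coverpoint(self.b0, bins={"1": vsc.bin(1)})
        self.cp_b1 = vsc.coverpoint(self.b1, bins={"1": vsc.bin(1)})
        self.cp_b2 = vsc.coverpoint(self.b2, bins={"1": vsc.bin(1)})
        self.cp_b3 = vsc.coverpoint(self.b3, bins={"1": vsc.bin(1)})

cg_w = wildcard_cg()
cg_p = per_bit_cg()
N = 1000  # arbitrary repeat count, just to get measurable times

t0 = time.time()
for _ in range(N):
    for v in range(16):
        cg_w.sample(v)
print("wildcard-bin covergroup: %.3fs" % (time.time() - t0))

t0 = time.time()
for _ in range(N):
    for v in range(16):
        cg_p.sample((v >> 0) & 1, (v >> 1) & 1, (v >> 2) & 1, (v >> 3) & 1)
print("per-bit covergroups:     %.3fs" % (time.time() - t0))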

Looking more deeply at the profiles, it appears that accessing the whole value of addr in cocotb takes more time than accessing individual bits of addr. Note, for example, the calls to binary.py:37(resolve) and binary.py:396(binstr) that only appear in the 'wildcard' version of the test.
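
(As an illustration of the two access styles being compared, here is a hedged cocotb-style sketch; it assumes a DUT with a 64-bit adder.c signal, as in the original covergroup, and it only runs under a simulator:)

# Illustration only (requires a simulator and a DUT exposing adder.c).
# The full-vector read goes through get_signal_val_binstr/BinaryValue,
# while the single-bit read resolves through the cheaper integer path,
# mirroring the lambdas used in the two covergroups above.
import cocotb
from cocotb.triggers import Timer

@cocotb.test()
async def access_styles(dut):
    await Timer(1, units="ns")
    whole = int(dut.adder.c.value)   # full-vector read (the 'wildcard' coverpoint path)
    bit1 = int(dut.adder.c[1])       # single-bit read (the per-bit coverpoint path)
    dut._log.info("whole=%d bit1=%d" % (whole, bit1))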

I also see an area where PyVSC could help. From the profile, it appears that PyVSC fetches the coverpoint value every time a bin in the coverpoint is sampled. It should be possible for PyVSC to cache the sampled value and reuse it, which would at least minimize the overhead of cocotb fetching the full value of signals.
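
(Conceptually, the caching described here could look like the following; this is a generic sketch of the idea, not PyVSC's actual implementation:)

# Generic sketch of per-coverpoint value caching (not PyVSC internals):
# the target expression is evaluated once per sample() call, and every
# bin matches against that cached value instead of re-reading the signal.
class SketchCoverpoint:
    def __init__(self, get_val, bin_matchers):
        self.get_val = get_val            # e.g. lambda: int(uut.adder.c.value)
        self.bin_matchers = bin_matchers  # callables: value -> bool
        self.hits = [0] * len(bin_matchers)

    def sample(self):
        v = self.get_val()                # fetched exactly once per sample
        for i, match in enumerate(self.bin_matchers):
            if match(v):                  # every bin reuses the cached value
                self.hits[i] += 1

# Example: 4 'wildcard-like' bins, each checking a single bit of the value.
cp = SketchCoverpoint(lambda: 0b1010,
                      [lambda v, i=i: bool((v >> i) & 1) for i in range(4)])
cp.sample()
print(cp.hits)   # [0, 1, 0, 1]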

mballance commented 2 years ago

Hi @walido78, I've released a new version of PyVSC (0.7.6) that implements per-coverpoint caching of the coverpoint target-expression value. Previously, the target value would be computed each time a bin in the coverpoint was sampled. Now, the target value is sampled once per coverpoint regardless of how many bins are in that coverpoint. I'll be interested to see how the performance changes for you. Unless cocotb is >64x slower fetching the value of 'addr' vs fetching a single bit, your single-coverpoint version should be faster than the 64-coverpoint version now.
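
(To confirm which release is installed before re-profiling, a quick check such as the following should work, assuming the package was installed via pip under the distribution name 'pyvsc':)

# Print the installed PyVSC release (requires Python 3.8+ for importlib.metadata).
import importlib.metadata
print(importlib.metadata.version("pyvsc"))   # expect '0.7.6' or newer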

walido78 commented 2 years ago

Hi @mballance, I tried the new version and it's way faster than I expected! The profiled run takes 9 seconds instead of 27 seconds! Thank you very much.

Cyc 0015219: INFO     **************************************Profiling ****************************************
         12460588 function calls (12384113 primitive calls) in 9.334 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
65543/44550    0.085    0.000    8.114    0.000 scheduler.py:330(react)
    44548    0.474    0.000    8.045    0.000 scheduler.py:355(_event_loop)
71689/65543    0.248    0.000    7.174    0.000 scheduler.py:744(schedule)
71689/65543    0.065    0.000    5.735    0.000 decorators.py:137(_advance)
71689/65543    0.036    0.000    5.671    0.000 outcomes.py:35(send)
71689/65543    0.035    0.000    5.637    0.000 {method 'send' of 'coroutine' objects}
     7425    0.046    0.000    4.308    0.001 sampler.py:23(sampler)
    22272    0.069    0.000    3.884    0.000 coverage.py:114(sample)
44544/22272    0.193    0.000    3.747    0.000 covergroup_model.py:64(sample)
   103936    0.341    0.000    3.435    0.000 coverpoint_model.py:185(sample)
  1395712    0.284    0.000    1.799    0.000 coverpoint_model.py:225(get_val)
    51968    0.035    0.000    1.516    0.000 expr_ref_model.py:33(val)
   950272    0.663    0.000    1.373    0.000 coverpoint_bin_single_wildcard_model.py:27(sample)
   311808    0.246    0.000    1.212    0.000 coverpoint_bin_single_bag_model.py:75(sample)
   118920    0.750    0.000    1.202    0.000 stagemanager.py:55(set_stage_name)
     4609    0.007    0.000    0.994    0.000 decorators.py:257(_advance)
    67849    0.167    0.000    0.991    0.000 scheduler.py:524(_resume_coro_upon)
     4609    0.021    0.000    0.978    0.000 calc1_tb.py:105(test_cmds)
    61440    0.102    0.000    0.934    0.000 handle.py:718(value)
mballance commented 2 years ago

Excellent, @walido78! Thanks for sharing the updated results!