FreeOpcUa / python-opcua

LGPL Pure Python OPC-UA Client and Server
http://freeopcua.github.io/
GNU Lesser General Public License v3.0

History Data Performance #514

Open luchs3 opened 6 years ago

luchs3 commented 6 years ago

Hi, I'm trying to request about 20k values per second from 20 nodes, but it takes nearly 1 second to get this amount of values from one node, with 90% CPU usage. I don't even manipulate the data. Does anyone have an idea where the problem could be?


import datetime
from opcua import Client

client = Client("opc.tcp://192.168.1.12:49580")
client.connect()
var1 = client.get_node("ns=2;s=Input.IEPE11")
dt_now = datetime.datetime.utcnow()
endtime = dt_now + datetime.timedelta(seconds=30)
starttime = dt_now - datetime.timedelta(seconds=0)
result = var1.read_raw_history(starttime, endtime)
oroulet commented 6 years ago

Is the client or the server using the CPU? What server? What kind of data (int, float)?

luchs3 commented 6 years ago

The client uses the CPU. It is an NI cRIO (Intel Atom processor). It creates about 150k values per second on different nodes and serves them as history data. It's all Float.

oroulet commented 6 years ago

What type of OPC UA server, I meant... Maybe try to run your program with timeit: https://docs.python.org/3/library/timeit.html Then we can see which method is using all that CPU.
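
For illustration, a minimal timing sketch along those lines (not from the thread; it assumes the var1, starttime and endtime from the first snippet, and the globals= argument needs Python 3.5+):

import timeit

# Time ten history reads of the same range; var1, starttime and endtime
# are assumed to be set up as in the first snippet.
elapsed = timeit.timeit(
    "var1.read_raw_history(starttime, endtime)",
    globals=globals(),
    number=10,
)
print("average per read: %.3f s" % (elapsed / 10))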

luchs3 commented 6 years ago

It's an additional LabVIEW module (http://sine.ni.com/nips/cds/view/p/lang/de/nid/215329). How do you want me to use timeit?

luchs3 commented 6 years ago

I executed the script with cProfile.

stat.txt

I guess the for loop in uaprotocol_auto.py at line 9410 causes the CPU usage... or the binary conversion in uatypes.py at line 1141.
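
For reference, one way to drill into such a profile with pstats (a sketch; it assumes the profile was saved in binary form with cProfile's dump_stats, here under a hypothetical name stat.prof):

import pstats

# Load the binary profile dump and list the hot spots.
stats = pstats.Stats("stat.prof")
stats.sort_stats("tottime").print_stats(15)  # top 15 functions by own time
stats.print_callers("unpack")                # callers of the unpack routines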

zerox1212 commented 6 years ago

Are you trying to get 20k values from one node, and you have 20 nodes (400k values)? Or are you saying it's slow getting 1k values from a single node?

Have you tried running your python client in PyPy?

luchs3 commented 6 years ago

Here is the stat sorted by time used. Hope this helps. I don't know where the zipimporter comes from. By the way, I use Python 3.6.

stat_sorted.txt

@zerox1212 Not every node produces 20k, but I would end up with about 200k values per second. I know there are methods with less overhead, but the NI cRIO handles the data quite easily with OPC UA and I like the method. So if this problem could be solved, it would be possible to transport sensor data at quite a high sample rate (timestamps can go down to one microsecond).

oroulet commented 6 years ago

Looks like the binary parsing is taking time... Need to have a closer look... Might be hard to fix

luchs3 commented 6 years ago

I ran the profiling over 200 loops. It might help. stat2.txt

zerox1212 commented 6 years ago

Are you using the pip release of python-opcua or the latest master from GitHub? If I were you I would still try running it on PyPy and see if you get more performance.

Could it be a result of your binary changes @oroulet?

luchs3 commented 6 years ago

I tried the latest one and it makes no difference. So far I have failed to run it on PyPy, but I'll keep trying. In any case, it seems to use more resources than it should, compared to the LabVIEW implementation. I'd like to help, but the binary section is a bit above my level.

oroulet commented 6 years ago

The latest one from master? The code has changed there, but there is no reason for it to be faster... it should be a little slower.

luchs3 commented 6 years ago

In fact it is slower. Python 3.6 with 0.90.3 takes about 0.8 sec and Python 2.7 with 0.90.4 takes about 2 sec. I don't know how much effect the different Python versions have, but that's quite a difference.

oroulet commented 6 years ago

Interesting. I knew it could be slower but couldn't find a case that showed it. Looks like you found one... I will have a look, but I am very busy at the moment...

zerox1212 commented 6 years ago

Why can't you run it on PyPy?

oroulet commented 6 years ago

Do you have the same issue just reading a value (myvar.get_value())? How long does it take? On my PC it seems to take almost a second to read a variable with 20000 floats.
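
A minimal sketch of that check (reusing var1 from the client snippet above):

import time

t0 = time.time()
value = var1.get_value()  # one plain read of the Value attribute
print("get_value: %.3f s for %d elements" % (time.time() - t0, len(value)))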

oroulet commented 6 years ago

Sorry, 0.7 seconds to read an array of 200,000; 7 seconds to read 2,000,000.

luchs3 commented 6 years ago

If I read a single history value (a float array with 65k values), it takes 0.07 sec with Python 3.6 and 0.90.3, and 0.167 sec with Python 2.7 and 0.90.4. For now I can't try more values because of a timeout error, but I guess it won't be much slower.

luchs3 commented 6 years ago

Actually, I found a limit of 65536 elements. So I think this is a limit set by the specification?
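
If the cap is per response, one workaround would be to page through the history manually. A rough sketch (my addition, not from the thread; it assumes var1, starttime and endtime as above and that results come back oldest-first):

import datetime

results = []
chunk_start = starttime
while chunk_start < endtime:
    # numvalues asks the server for at most this many values per read
    chunk = var1.read_raw_history(chunk_start, endtime, numvalues=65536)
    if not chunk:
        break
    results.extend(chunk)
    # continue just past the newest timestamp received so far
    chunk_start = chunk[-1].SourceTimestamp + datetime.timedelta(microseconds=1)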

oroulet commented 6 years ago

Can you record a session with Wireshark? I would like to see what is received, in order to see whether it is possible to increase speed. But decoding of UA is complex, so we may be reaching the limits of a pure Python implementation...

oroulet commented 6 years ago

Can you test master now? I just merged a performance PR.

luchs3 commented 6 years ago

Oh great! I wanted to run a Wireshark session in a bit. Do you still need it? I'll test it in a minute.

luchs3 commented 6 years ago

Ok, so now it takes about 4.5 seconds.

brubbel commented 6 years ago

Can you be more specific? Is this better, or has it all become slower (4.5 seconds per node)? I made the PR for a server-side performance issue, but didn't check the client side.

luchs3 commented 6 years ago

OK, sorry. Before, it took about 2 sec to gather 65k history values of one node with Python 2.7 (0.90.4). Now, with the same setting, it takes about 4.5 sec. So it takes more than double the time.

brubbel commented 6 years ago

Either this is due to the changes between 0.90.4 and master, or due to my PR. Could you check? Thanks!

oroulet commented 6 years ago

There is something strange here. Someone will need to come up with a better way to test this...

brubbel commented 6 years ago

Imho the following elements play a role in this discussion:

  1. This issue is about client-side binary TCP -> UA conversion; the above PR is about the performance of restoring UA objects from binary data from the SQL history (server). It could be related if the unpacking has the same type of bottleneck.

  2. There may be a performance drop from 0.90.3 to 0.90.4 (I have the impression it also exists at the server side). Performance here means either the response time of the server->client link or the CPU usage.

  3. Performance of trollius (Py2) vs native asyncio (Py3) may also affect the TCP handling at the server side.

Edit: I can confirm that the performance loss (for my test case 24 sec -> 39 sec = 60% slower) appears at commit e1067baccafcbd0e8c711f6ade0162c792c0e623

Edit2: It seems that unpack_uatype is to blame. I think the new switch-like operation (https://github.com/FreeOpcUa/python-opcua/pull/490) must become on par with the original x.from_binary().

oroulet commented 6 years ago

I think the new switch-like operation (#490) must become on par with the original x.from_binary().

The new API/functionality makes creating custom structures much, much easier and removes a lot of code, so I would really like to keep it. But trying to find out how to improve performance is interesting...

oroulet commented 6 years ago

import cProfile

from opcua import ua
from opcua.ua.ua_binary import struct_to_binary, struct_from_binary
from opcua.common.utils import Buffer

class MyClass(object):
    ua_types = [
        ('st', 'String'),
        ('stl', 'ListOfString'),
        ('u32l', 'ListOfUInt32'),
               ]
    st = "mystring"
    stl = [str(i) for i in range(80000)]
    u32l = [i for i in range(80000)]

if __name__ == "__main__":
    m = MyClass()
    cProfile.run('b = struct_to_binary(m)')
    cProfile.run('r = struct_from_binary(MyClass, Buffer(b))')

python binary_tests.py

         2720045 function calls in 0.890 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.890    0.890 <string>:1(<module>)
   320002    0.033    0.000    0.033    0.000 enum.py:592(name)
   320002    0.109    0.000    0.141    0.000 types.py:135(get)
   160003    0.044    0.000    0.074    0.000 ua_binary.py:127(pack)
   160001    0.170    0.000    0.534    0.000 ua_binary.py:159(pack_uatype)
        1    0.002    0.002    0.890    0.890 ua_binary.py:210(struct_to_binary)
   160001    0.166    0.000    0.816    0.000 ua_binary.py:233(to_binary)
        2    0.052    0.026    0.888    0.444 ua_binary.py:255(list_to_binary)
    80001    0.075    0.000    0.145    0.000 ua_binary.py:49(pack)
   160003    0.030    0.000    0.030    0.000 {built-in method _struct.pack}
        1    0.000    0.000    0.890    0.890 {built-in method builtins.exec}
   320005    0.055    0.000    0.055    0.000 {built-in method builtins.getattr}
   320003    0.060    0.000    0.060    0.000 {built-in method builtins.hasattr}
   240002    0.028    0.000    0.028    0.000 {built-in method builtins.isinstance}
    80003    0.008    0.000    0.008    0.000 {built-in method builtins.len}
   160005    0.013    0.000    0.013    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    80001    0.015    0.000    0.015    0.000 {method 'encode' of 'str' objects}
        3    0.008    0.003    0.008    0.003 {method 'join' of 'bytes' objects}
   160004    0.025    0.000    0.025    0.000 {method 'startswith' of 'str' objects}

         3040046 function calls (2880046 primitive calls) in 1.225 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.225    1.225 <string>:1(<module>)
   320002    0.037    0.000    0.037    0.000 enum.py:592(name)
   320002    0.113    0.000    0.149    0.000 types.py:135(get)
   160003    0.107    0.000    0.218    0.000 ua_binary.py:130(unpack)
   160001    0.182    0.000    0.753    0.000 ua_binary.py:174(unpack_uatype)
 160003/3    0.296    0.000    1.225    0.408 ua_binary.py:433(from_binary)
        1    0.000    0.000    1.225    1.225 ua_binary.py:452(struct_from_binary)
    80001    0.053    0.000    0.274    0.000 ua_binary.py:58(unpack)
    80001    0.052    0.000    0.204    0.000 ua_binary.py:74(unpack)
        1    0.000    0.000    0.000    0.000 utils.py:43(__init__)
   240004    0.122    0.000    0.122    0.000 utils.py:60(read)
   160003    0.028    0.000    0.028    0.000 {built-in method _struct.unpack}
        1    0.000    0.000    1.225    1.225 {built-in method builtins.exec}
   320002    0.057    0.000    0.057    0.000 {built-in method builtins.getattr}
   320005    0.088    0.000    0.088    0.000 {built-in method builtins.hasattr}
   320005    0.035    0.000    0.035    0.000 {built-in method builtins.isinstance}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        3    0.000    0.000    0.000    0.000 {built-in method builtins.setattr}
   160000    0.013    0.000    0.013    0.000 {method 'append' of 'list' objects}
    80001    0.018    0.000    0.018    0.000 {method 'decode' of 'bytes' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
   160003    0.025    0.000    0.025    0.000 {method 'startswith' of 'str' objects}

@brubbel I am surprised that this code does not seem to be improved by your patch, but I have not looked at it in detail yet.

brubbel commented 6 years ago

The code of the PR is not called in your test. I'll check.

oroulet commented 6 years ago

@brubbel No, it is not called; we should also optimize the from_binary function. I tried to merge these methods but never found a good solution...

oroulet commented 6 years ago

Unpacking is now down to 0.457 seconds (almost a third of the time) after https://github.com/FreeOpcUa/python-opcua/commit/e6b4d48c17947b20fa395c66b10e544f65b60e74. It might be possible to do the same for packing...

oroulet commented 6 years ago

Improved packing in e6b4d48c17947b20fa395c66b10e544f65b60e74.

@luchs3 can you test again with master?

brubbel commented 6 years ago

Nice! I have the impression that there is a huge improvement in responsiveness since https://github.com/FreeOpcUa/python-opcua/pull/490. For my test case, it is down from 39 sec to 8 sec.

I also found that caching in unpack_uatype() brings it down even further, to 6 sec. However, I am not yet fully confident about the implications of this 'hack':

# Inside opcua/ua/ua_binary.py; Primitives, ua, extensionobject_from_binary,
# nodeid_from_binary, variant_from_binary, struct_from_binary and UaError
# are the module's existing names.
_unpack_uatype_cache = {}

def unpack_uatype(vtype, data):
    try:
        # Fast path: apply the previously resolved unpack routine directly.
        return _unpack_uatype_cache[id(vtype)](data)
    except KeyError:
        # Guard against unbounded growth if vtype objects get re-created.
        assert len(_unpack_uatype_cache) < 100
        print("cache miss")

    if hasattr(Primitives, vtype.name):
        st = getattr(Primitives, vtype.name)
        _unpack_uatype_cache[id(vtype)] = st.unpack
        return st.unpack(data)
    elif vtype.value > 25:
        _unpack_uatype_cache[id(vtype)] = Primitives.Bytes.unpack
        return Primitives.Bytes.unpack(data)
    elif vtype == ua.VariantType.ExtensionObject:
        _unpack_uatype_cache[id(vtype)] = extensionobject_from_binary
        return extensionobject_from_binary(data)
    elif vtype in (ua.VariantType.NodeId, ua.VariantType.ExpandedNodeId):
        _unpack_uatype_cache[id(vtype)] = nodeid_from_binary
        return nodeid_from_binary(data)
    elif vtype == ua.VariantType.Variant:
        _unpack_uatype_cache[id(vtype)] = variant_from_binary
        return variant_from_binary(data)
    else:
        if hasattr(ua, vtype.name):
            klass = getattr(ua, vtype.name)
            _unpack_uatype_cache[id(vtype)] = lambda data: struct_from_binary(klass, data)
            return struct_from_binary(klass, data)
        else:
            raise UaError("Cannot unpack unknown variant type {0!s}".format(vtype))

Left: https://github.com/FreeOpcUa/python-opcua/pull/490; right: current with caching test for history unpacking. (screenshot attached)
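
A possible variant of the same idea, sketched against the same module-level names as above (Primitives, ua, nodeid_from_binary, ...) and untested against the codebase: since ua.VariantType members are enum singletons, the cache could be keyed on the member itself with functools.lru_cache, which avoids both the id() keying and the manual growth check.

import functools

@functools.lru_cache(maxsize=None)
def _resolve_unpacker(vtype):
    # Resolve the unpack routine once per VariantType member; the cache is
    # bounded by the number of enum members.
    if hasattr(Primitives, vtype.name):
        return getattr(Primitives, vtype.name).unpack
    elif vtype.value > 25:
        return Primitives.Bytes.unpack
    elif vtype == ua.VariantType.ExtensionObject:
        return extensionobject_from_binary
    elif vtype in (ua.VariantType.NodeId, ua.VariantType.ExpandedNodeId):
        return nodeid_from_binary
    elif vtype == ua.VariantType.Variant:
        return variant_from_binary
    elif hasattr(ua, vtype.name):
        klass = getattr(ua, vtype.name)
        return lambda data: struct_from_binary(klass, data)
    raise UaError("Cannot unpack unknown variant type {0!s}".format(vtype))

def unpack_uatype(vtype, data):
    return _resolve_unpacker(vtype)(data)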

oroulet commented 6 years ago

@brubbel I have a hard time seeing how caching could help in real-life situations. Data changes all the time.

brubbel commented 6 years ago

It is the unpack method that is being cached for each vtype object, not the data.

When unpack_uatype is called, there is a single dict lookup for known vtypes (keyed on the vtype's id), and the previously cached unpack routine is applied in one step, without going through the whole if-else decision tree again.

At startup, I see 15 cache misses; after that, the remainder of unpack_uatype() is never touched again.

The only thing that makes this hack 'not ready for production' is that there would have to be a check against unbounded growth of _unpack_uatype_cache: if some vtype instance were created over and over, the dict would grow indefinitely. From what I can see now, this is not the case.

For example, say id(ua.VariantType.NodeId) == 45353453. Then _unpack_uatype_cache[45353453] returns a reference to nodeid_from_binary, and _unpack_uatype_cache[45353453](data) is essentially the same as:

if ....
elif vtype == ua.VariantType.NodeId:
    return nodeid_from_binary(data)
elif ....

but considerably faster.

In CPython, id(vtype) is the address of the object in memory, so it is guaranteed that there are no collisions (as long as the cached objects stay alive). Since this code snippet is an essential part of history retrieval, I think its performance should be tweaked to the maximum.

oroulet commented 6 years ago

@brubbel I completely agree we can try to speed up encoding/decoding. By the way, what program are you using for the screenshot above? Any suggestions on how to better identify where time is spent in encoding/decoding?

brubbel commented 6 years ago

cProfile -> pyprof2calltree -> KCachegrind

one-liner: pyprof2calltree -k -i foobar.prof

I have been profiling the trollius/asyncio event loop for this (opcua.common.utils). Then I let, for example, UaExpert pull a bunch of data from history, and exited the server to dump the profile stats:

import cProfile
<....>
    def run(self):
        self.logger.debug("Starting subscription thread")
        self.loop = asyncio.new_event_loop()
        asyncio.set_event_loop(self.loop)
        with self._cond:
            self._cond.notify_all()
        if True:  # profiling enabled; flip to False to run normally
            # Run the event loop under cProfile and dump per-thread stats on exit
            prof = cProfile.Profile()
            prof.runcall(self.loop.run_forever)
            CID = threading.current_thread().ident
            prof.dump_stats('{:s}_{:d}.prof'.format('event_loop', CID))
        else:
            self.loop.run_forever()
        self.logger.debug("subscription thread ended")

More info: https://julien.danjou.info/blog/2015/guide-to-python-profiling-cprofile-concrete-case-carbonara

oroulet commented 6 years ago

@brubbel Thanks. Looks like you thought a lot about it ;-)

luchs3 commented 6 years ago

Thanks for your effort. I have now tested the different versions, but it's just slightly faster than before. The same request takes about 4.5 s for 65k history values with Python 3.6. The time doubles with the changes of 6581f2c; before that it is much faster.

oroulet commented 6 years ago

@luchs3 What do you mean? #6581f2c has nothing to do with performance. Are you sure it really changes something? What is the status now at master? And compared to 0.90.3? And 0.90.4?

luchs3 commented 6 years ago

Oh, sorry, I meant #e1067ba. With that revision it takes 4.5 s. With the current version it is slightly faster (4.35 s).

oroulet commented 6 years ago

And with 0.90.3, i.e. before https://github.com/FreeOpcUa/python-opcua/commit/e1067baccafcbd0e8c711f6ade0162c792c0e623?

luchs3 commented 6 years ago

Before that, it's at about 2 s.

oroulet commented 6 years ago

Can you show us the client code? And record a Wireshark session, so we know for sure what is transmitted?

luchs3 commented 6 years ago

Sure.

Client code:

import time
import datetime
from opcua import Client
from opcua import ua

if __name__ == "__main__":
    client = Client("opc.tcp://192.168.1.112:49580")

    try:
        client.connect()
        print("connected")
        var1 = client.get_node("ns=2;s=Input.IEPE11")

        for x in range(0, 10):
            a = time.time()

            dt_now = datetime.datetime.utcnow()
            endtime = dt_now + datetime.timedelta(seconds=500)
            starttime = dt_now - datetime.timedelta(seconds=500)

            # note: endtime is passed first here, reversed compared to the
            # first snippet
            result = var1.read_raw_history(endtime, starttime)

            b = time.time()
            print(b - a)
            print(len(result))

    finally:
        try:
            client.disconnect()
        except Exception as ex:
            print("Error on disconnect:", ex)

Wireshark session: session01.pcapng.gz

luchs3 commented 6 years ago

Do you already have a clue where the problem could lie?

oroulet commented 6 years ago

Looks like I missed the last update. Will have a quick look when I get two minutes.

oroulet commented 6 years ago

I tried to look at the Wireshark session, but for some reason Wireshark does not manage to reassemble the data for the read response. You might be exceeding some size limits in Wireshark too...