Open luchs3 opened 6 years ago
Is the client or the server using the cpu? What server? What kind of data (int, float)?
The client uses the CPU. It is a NI cRIO (Intel Atom processor). It creates about 150k values per second on different nodes and serves them as history data. It's all Float.
I meant what type of OPC UA server... Maybe try to run your program with timeit: https://docs.python.org/3/library/timeit.html Then we can see what method is using all that CPU.
It's an additional Labview module (http://sine.ni.com/nips/cds/view/p/lang/de/nid/215329) How do you want me to use timeit?
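As an illustration (not from the original thread), a minimal timeit sketch; `read_values` here is a hypothetical stand-in for the actual OPC UA read call:

```python
import timeit

def read_values():
    # Hypothetical stand-in for the real OPC UA read of ~20k floats;
    # replace with e.g. your node.get_value() call.
    return [float(i) for i in range(20000)]

# Total wall time over 10 runs of the workload
elapsed = timeit.timeit(read_values, number=10)
print("10 runs: %.3f s (%.4f s per run)" % (elapsed, elapsed / 10))
```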
I executed the script with cProfile.
I guess the for loop in uaprotocol_auto.py at line 9410 causes the CPU usage... or the binary conversion in uatypes.py at line 1141.
Are you trying to get 20k values from one node, and you have 20 nodes (400k values)? Or are you saying it's slow getting 1k values from a single node.
Have you tried running your python client in PyPy?
Here are the stats sorted by time used. Hope this helps. I don't know where the zipimporter comes from. Btw, I use Python 3.6.
@zerox1212 Not every node produces 20k but I would end up with about 200k values per second. I know there are methods with less overhead, but the NI cRio handles the data quite easily with OPC UA and I like the method. So if this problem could be solved, it would be possible to transport sensor data at a quite high sample rate (Timestamp can go down to one microsecond).
Looks like the binary parsing is taking time... Need to have a closer look... Might be hard to fix
Are you using the pip release of python opcua or the latest master from Github? If I were you I would still try running it on PyPy and see if you get more performance.
Could it be a result of your binary changes @oroulet?
I tried the latest one and it makes no difference. For now I have failed to run it on PyPy but will keep trying. In any case it seems to use more resources than it should, compared to the Labview implementation. I'd like to help, but the binary section is a little too advanced for my level.
The latest one from master? The code has changed there, but there is no reason for it to be faster... it should be a little slower.
In fact it is slower. Python3.6 with 0.90.3 takes about 0.8 sec and Python2.7 with 0.90.4 takes about 2 sec. Don't know how much effect the different Python versions have, but that's quite a difference.
Interesting, I knew it could be slower but couldn't find a case that would show it. Looks like you found one... I will have a look but I am very, very busy currently...
Why can't you run it on PyPy?
Do you have the same issue just reading a value? myvar.get_value() How long does it take? on my pc it seems to take almost a second to read a variable with 20000 floats
Sorry, 0.7 seconds to read an array of 200,000; 7 seconds to read 2,000,000.
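For a quick client-side measurement, a small timing helper like the following could be used (a sketch; the `myvar.get_value()` call shown in the comment is the thread's API, the stand-in workload is hypothetical):

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - t0

# With a connected python-opcua client this would be, e.g.:
#     values, dt = timed(myvar.get_value)
# Stand-in workload so the snippet runs on its own:
values, dt = timed(lambda: [0.0] * 200000)
print(len(values), "values in %.4f s" % dt)
```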
If I read a single history value (float array with 65k values), it takes 0.07 sec with Python 3.6 and 0.90.3, and 0.167 sec with Python 2.7 and 0.90.4. For now I can't try more values because of a timeout error, but I guess it won't be much slower.
Actually I found the limit of 65536 elements. So I think this is a limit set by the specification?
Can you record a session with Wireshark? I would like to see what is received in order to see if it is possible to increase speed. But decoding of UA is complex, so we may be reaching the limits of a pure Python implementation...
Can you test master now? Just merged a performance pr
Oh great! I wanted to run a wireshark session in a bit. Do you still need it? I'll test it in a minute.
Ok, so now it takes about 4.5 seconds.
Can you be more specific? Is this better or has it all become slower (4.5 seconds per node?) I made the PR for a server-side performance issue, but didn't check for client-side.
Ok sorry, before it took about 2 sec to gather 65k history values of one node on Python 2.7 (0.90.4). Now with the same setting it takes about 4.5 sec, so more than double the time.
Either this is due to the changes between 0.90.4 and master -or- due to my PR. Could you check? Thanks!
There is something strange here. Someone will need to make a better way to test this...
Imho the following elements play a role in this discussion:
- This issue is about client-side binary tcp -> ua conversion; the PR above is about performance of restoring ua objects from binary data from the SQL history (server). It could be related if unpacking has the same type of bottleneck.
- There may be a performance drop from 0.90.3 to 0.90.4 (I have the impression it also exists server-side). Performance here means either response time of the server->client link or CPU usage.
- Performance of trollius (Py2) vs native asyncio (Py3) may also affect the TCP handling at server-side.
Edit: I can confirm that the performance loss (for my test case 24sec -> 39sec = 60% slower) appears at commit e1067baccafcbd0e8c711f6ade0162c792c0e623
Edit2: It seems that unpack_uatype is to be blamed. I think the new switch-like operation (https://github.com/FreeOpcUa/python-opcua/pull/490) must become on par with the original x.from_binary().
> I think the new switch-like operation (#490) must become on par with the original x.from_binary().
The new API/functionality makes creating custom structures much easier and removes a lot of code, so I would really like to keep it. But trying to find out how to improve performance is interesting...
import cProfile
from opcua import ua
from opcua.ua.ua_binary import struct_to_binary, struct_from_binary
from opcua.common.utils import Buffer
class MyClass(object):
    ua_types = [
        ('st', 'String'),
        ('stl', 'ListOfString'),
        ('u32l', 'ListOfUInt32'),
    ]
    st = "mystring"
    stl = [str(i) for i in range(80000)]
    u32l = [i for i in range(80000)]

if __name__ == "__main__":
    m = MyClass()
    cProfile.run('b = struct_to_binary(m)')
    cProfile.run('r = struct_from_binary(MyClass, Buffer(b))')
$ python binary_tests.py
2720045 function calls in 0.890 seconds
   Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.890    0.890
3040046 function calls (2880046 primitive calls) in 1.225 seconds
   Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.225    1.225
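When reading such cProfile output, sorting by tottime (instead of standard name) surfaces the hot spots directly; a self-contained sketch using a stand-in workload:

```python
import cProfile
import io
import pstats

pr = cProfile.Profile()
pr.enable()
data = [str(i) for i in range(80000)]  # stand-in for the pack/unpack workload
pr.disable()

# Print the five functions with the most time spent in their own body
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("tottime").print_stats(5)
report = buf.getvalue()
print(report)
```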
@brubbel I am surprised that this code does not seem to be improved by your patch, but I have not looked in detail yet.
The code of the PR is not called in your test. I'll check.
@brubbel no it is not called, we should also optimize from_binary function. I tried to merge these methods but never found a good solution...
Unpacking is now down to 0.457 seconds (almost 1/3) after https://github.com/FreeOpcUa/python-opcua/commit/e6b4d48c17947b20fa395c66b10e544f65b60e74. It might be possible to do the same for packing...
improve packing e6b4d48c17947b20fa395c66b10e544f65b60e74
@luchs3 can you test again with master?
Nice! I have the impression that there is a huge improvement in responsiveness since https://github.com/FreeOpcUa/python-opcua/pull/490 For my test case, it is down from 39sec->8sec.
I also found that caching in unpack_uatype() even brings it down to 6sec. However, I am not yet fully confident about the implications of this 'hack':
_unpack_uatype_cache = {}

def unpack_uatype(vtype, data):
    try:
        # Fast path: apply the previously cached unpack routine directly.
        # Catch only KeyError so errors from the handler itself propagate.
        return _unpack_uatype_cache[id(vtype)](data)
    except KeyError:
        assert len(_unpack_uatype_cache) < 100
        print("cache miss")
    if hasattr(Primitives, vtype.name):
        st = getattr(Primitives, vtype.name)
        _unpack_uatype_cache[id(vtype)] = st.unpack
        return st.unpack(data)
    elif vtype.value > 25:
        _unpack_uatype_cache[id(vtype)] = Primitives.Bytes.unpack
        return Primitives.Bytes.unpack(data)
    elif vtype == ua.VariantType.ExtensionObject:
        _unpack_uatype_cache[id(vtype)] = extensionobject_from_binary
        return extensionobject_from_binary(data)
    elif vtype in (ua.VariantType.NodeId, ua.VariantType.ExpandedNodeId):
        _unpack_uatype_cache[id(vtype)] = nodeid_from_binary
        return nodeid_from_binary(data)
    elif vtype == ua.VariantType.Variant:
        _unpack_uatype_cache[id(vtype)] = variant_from_binary
        return variant_from_binary(data)
    else:
        if hasattr(ua, vtype.name):
            klass = getattr(ua, vtype.name)
            _unpack_uatype_cache[id(vtype)] = lambda data: struct_from_binary(klass, data)
            return struct_from_binary(klass, data)
        else:
            raise UaError("Cannot unpack unknown variant type {0!s}".format(vtype))
Left: https://github.com/FreeOpcUa/python-opcua/pull/490, right: current with caching test for history unpacking.
@brubbel I have a hard time seeing how caching could help in real-life situations. Data changes all the time.
It is the unpack method that is being cached for each vtype object, not the data.
When unpack_uatype is called, there is a single dict lookup for known vtypes (using the vtype hash) and the previously cached unpack routine is applied in a single step, without going through the whole if-else decision tree again.
At startup, I see 15 cache misses, after that the remainder of the unpack_uatype() is never touched again.
The only thing that makes this hack 'not ready for production' is that there must be a check for unbounded growth of _unpack_uatype_cache: if some vtype instance were created over and over, the dict would grow indefinitely. But from what I can see now, this is not the case.
For example, say id(ua.VariantType.NodeId) == 45353453. Then _unpack_uatype_cache[45353453] returns a reference to nodeid_from_binary, and _unpack_uatype_cache[45353453](data) is essentially the same as:

    if ...:
        ...
    elif vtype == ua.VariantType.NodeId:
        return nodeid_from_binary(data)
    elif ...:
        ...

but considerably faster.
In CPython, id(vtype) is the address of the object in memory, so it is guaranteed that there are no collisions. Since this code snippet is an essential part of history retrieval, I think its performance should be tuned to the maximum.
@brubbel I completely agree we can try to speed up encoding/decoding. Btw, what program are you using for the screenshot above? Any suggestions on how to better identify where time is spent in encoding/decoding?
cProfile -> pyprof2calltree -> KCachegrind
one-liner: pyprof2calltree -k -i foobar.prof
I have been profiling the trollius/asyncio event loop for this (opcua.common.utils). Then let, for example, uaExpert pull a bunch of data from history and exit the server to dump the profile stats:
import cProfile
<....>
    def run(self):
        self.logger.debug("Starting subscription thread")
        self.loop = asyncio.new_event_loop()
        asyncio.set_event_loop(self.loop)
        with self._cond:
            self._cond.notify_all()
        # profile start
        if True:
            prof = cProfile.Profile()
            prof.runcall(self.loop.run_forever)
            CID = threading.current_thread().ident
            prof.dump_stats('{:s}_{:d}.prof'.format('event_loop', CID))
            # profile end
        else:
            self.loop.run_forever()
        self.logger.debug("subscription thread ended")
More info: https://julien.danjou.info/blog/2015/guide-to-python-profiling-cprofile-concrete-case-carbonara
@brubbel Thanks. Looks like you thought a lot about it ;-)
Thanks for your effort. I have now tested the different versions, but it's just slightly faster than before. The same request takes about 4.5 s for 65k history values on Python 3.6. The time doubles with the changes of 6581f2c; before that it is much faster.
@luchs3 what do you mean? #6581f2c has nothing to do with performance. Are you sure it really changes something? What is the status now at master? and compared to 90.3? and 90.4?
Oh, sorry I mean #e1067ba. With this release, it takes 4.5s. With the current version it is slightly faster (4.35s)
and with .90.3? before https://github.com/FreeOpcUa/python-opcua/commit/e1067baccafcbd0e8c711f6ade0162c792c0e623
Before it's at about 2s
Can you show us the client code? And record a Wireshark session so we know for sure what is transmitted.
Sure
Client code:
import time
import datetime
from opcua import Client
from opcua import ua

if __name__ == "__main__":
    client = Client("opc.tcp://192.168.1.112:49580")
    try:
        client.connect()
        print("connected")
        var1 = client.get_node("ns=2;s=Input.IEPE11")
        for x in range(0, 10):
            a = time.time()
            dt_now = datetime.datetime.utcnow()
            endtime = dt_now + datetime.timedelta(seconds=500)
            starttime = dt_now - datetime.timedelta(seconds=500)
            result = var1.read_raw_history(endtime, starttime)
            b = time.time()
            print(b - a)
            print(len(result))
    finally:
        try:
            client.disconnect()
        except Exception as ex:
            print("Error2")
Wireshark session: session01.pcapng.gz
Do you already have a clue where the problem could lie?
Looks like I missed the last update. Will have a quick look when I get two minutes.
I tried to look at the Wireshark session, but for some reason Wireshark does not manage to reassemble the data for the read response. You might also be hitting some size limits in Wireshark...
Hi, I am trying to request about 20k values per second from each of 20 nodes. But it takes nearly 1 sec to get that amount of values from a single node, at 90% CPU usage. I don't even manipulate the data. Does anyone have an idea where the problem could be?