Open danieljfarrell opened 8 years ago
looks strange, I have never seen this....and it is hard to debug...
I have also witnessed this error, and it always seems to occur during or immediately after a renew of the secure channel (I used WireShark to verify). It is possible that the KeepAlive object is missing some thread synchronization?
After a year I downloaded new version of Kepware OPC server (5.21.112) and python-opcua and I'm getting the same error. (Sometimes I also get "Wrong token id {}, expected {}".) The real number is always higher than the expected one. And it happens quite often.
When I am able to look at it via Wireshark I'll do it (quite annoying on localhost), but right now I can "solve" it quite easily by switching back to one year old version of python-opcua.
Here you got the wireshark captures:
https://ulozto.net/!kifidDy3Orf5/seqnum-zip
Password is opcua123456
In the OK file there is only one thread (Channel1.EVA.WP1), in the NG (not good) file there are two threads (Channel1.EVA.WP1 and Channel1.EVA.WP2).
Client is 192.168.1.1 Server is 192.168.1.228
Interesting part from log of my app:
2016-10-31 16:45:00.131000 write: EVA1 PC_OperatorId = '0109E0C092'
2016-10-31 16:45:00,194 - opcua.client.ua_client.Socket - ERROR - Protocol Error Traceback (most recent call last): File "C:\Python27\lib\site-packages\opcua\client\ua_client.py", line 94, in _run self._receive() File "C:\Python27\lib\site-packages\opcua\client\ua_client.py", line 103, in _receive msg = self._connection.receive_from_socket(self._socket) File "C:\Python27\lib\site-packages\opcua\ua\uaprotocol_hand.py", line 667, in receive_from_socket return self.receive_from_header_and_body(header, utils.Buffer(body)) File "C:\Python27\lib\site-packages\opcua\ua\uaprotocol_hand.py", line 639, in receive_from_header_and_body return self._receive(chunk) File "C:\Python27\lib\site-packages\opcua\ua\uaprotocol_hand.py", line 670, in _receive self._check_incoming_chunk(msg) File "C:\Python27\lib\site-packages\opcua\ua\uaprotocol_hand.py", line 596, in _check_incoming_chunk self.channel.SecurityToken.TokenId)) UaError: Wrong token id 2, expected 1
2016-10-31 16:45:00,209 - opcua.client.ua_client.Socket - ERROR - Protocol Error Traceback (most recent call last): File "C:\Python27\lib\site-packages\opcua\client\ua_client.py", line 94, in _run self._receive() File "C:\Python27\lib\site-packages\opcua\client\ua_client.py", line 103, in _receive msg = self._connection.receive_from_socket(self._socket) File "C:\Python27\lib\site-packages\opcua\ua\uaprotocol_hand.py", line 667, in receive_from_socket return self.receive_from_header_and_body(header, utils.Buffer(body)) File "C:\Python27\lib\site-packages\opcua\ua\uaprotocol_hand.py", line 639, in receive_from_header_and_body return self._receive(chunk) File "C:\Python27\lib\site-packages\opcua\ua\uaprotocol_hand.py", line 670, in _receive self._check_incoming_chunk(msg) File "C:\Python27\lib\site-packages\opcua\ua\uaprotocol_hand.py", line 620, in _check_incoming_chunk .format(self._peer_sequence_number, num)) UaError: Wrong sequence 234 -> 236 (server bug or replay attack)
Rest of the log seems like opcua library got stuck on number 234.
Can you try to use git bisect to find out in what commit this bug was introduced
On Mon, Oct 31, 2016, 17:24 jk987 notifications@github.com wrote:
Here you got the wireshark captures:
https://ulozto.net/!kifidDy3Orf5/seqnum-zip
Password is opcua123456
In the OK file there is only one thread (Channel1.EVA.WP1), in the NG (not good) file there are two threads (Channel1.EVA.WP1 and Channel1.EVA.WP2).
Client is 192.168.1.1 Server is 192.168.1.228
Interesting part from log of my app:
2016-10-31 16:45:00.131000 write: EVA1 PC_OperatorId = '0109E0C092'
2016-10-31 16:45:00,194 - opcua.client.ua_client.Socket - ERROR - Protocol Error
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\opcua\client\ua_client.py", line 94, in _run self._receive() File "C:\Python27\lib\site-packages\opcua\client\ua_client.py", line 103, in _receive msg = self._connection.receive_from_socket(self._socket) File "C:\Python27\lib\site-packages\opcua\ua\uaprotocol_hand.py", line 667, in receive_from_socket return self.receive_from_header_and_body(header, utils.Buffer(body)) File "C:\Python27\lib\site-packages\opcua\ua\uaprotocol_hand.py", line 639, in receive_from_header_and_body return self._receive(chunk) File "C:\Python27\lib\site-packages\opcua\ua\uaprotocol_hand.py", line 670, in _receive self._check_incoming_chunk(msg) File "C:\Python27\lib\site-packages\opcua\ua\uaprotocol_hand.py", line 596, in _check_incoming_chunk self.channel.SecurityToken.TokenId)) UaError: Wrong token id 2, expected 1
2016-10-31 16:45:00,209 - opcua.client.ua_client.Socket - ERROR - Protocol Error
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\opcua\client\ua_client.py", line 94, in _run self._receive() File "C:\Python27\lib\site-packages\opcua\client\ua_client.py", line 103, in _receive msg = self._connection.receive_from_socket(self._socket) File "C:\Python27\lib\site-packages\opcua\ua\uaprotocol_hand.py", line 667, in receive_from_socket return self.receive_from_header_and_body(header, utils.Buffer(body)) File "C:\Python27\lib\site-packages\opcua\ua\uaprotocol_hand.py", line 639, in receive_from_header_and_body return self._receive(chunk) File "C:\Python27\lib\site-packages\opcua\ua\uaprotocol_hand.py", line 670, in _receive self._check_incoming_chunk(msg) File "C:\Python27\lib\site-packages\opcua\ua\uaprotocol_hand.py", line 620, in _check_incoming_chunk .format(self._peer_sequence_number, num)) UaError: Wrong sequence 234 -> 236 (server bug or replay attack)
Rest of the log seems like opcua library got stuck on number 234.
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/FreeOpcUa/python-opcua/issues/295#issuecomment-257342024, or mute the thread https://github.com/notifications/unsubscribe-auth/ACcfzjh0c_Rr4GzB9GZipT197NnZ1GoOks5q5hYfgaJpZM4Jr95A .
(Like I wrote you last year, I'm not very familiar with git. But I will try it.) :)
Meanwhile I'm looking to the NG file and when I use filter "tcp.stream eq 1 and opcua" then I can see following:
(1318 16:44:59.932052 token:1 sequence:183 request:183 ReadRequest)
(1324 16:45:00.072890 token:1 sequence:183 request:233 ReadResponse)
1328 16:45:00.130435 token:1 sequence:184 request:184 WriteRequest
1338 16:45:00.175583 token:- sequence:185 request:185 OpenSecureChannelRequest
1340 16:45:00.175927 token:- sequence:234 request:185 OpenSecureChannelResponse
1342 16:45:00.182040 token:2 sequence:235 request:184 WriteResponse
1344 16:45:00.202636 token:2 sequence:186 request:186 ReadRequest
1346 16:45:00.203045 token:2 sequence:236 request:186 ReadResponse
I am happy to help with the git problems. You have done a lot of detective work @jk987 👍 .
Great. That kind of bugs is very very hard to debug
On Mon, Oct 31, 2016, 23:11 Daniel notifications@github.com wrote:
I am happy to help with the git problems. You have done a lot of detective work @jk987 https://github.com/jk987 👍 .
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/FreeOpcUa/python-opcua/issues/295#issuecomment-257435849, or mute the thread https://github.com/notifications/unsubscribe-auth/ACcfzoPc5Kc3ZJAWHwN1FBG813LuEBHjks5q5md1gaJpZM4Jr95A .
@jk987 do you have Python code that demonstrates this bug? Using that to generate a pass/fail criteria I can write an automated script that uses git bisect to isolate the commit that introduced this bug.
Unfortunatelly, my code is quite a big mess. Then you also need proper OPC server and proper PLC to test this. But I'm quite interested how the skeleton of your automated script would look like.
(And more unfortunatelly, my boss gave more important job for the next couple of weeks.)
Anyway, I was looking into the history of commits and I think that finding the right one won't be that hard. :)
I will try to have a look. That kind of bugs are important. Can I download that server somewhere? Is this error random or does it happens at specific time interval?
Server is this one:
https://www.kepware.com/en-us/products/kepserverex/suites/mitsubishi-suite/
Demo (the same software but without licence) works for 2 hours, then you have to restart service, I'm using it this way. It has also some simulation mode but I have never tried it because I always have some PLC on the table.
In my app this error happens in about a minute after the start. That is fine for debugging.
So my manual "bisect" results:
2016-01-04 14:47: https://github.com/FreeOpcUa/python-opcua/tree/39e3a62edd11b471fafc7c2036c4438c04128f71 Works. (Because there isn't such a check?)
Following results are influenced by the fact that the error doesn't occur after 1 minute as I observered before, but after 3 or 5 minutes. So first I supposed everything is OK, but then I noticed if I wait longer then I also get error. :(
2016-01-03 20:41 (earlier date but later commit):
https://github.com/FreeOpcUa/python-opcua/tree/8c5104c85b656e4e915ad278b06efac7db06b79d
Works. (Unfortunately, that was my tip.) No, after 3 minutes wrong token id.
2016-03-29 12:43: https://github.com/FreeOpcUa/python-opcua/tree/d502e106d34d58cf5cbedffda85bcd25cbadecfd Wrong token id after 4 minutes.
2016-09-01 12:29:
https://github.com/FreeOpcUa/python-opcua/tree/3976fb13a7b5fc2e881f6a3fea2e8a60a9754eb3
Works. Doesn't work after 4 minutes.
2016-09-23 06:33
https://github.com/FreeOpcUa/python-opcua/commit/977f1f8bd412157e0f642288360d97c0b5966e6f
Works. Doesn't work (after 3 minutes).
2016-10-22 21:23: https://github.com/FreeOpcUa/python-opcua/tree/5d10d129d619f8cd071b1c886603b0dae275e657 Doesn't work (after 4 minutes).
going back, 2016-10-12 20:36: https://github.com/FreeOpcUa/python-opcua/tree/ae98a12d1dafd46b84e7e2c7c9f378403e18818d Doesn't work (after 1 minute)
So, we can see that the bug is there since the first time the check is there. And now how to get out of it? (Isn't the bug in Kepware? Does specification say anything about sending response for later request earlier?)
Ok that make sense we did not touch that code for a long time
On Sat, Nov 5, 2016, 18:11 jk987 notifications@github.com wrote:
So my manual "bisect" results:
2016-01-04 14:47:
https://github.com/FreeOpcUa/python-opcua/tree/39e3a62edd11b471fafc7c2036c4438c04128f71 Works. (Because there isn't such a check?)
Following results are influenced by the fact that the error doesn't occur after 1 minute as I observered before, but after 3 or 5 minutes. So first I supposed everything is OK, but then I noticed if I wait longer then I also get error. :(
2016-01-03 20:41 (earlier date but later commit):
https://github.com/FreeOpcUa/python-opcua/tree/8c5104c85b656e4e915ad278b06efac7db06b79d Works. (Unfortunately, that was my tip.) No, after 3 minutes wrong token id.
2016-03-29 12:43:
https://github.com/FreeOpcUa/python-opcua/tree/d502e106d34d58cf5cbedffda85bcd25cbadecfd Wrong token id after 4 minutes.
2016-09-01 12:29:
https://github.com/FreeOpcUa/python-opcua/tree/3976fb13a7b5fc2e881f6a3fea2e8a60a9754eb3 Works. Doesn't work after 4 minutes.
2016-09-23 06:33 977f1f8 https://github.com/FreeOpcUa/python-opcua/commit/977f1f8bd412157e0f642288360d97c0b5966e6f Works. Doesn't work (after 3 minutes).
2016-10-22 21:23:
https://github.com/FreeOpcUa/python-opcua/tree/5d10d129d619f8cd071b1c886603b0dae275e657 Doesn't work (after 4 minutes).
going back, 2016-10-12 20:36:
https://github.com/FreeOpcUa/python-opcua/tree/ae98a12d1dafd46b84e7e2c7c9f378403e18818d Doesn't work (after 1 minute)
So, we can see that the bug is there since the first time the check is there. And now how to get out of it? (Isn't the bug in Kepware? Does specification say anything about sending response for later request earlier?)
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/FreeOpcUa/python-opcua/issues/295#issuecomment-258625239, or mute the thread https://github.com/notifications/unsubscribe-auth/ACcfzt8-HOjD-y5tB0rtyicC1avAfE-Wks5q7LiqgaJpZM4Jr95A .
I tried to have a look at this. The error comes from line594 in uaprotocol_hand.py. I do not understand what is happening. but can you try to disable the check and see if it helps? it would be great if you could print the tokenit and expected token id around that line, then send me the output when you get an error. It looks like we receive a pakcet with the new token id before we send it.....
I'm getting closer so couple notices:
For testing it is very good to shorten period of calling OpenSecureChannel. Then the probability of collision is much higher (couple seconds after start).
... WriteRequest is sent. OpenSecureChannelRequest is sent. OpenSecureChannelResponse is received by UASocketClient._receive() then callback is called and UASocketClient.open_secure_channel() gets its response. But before the SecureConnection.set_channel() is called the other SecureConnection._check_incoming_chunk() with WriteResponse throws exception.
Looks like problem with swapping two threads. One thread is Keepalive. The second one is UASocketClient? Is it possible to block the thread with WriteResponse until the OpenSecureChannelResponse is completely handled? I guess it would solve the problem.
I just fixed a small bug in uaprotocol_hand.py.
AsymmetricAlgorithmHeader.SenderCertificate and .ReceiverCertificateThumbPrint are initialized as None, so __str__()
in that case throws exceptions.
Here you got my fix.
class AsymmetricAlgorithmHeader(uatypes.FrozenClass):
def __str__(self):
return "{}(SecurityPolicy:{}, certificatesize:{}, receiverCertificatesize:{} )".format(
self.__class__.__name__, self.SecurityPolicyURI,
None if self.SenderCertificate is None else len(self.SenderCertificate),
None if self.ReceiverCertificateThumbPrint is None else len(self.ReceiverCertificateThumbPrint))
Make a pr. I will try to have a look again tomorrow
On Sat, Nov 5, 2016, 22:10 jk987 notifications@github.com wrote:
I'm getting closer so couple notices:
It is very good to shorten period of calling OpenSecureChannel. Then the probability of collision is much higher (couple seconds after start).
... WriteRequest is sent. OpenSecureChannelRequest is sent. OpenSecureChannelResponse is received by UASocketClient._receive() then callback is called and UASocketClient.open_secure_channel() gets its response. But before the SecureConnection.set_channel() is called the other SecureConnection._check_incoming_chunk() with WriteResponse throws exception.
Looks like problem with swapping two threads. One thread is Keepalive. The second one is UASocketClient? Is it possible to block the thread with WriteResponse until the OpenSecureChannelResponse is completely handled? I guess it would solve the problem.
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/FreeOpcUa/python-opcua/issues/295#issuecomment-258640877, or mute the thread https://github.com/notifications/unsubscribe-auth/ACcfzjokDScTtYXWyl5Ls70ZRDCaswgVks5q7PDOgaJpZM4Jr95A .
Well, I still don't understant all the threads and classes there but here is my very dirty pig solution:
uaprotocol_hand.py:
from threading import Lock
...
class SecureConnection(object):
def __init__(self, security_policy):
...
self.lock_receive = Lock()
...
def receive_from_header_and_body(self, header, body):
...
if header.MessageType in (MessageType.SecureMessage,
MessageType.SecureOpen,
MessageType.SecureClose):
chunk = MessageChunk.from_header_and_body(self._security_policy,
header, body)
#return self._receive(chunk)
self.lock_receive.acquire()
msg = self._receive(chunk)
if not msg or header.MessageType != MessageType.SecureOpen:
self.lock_receive.release()
return msg
elif ...
ua_client.py:
class UASocketClient(object):
...
def open_secure_channel(self, params):
...
self._connection.set_channel(response.Parameters)
self._connection.lock_receive.release()
return response.Parameters
In human speech: If we get OpenSecureChannelResponse then we acquire a lock until we completely handle this response. Every other response must wait until we release the lock.
Surely it should be done better. This ugly thing will make bugs if you use the same library for server. Maybe that locking mechanism could be extended to all responses like "don't work on any other response until the previous one is completely handled".
good analysis! but fixing with lock is hard. there are many situations where you can end up in dead lock. We have had a similar issue ealier when we received publish reponses before we were finished to process subcsription data. I am trying to find out how we solved it, it was hard, we had to create queue of reponses to process later..... Would be great to avoid having to do something like that again in client
On Sun, 6 Nov 2016 at 00:43 jk987 notifications@github.com wrote:
Well, I still don't understant all the threads and classes there but here is my very dirty pig solution:
uaprotocol_hand.py:
from threading import Lock ... class SecureConnection(object):
def __init__(self, security_policy): ... self.lock_receive = Lock() ... def receive_from_header_and_body(self, header, body): ... if header.MessageType in (MessageType.SecureMessage, MessageType.SecureOpen, MessageType.SecureClose): chunk = MessageChunk.from_header_and_body(self._security_policy, header, body) #return self._receive(chunk) self.lock_receive.acquire() msg = self._receive(chunk) if not msg or header.MessageType != MessageType.SecureOpen: self.lock_receive.release() return msg elif ...
ua_client.py:
class UASocketClient(object): ... def open_secure_channel(self, params): ... self._connection.set_channel(response.Parameters) self._connection.lock_receive.release() return response.Parameters
In human speech: If we get OpenSecureChannelResponse then we acquire a lock until we completely handle this response. Every other response must wait until we release the lock.
Surely it should be done better. This ugly thing will make bugs if you use the same library for server. Maybe that locking mechanism could be extended to all responses like "don't work on any other response until the previous one is completely handled".
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/FreeOpcUa/python-opcua/issues/295#issuecomment-258649970, or mute the thread https://github.com/notifications/unsubscribe-auth/ACcfzorwfDxMm6MmbFppobGNxgwEhjMzks5q7RSxgaJpZM4Jr95A .
(Btw, issue #157 seems to be opposite situtation where newer TokenId is expected but older one still arrived.)
Remarks: I looked into old issues one year back. I found my old issue #48 where we solved some problem with header of ReadRequest and OpenSecureChannelRequest by a lock. Then I found my next old issue #46 where I had a problem with locks in Publish and CreateMonitoredItems. Both was in October last year. In January this year @huazh appeared in issue #103 and #109. He also noticed some problems with threads. But your discussion was too smart for me. I can see he made couple commits and disappeared.
yes but #157 should be solved. If we keep track of older token ids. It is harder to know token ids which are coming in the future........
On Sun, 6 Nov 2016 at 13:47 jk987 notifications@github.com wrote:
Remarks: I looked into old issues one year back. I found my old issue #48 https://github.com/FreeOpcUa/python-opcua/issues/48 where we solved some problem with header of ReadRequest and OpenSecureChannelRequest by a lock. Then I found my next old issue #46 https://github.com/FreeOpcUa/python-opcua/issues/46 where I had a problem with locks in Publish and CreateMonitoredItems. Both was in October last year. In January this year @huazh https://github.com/huazh appeared in issue
103 https://github.com/FreeOpcUa/python-opcua/issues/103 and #109
https://github.com/FreeOpcUa/python-opcua/issues/109. He also noticed some problems with threads. But your discussion was too smart for me. I can see he made couple commits and disappeared.
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/FreeOpcUa/python-opcua/issues/295#issuecomment-258678785, or mute the thread https://github.com/notifications/unsubscribe-auth/ACcfzqWqzT_tkZJk0HJyBSOdQLhymJ2kks5q7cxDgaJpZM4Jr95A .
@huazh https://github.com/huazh wanted to use asyncio instead of threads. This is a very good solution, but this is python3 only. the python2 hack is dying...
On Sun, 6 Nov 2016 at 16:10 Olivier Roulet-Dubonnet < olivier.roulet@gmail.com> wrote:
yes but #157 should be solved. If we keep track of older token ids. It is harder to know token ids which are coming in the future........
On Sun, 6 Nov 2016 at 13:47 jk987 notifications@github.com wrote:
Remarks: I looked into old issues one year back. I found my old issue #48 https://github.com/FreeOpcUa/python-opcua/issues/48 where we solved some problem with header of ReadRequest and OpenSecureChannelRequest by a lock. Then I found my next old issue #46 https://github.com/FreeOpcUa/python-opcua/issues/46 where I had a problem with locks in Publish and CreateMonitoredItems. Both was in October last year. In January this year @huazh https://github.com/huazh appeared in issue
103 https://github.com/FreeOpcUa/python-opcua/issues/103 and #109
https://github.com/FreeOpcUa/python-opcua/issues/109. He also noticed some problems with threads. But your discussion was too smart for me. I can see he made couple commits and disappeared.
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/FreeOpcUa/python-opcua/issues/295#issuecomment-258678785, or mute the thread https://github.com/notifications/unsubscribe-auth/ACcfzqWqzT_tkZJk0HJyBSOdQLhymJ2kks5q7cxDgaJpZM4Jr95A .
https://github.com/FreeOpcUa/python-opcua/commit/420f43f6034e6eef2faf60742d295db100600689
This is not really a fix, but at least we do not crash. Can you test this and tell me if it solves your issue.
Maybe it's time to split python opcua... leave one that has python2 support, but move the main library forward with python 3? Then e can use asyncio which is definitely the best solution.
@oroulet Shows warnings but works. :)
I think we will keep this as is for now. leave the bug report open so we remember to fix this, when we split tree for python3 only.
btw I suppose that you only see a few warnings at every secure connection renewal?
It was aproximately one warning per five or ten renewals. And only one warning at a moment. (When I was playing with it on Saturday I could see even funny situation when exception was thrown but the message text in exception already had actualized token ids.)
OK thanks. At least we have a clear understanding of the issue now. I think we are fine with the current solution, as long as we do not send a request with wrong token id, then the server might refuse our message... I guess if that can happen, we will hear about it soon...
Doesn't that fix allow client to accept any token? Does it raise a security issue? Just asking questions, I'm not that familiar with this part of the code.
a branch to test for anyone interested: #962
Please tell me if it helps, and send me output if you get a crash again
I am observing an intermittent error,
UaError: Wrong sequence 25336 -> 26016 (server bug or replay attack)
.The system will run fine for several hours then the following error will occur.
The setup: freeopcua client with a KEPware server. Please let me know what specific information would be helpful to include. I am using the latest source code directly from the master branch as of today.
Before updating to the master branch I was observing the same errors as reported here, https://github.com/FreeOpcUa/python-opcua/issues/201 and possibly this one too, https://github.com/FreeOpcUa/python-opcua/issues/157
The error I am observing is raised from the same function that was patched in #201 and #157, https://github.com/FreeOpcUa/python-opcua/blob/14a33c98b0bd67a9e0edfcca3e7d40e00e81e122/opcua/ua/uaprotocol_hand.py#L619
Please let me know how I can help and what you think could be the cause. I stress, this only occurs after running for several hours.