Closed: devos50 closed this issue 11 months ago.
Is this a crash or a log entry?
This is a log entry, not a hard crash, but it seems to be affecting download speeds (since the node might not be able to build the required circuits).
Here is a link to the run.
Just as a sanity check: this is not on testnet with an old community id?
No, they were running on the mainnet. Even if they were running on testnet, the testnet exit nodes run the same code.
Alright thanks. This does not seem like a trivial fix then.
Possibly related to the failures in these Tribler tests? https://github.com/Tribler/tribler/commit/0d332d80e52ae3839d2e57fd75c75b21ee7bd3db
@egbertbouman you have dibs on errors in the TunnelCommunity: do you want to investigate?
@qstokkink Sure, I'll have a look.
Thanks 👍
Download speeds on Windows 7, 64-bit (where this error did not show up):
Download speeds on Windows 10, 64-bit (where this error occurred):
This issue is a bit difficult to debug. I've set up a simple experiment on my local machine that creates and destroys 50,000 3-hop circuits, but it didn't produce a single error. When creating circuits on the live network I am able to reproduce this on my development machine, but it often takes 700+ circuit-creation attempts before the error triggers.
Despite the difficulties of trying to reproduce this, I do have a theory. I think it is caused by delayed `created`/`extended` messages that are received while another node is still being tried. Since `dh_secret`/`dh_first_part` now have different values, the MAC verification will fail with the same stacktrace as mentioned in this issue.
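The failure mode described above can be sketched with a toy MAC example. This uses Python's stdlib `hmac` as a stand-in for libnacl's `crypto_auth`/`crypto_auth_verify` (the real code derives the MAC key from the DH exchange; the key names here are illustrative, not the real protocol fields):

```python
import hashlib
import hmac

def auth(message: bytes, key: bytes) -> bytes:
    # Stand-in for libnacl.crypto_auth: a MAC over the message under a shared key.
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(tag: bytes, message: bytes, key: bytes) -> None:
    # Stand-in for libnacl.crypto_auth_verify: raises ValueError on mismatch.
    if not hmac.compare_digest(tag, auth(message, key)):
        raise ValueError('Failed to auth msg')

# The responder authenticates its reply using key material from the FIRST
# create attempt...
old_secret = b'key-material-from-first-attempt'
reply = b'responder-key-payload'
tag = auth(reply, old_secret)

# ...but by the time the delayed reply arrives, the initiator has retried and
# holds fresh key material, so verification fails just like in the stacktrace.
new_secret = b'key-material-from-retry-attempt'
try:
    verify(tag, reply, new_secret)
except ValueError as e:
    print(e)  # prints: Failed to auth msg
```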
We currently use the `circuit_id` to keep track of `created`/`extended` messages, which is why older `created`/`extended` messages are still being accepted. Unfortunately, there is no easy way to distinguish between old and new messages except by verifying the MAC. I'll address this the next time we switch to a new community ID by adding an extra identifier. For now, we should just log and drop `created`/`extended` messages if the MAC verification fails.
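The proposed log-and-drop handling could look roughly like this (a hypothetical handler sketch; the function name and signature are made up, not the real ipv8 API):

```python
import logging

logger = logging.getLogger(__name__)

def on_created_or_extended(circuit, payload, verify_mac) -> bool:
    # Hypothetical sketch: verify the MAC and drop stale created/extended
    # messages instead of letting the ValueError propagate as a stacktrace.
    # Returns True if the message was accepted.
    try:
        verify_mac(payload)
    except ValueError:
        # Most likely a delayed reply from an earlier create attempt reusing
        # the same circuit_id: log it and drop the message.
        logger.warning("Dropping created/extended message for circuit %s: "
                       "MAC verification failed",
                       getattr(circuit, 'circuit_id', '?'))
        return False
    # ... continue normal processing of the payload ...
    return True
```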
I'm not sure that this is related to the speed issue, though, because Tribler will already drop the messages, and any fix will be more about suppressing the stacktrace than actually fixing something. Also, note that there are runs that show speeds comparable to the successful run above but do have the stacktrace (e.g., run #1032 on Win7). To me, it looks more like the tunnels on our Win7 tester perform better than on Win10 for some reason.
Thanks for debugging! This is valuable information. If it's hard to reproduce, and you tested it on Windows 10 64-bit, I think it's fine to remove the `high priority` tag from this issue and look at it later. I made this high priority because we upgraded the entire build infrastructure to Python 3.8, and therefore also some critical libraries such as `libsodium`. This upgrade might cause unexpected behaviour on different platforms.
I'm not too worried about the occasional "Failed to auth msg" stacktrace. I've made a unittest to trigger a situation with delayed `created`/`extended` messages, and the resulting stacktrace looks exactly the same.
What does worry me is the frequency with which they happen on Win10 compared to Win7/8. For instance, if you look at run #1049 on Win10 compared to run #1039 on Win7. Both runs last about 11hrs, yet on Win10 we have 466 stacktraces and on Win7 we have only 14. Also, looking at the "circuit states" graph the Win10 tester seems to have major issues with creating circuits.
As you would expect with these kinds of issues, all seems well while running the release build on my Win10 machine.
Sounds like some other variable is at play; it would be nice if test runs produced hardware/OS/library artifacts.
Maybe something is not thread-safe: what are the core counts on the VM/hardware the tests were run on? Maybe the OS was not updated, or a library version differs.
We have been able to reproduce it on our Proxmox machines (which are Windows 10 64-bit machines virtualised with QEMU). The virtualisation layer could have caused this issue, but we haven't seen it in the wild.
Let me update this issue since it's actually not that high priority/blocker anymore.
I occasionally see the same traceback on my Windows 10 machine:
```
[tribler-core PID:18840] 2023-04-25 09:44:49,566 - ERROR <community:668> TriblerTunnelCommunity.on_packet_from_circuit(): Exception occurred while handling packet!
Traceback (most recent call last):
  File "C:\dev\py-ipv8\ipv8\messaging\anonymization\community.py", line 664, in on_packet_from_circuit
    result = handler(source_address, data, circuit_id)
  File "C:\dev\py-ipv8\ipv8\messaging\anonymization\community.py", line 42, in wrapper
    return func(self, source_address, payload, circuit_id)
  File "C:\dev\py-ipv8\ipv8\messaging\anonymization\community.py", line 745, in on_created
    self._ours_on_created_extended(circuit, payload)
  File "C:\dev\tribler\src\tribler\core\components\tunnel\community\tunnel_community.py", line 452, in _ours_on_created_extended
    super()._ours_on_created_extended(circuit, payload)
  File "C:\dev\py-ipv8\ipv8\messaging\anonymization\community.py", line 511, in _ours_on_created_extended
    shared_secret = self.crypto.verify_and_generate_shared_secret(hop.dh_secret, payload.key,
  File "C:\dev\py-ipv8\ipv8\messaging\anonymization\tunnelcrypto.py", line 46, in verify_and_generate_shared_secret
    libnacl.crypto_auth_verify(auth, dh_received, shared_secret[:32])
  File "C:\dev\tribler\venv\lib\site-packages\libnacl\__init__.py", line 905, in crypto_auth_verify
    raise ValueError('Failed to auth msg')
ValueError: Failed to auth msg
```
Most often, I do not see any such errors for a long time. But when the error does appear, it sometimes repeats fairly often. From this, it looks to me like it depends on which peers I'm currently connected to.
It seems that my previous fix of tracking `created`/`extended` messages using a separate identifier (as part of https://github.com/Tribler/py-ipv8/pull/901) has been partially reverted in IPv8 2.9. `RetryRequestCache` is now coupled again to a `circuit_id` instead of a random identifier. This may be why this error is back.
I do vaguely remember that even with my fix the issue could still occur (especially since the identifier is only 2 bytes), but it should happen a lot less frequently.
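The identifier-based tracking discussed above can be sketched as a toy cache (this is not the real `RetryRequestCache` API, just an illustration of keying pending requests by `(circuit_id, random 2-byte identifier)` so stale replies stop matching):

```python
import os
from typing import Dict, Tuple

class RetryCache:
    """Toy sketch: pending create/extend requests keyed by (circuit_id,
    random 2-byte identifier), so a delayed reply carrying a stale
    identifier no longer matches the current attempt."""

    def __init__(self) -> None:
        self._pending: Dict[Tuple[int, int], object] = {}

    def register(self, circuit_id: int, state: object) -> int:
        # Only 65,536 possible values, so collisions (and thus the occasional
        # MAC failure) remain possible, just far less frequently.
        identifier = int.from_bytes(os.urandom(2), 'big')
        self._pending[(circuit_id, identifier)] = state
        return identifier

    def pop(self, circuit_id: int, identifier: int):
        # A stale reply (carrying an old identifier) yields None and is dropped.
        return self._pending.pop((circuit_id, identifier), None)
```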
The partial revert happened because that was causing the (different) traceback of https://github.com/Tribler/py-ipv8/issues/1044.
Right now, I only see three solutions:
1. Tell the originator of the invalid circuit branch to remove their end of the circuit, because they are no longer selected for the circuit.
2. Figure out a different fix for https://github.com/Tribler/py-ipv8/issues/1044.
3. Accept that we may get ValueErrors logged.
I'm not sure which of these three options we should prefer. Currently we are using option 3, but I think there's something to be said for each of them.
My 2 cents on each of these solutions: we could add a check (`circuit_id in self.circuits`), like you originally suggested, and drop the message if the check fails; the cache will time out by itself.
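That guard could look like the following (a hypothetical helper, not the real ipv8 handler signature):

```python
def handle_created(circuit_id: int, payload: bytes, circuits: dict) -> bool:
    # Sketch of the suggested check: only process created/extended messages
    # for circuits we still track. A stale reply for a removed circuit is
    # dropped, and its request-cache entry simply times out on its own.
    if circuit_id not in circuits:
        return False  # dropped
    # ... verify the MAC and continue building the circuit ...
    return True
```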
Of course, one could argue that `remove_circuit` must clean the request-cache, but that could just as easily be done by iterating over the caches and doesn't really warrant switching to another cache identifier, in my opinion.

@egbertbouman, with https://github.com/Tribler/py-ipv8/pull/1204 in IPv8 2.12 being part of Tribler now, can we close this issue?
@qstokkink You're right, this one can be closed.
Sentry issue: TRIBLER-1JC
The latest release (7.13) is still using IPv8 2.10, according to `requirements-core.txt`, so it's expected for this error to still pop up on Sentry occasionally.
Operating System and version: Windows 10, 64-bit
Traceback and/or steps to reproduce
I got this stack trace when running the Tribler application tester on Windows 10, 64-bit (mainnet).