Closed: devos50 closed this issue 11 months ago.
Is this a crash or a log entry?
This is a log entry, not a hard crash, but it seems to be affecting download speeds (since the node might not be able to build the required circuits).
Here is a link to the run.
Just as a sanity check: this is not on testnet with an old community id?
No, they were running on the mainnet. Even if they were running on testnet, the testnet exit nodes run the same code.
Alright thanks. This does not seem like a trivial fix then.
Possibly related to the failures in these Tribler tests? https://github.com/Tribler/tribler/commit/0d332d80e52ae3839d2e57fd75c75b21ee7bd3db
@egbertbouman you have dibs on errors in the TunnelCommunity: do you want to investigate?
@qstokkink Sure, I'll have a look.
Thanks 👍
Download speeds on Windows 7, 64-bit (where this error did not show up):
Download speeds on Windows 10, 64-bit (where this error occurred):
This issue is a bit difficult to debug. I've set up a simple experiment on my local machine that creates and destroys 50,000 3-hop circuits, but it didn't produce a single error. When creating circuits on the live network I am able to reproduce this on my development machine, but it often takes 700+ circuit-creation attempts before the error triggers.
Despite the difficulties of trying to reproduce this, I do have a theory. I think it is caused by delayed `created`/`extended` messages that are received while another node is still being tried. Since `dh_secret`/`dh_first_part` now have different values, the MAC verification will fail with the same stacktrace as mentioned in this issue.
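The failure mode described above can be sketched with a toy MAC example. This uses Python's stdlib `hmac` as a stand-in for libnacl's `crypto_auth`/`crypto_auth_verify` (the real code derives the MAC key from the DH exchange; the key names here are illustrative, not the real protocol fields):

```python
import hashlib
import hmac

def auth(message: bytes, key: bytes) -> bytes:
    # Stand-in for libnacl.crypto_auth: a MAC over the message under a shared key.
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(tag: bytes, message: bytes, key: bytes) -> None:
    # Stand-in for libnacl.crypto_auth_verify: raises ValueError on mismatch.
    if not hmac.compare_digest(tag, auth(message, key)):
        raise ValueError('Failed to auth msg')

# The responder authenticates its reply using key material from the FIRST
# create attempt...
old_secret = b'key-material-from-first-attempt'
reply = b'responder-key-payload'
tag = auth(reply, old_secret)

# ...but by the time the delayed reply arrives, the initiator has retried and
# holds fresh key material, so verification fails just like in the stacktrace.
new_secret = b'key-material-from-retry-attempt'
try:
    verify(tag, reply, new_secret)
except ValueError as e:
    print(e)  # prints: Failed to auth msg
```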
We currently use the `circuit_id` to keep track of `created`/`extended` messages, which is why older `created`/`extended` messages are still being accepted. Unfortunately, there is no easy way to distinguish between old and new messages except by verifying the MAC. I'll address this the next time we switch to a new community ID by adding an extra identifier. For now, we should just log and drop `created`/`extended` messages if the MAC verification fails.
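The proposed log-and-drop handling could look roughly like this (a hypothetical handler sketch; the function name and signature are made up, not the real ipv8 API):

```python
import logging

logger = logging.getLogger(__name__)

def on_created_or_extended(circuit, payload, verify_mac) -> bool:
    # Hypothetical sketch: verify the MAC and drop stale created/extended
    # messages instead of letting the ValueError propagate as a stacktrace.
    # Returns True if the message was accepted.
    try:
        verify_mac(payload)
    except ValueError:
        # Most likely a delayed reply from an earlier create attempt reusing
        # the same circuit_id: log it and drop the message.
        logger.warning("Dropping created/extended message for circuit %s: "
                       "MAC verification failed",
                       getattr(circuit, 'circuit_id', '?'))
        return False
    # ... continue normal processing of the payload ...
    return True
```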
I'm not sure that this is related to the speed issue, though, because Tribler will already drop the messages, and any fix will be more about suppressing the stacktrace than actually fixing something. Also, note that there are runs that show speeds comparable to the successful run above but do have the stacktrace (e.g., run #1032 on Win7). To me, it looks more like the tunnels on our Win7 tester perform better than on Win10 for some reason.
Thanks for debugging! This is valuable information. If it's hard to reproduce, and you tested it on Windows 10 64-bit, I think it's fine to remove the `high priority` tag from this issue and look at it later. I made this high priority because we upgraded the entire build infrastructure to Python 3.8, and therefore also some critical libraries such as `libsodium`. This upgrade might cause unexpected behaviour on different platforms.
I'm not too worried about the occasional "Failed to auth msg" stacktrace. I've made a unittest to trigger a situation with delayed `created`/`extended` messages, and the resulting stacktrace looks exactly the same.
What does worry me is the frequency with which they happen on Win10 compared to Win7/8. For instance, if you look at run #1049 on Win10 compared to run #1039 on Win7. Both runs last about 11hrs, yet on Win10 we have 466 stacktraces and on Win7 we have only 14. Also, looking at the "circuit states" graph the Win10 tester seems to have major issues with creating circuits.
As you would expect with these kinds of issues, all seems well while running the release build on my Win10 machine.
Sounds like some other variable is at play; it would be nice if test runs produced hardware/OS/library artifacts.
Maybe something is not thread-safe: what are the core counts on the VM/hardware the tests were run on? Maybe the OS was not updated, or a library version differs.
We have been able to reproduce it on our Proxmox machines (which are Windows 10 64-bit machines virtualised with QEMU). The virtualisation layer could have caused this issue, but we haven't seen it in the wild.
Let me update this issue since it's actually not that high priority/blocker anymore.
I occasionally see the same traceback on my Windows 10 machine:
```
[tribler-core PID:18840] 2023-04-25 09:44:49,566 - ERROR <community:668> TriblerTunnelCommunity.on_packet_from_circuit(): Exception occurred while handling packet!
Traceback (most recent call last):
  File "C:\dev\py-ipv8\ipv8\messaging\anonymization\community.py", line 664, in on_packet_from_circuit
    result = handler(source_address, data, circuit_id)
  File "C:\dev\py-ipv8\ipv8\messaging\anonymization\community.py", line 42, in wrapper
    return func(self, source_address, payload, circuit_id)
  File "C:\dev\py-ipv8\ipv8\messaging\anonymization\community.py", line 745, in on_created
    self._ours_on_created_extended(circuit, payload)
  File "C:\dev\tribler\src\tribler\core\components\tunnel\community\tunnel_community.py", line 452, in _ours_on_created_extended
    super()._ours_on_created_extended(circuit, payload)
  File "C:\dev\py-ipv8\ipv8\messaging\anonymization\community.py", line 511, in _ours_on_created_extended
    shared_secret = self.crypto.verify_and_generate_shared_secret(hop.dh_secret, payload.key,
  File "C:\dev\py-ipv8\ipv8\messaging\anonymization\tunnelcrypto.py", line 46, in verify_and_generate_shared_secret
    libnacl.crypto_auth_verify(auth, dh_received, shared_secret[:32])
  File "C:\dev\tribler\venv\lib\site-packages\libnacl\__init__.py", line 905, in crypto_auth_verify
    raise ValueError('Failed to auth msg')
ValueError: Failed to auth msg
```
Most often, I do not see any such errors for a long time. But when the error does appear, it sometimes repeats fairly often. From this, it looks to me like it depends on which peers I'm currently connected to.
It seems that my previous fix of tracking `created`/`extended` messages using a separate identifier (as part of https://github.com/Tribler/py-ipv8/pull/901) has been partially reverted in IPv8 2.9. `RetryRequestCache` is now coupled again to a `circuit_id` instead of a random identifier. This may be why this error is back.
I do vaguely remember that even with my fix the issue could still occur (especially since the identifier is only 2 bytes), but it should happen a lot less frequently.
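The identifier-based tracking discussed above can be sketched as a toy cache (this is not the real `RetryRequestCache` API, just an illustration of keying pending requests by `(circuit_id, random 2-byte identifier)` so stale replies stop matching):

```python
import os
from typing import Dict, Tuple

class RetryCache:
    """Toy sketch: pending create/extend requests keyed by (circuit_id,
    random 2-byte identifier), so a delayed reply carrying a stale
    identifier no longer matches the current attempt."""

    def __init__(self) -> None:
        self._pending: Dict[Tuple[int, int], object] = {}

    def register(self, circuit_id: int, state: object) -> int:
        # Only 65,536 possible values, so collisions (and thus the occasional
        # MAC failure) remain possible, just far less frequently.
        identifier = int.from_bytes(os.urandom(2), 'big')
        self._pending[(circuit_id, identifier)] = state
        return identifier

    def pop(self, circuit_id: int, identifier: int):
        # A stale reply (carrying an old identifier) yields None and is dropped.
        return self._pending.pop((circuit_id, identifier), None)
```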
The partial revert happened because that was causing the (different) traceback of https://github.com/Tribler/py-ipv8/issues/1044.
Right now, I only see three solutions:
1. Tell the originator of the invalid circuit branch to remove their end of the circuit, because they are no longer selected for the circuit.
2. Figure out a different fix for https://github.com/Tribler/py-ipv8/issues/1044.
3. Accept that we may get ValueErrors logged.
I'm not sure which of these three options we should prefer. Currently we are using option 3, but I think there's something to be said for each of them.
My 2 cents on each of these solutions: we could add a check (`circuit_id in self.circuits`), like you originally suggested, and drop the message if the check fails; the cache will time out by itself.
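That guard could look like the following (a hypothetical helper, not the real ipv8 handler signature):

```python
def handle_created(circuit_id: int, payload: bytes, circuits: dict) -> bool:
    # Sketch of the suggested check: only process created/extended messages
    # for circuits we still track. A stale reply for a removed circuit is
    # dropped, and its request-cache entry simply times out on its own.
    if circuit_id not in circuits:
        return False  # dropped
    # ... verify the MAC and continue building the circuit ...
    return True
```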
Of course, one could argue that `remove_circuit` must clean the request-cache, but that could just as easily be done by iterating over the caches and doesn't really warrant switching to another cache identifier, in my opinion.

@egbertbouman, with https://github.com/Tribler/py-ipv8/pull/1204 in IPv8 2.12 being part of Tribler now, can we close this issue?
@qstokkink You're right, this one can be closed.
Sentry issue: TRIBLER-1JC
The latest release (7.13) is still using IPv8 2.10, according to `requirements-core.txt`, so it's expected for this error to still pop up on Sentry occasionally.
Operating System and version: Windows 10, 64-bit
Traceback and/or steps to reproduce
I got this stack trace when running the Tribler application tester on Windows 10, 64-bit (mainnet).