I recently had this problem when experimenting with threading in asyncio. Could you try to disable all threaded calls?
Disabling Market, RemoteQuery and GigaChannel communities does not help either.
Not sure why this is happening. Considering the error, I would expect this to happen at shutdown. If I remember correctly, this is the error you get when a Task destructor is called without the Task having finished.
For some reason, I can't seem to reproduce this (on Win10, Python 3.8, libtorrent 1.2.1). I've tried to run from source and using the RC1 binaries.
@ichorid Have you tried enabling asyncio's debug mode so that you get some more information?
@egbertbouman , I'll try it with asyncio debug enabled.
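For reference, asyncio's debug mode can be switched on either from the environment or on the loop itself. A minimal sketch, using only standard asyncio facilities (where exactly to hook this into Tribler's startup is an assumption):

```python
import asyncio
import logging
import warnings

loop = asyncio.get_event_loop()
loop.set_debug(True)                # extra checks + source tracebacks for tasks
loop.slow_callback_duration = 0.1   # warn about callbacks blocking > 100 ms
logging.getLogger("asyncio").setLevel(logging.DEBUG)
warnings.simplefilter("always", ResourceWarning)

# Alternatively, run the process with the PYTHONASYNCIODEBUG=1
# environment variable set.
```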
After some fiddling with asyncio debugging, one of the logs produced this:
[PID:26906] 2020-03-18 19:46:58,810 - ERROR <community:336> DHTDiscoveryCommunity.on_packet(): Exception occurred while handling packet!
Traceback (most recent call last):
File "/home/vader/my_SRC/TRIBLER/tribler_ichorid/src/pyipv8/ipv8/community.py", line 331, in on_packet
result = handler(source_address, data)
File "/home/vader/my_SRC/TRIBLER/tribler_ichorid/src/pyipv8/ipv8/lazy_community.py", line 39, in wrapper
return func(self, Peer(auth.public_key_bin, source_address), *unpacked)
File "/home/vader/my_SRC/TRIBLER/tribler_ichorid/src/pyipv8/ipv8/dht/discovery.py", line 177, in on_connect_peer_response
cache.future.set_result(payload.nodes)
asyncio.base_futures.InvalidStateError: invalid state
[PID:26906] 2020-03-18 19:46:54,123 - ERROR <community:447> DHTDiscoveryCommunity.on_find_response(): Got find-response with unknown identifier, dropping packet
This SO post suggests the problem is related to the `add_done_callback` thing.
The InvalidStateError happens if you try to set the state more than once. This is usually solved by something like:
```python
if not cache.future.done():
    cache.future.set_result(payload.nodes)
```
(this is similar to what we do here)
In this instance, I'm not sure what's causing this error though. Is there a timeout in the logs just before this stacktrace?
Egbert, DHT is not (the only thing) responsible for the crash. Running Tribler with DHT disabled still produces the crash.
Also, during one of the tests, I downloaded a couple of torrents, and when I clicked "Remove with data", Tribler crashed immediately. This points further to the `add_done_callback` thing, because it is used in the `@require_handle` decorator.
Correct, the issue in the DHT should not cause the crash since the packet handlers in IPv8 (whether they are async or not) will only log the error.
So, it is basically either the Tunnels or the Libtorrent wrapper (it triggers even without the bootstrap download or any other downloads, though).
Trying to also debug the new RC1 on Ubuntu. It has been holding steady for a while already, no crash yet.
What is this?
ERROR:TrustChainCommunity:Not sending crawl response, the block is invalid. Result (<function ValidationResult.invalid at 0x7f3d4b8c5510>, ['Total down is higher than expected compared to the next block'])
Some chatter that should be fixed:
ERROR:DiscoveryCommunity:Exception occurred while handling packet!
Traceback (most recent call last):
File "ipv8/messaging/serialization.py", line 314, in unpack_multiple
File "ipv8/messaging/serialization.py", line 291, in unpack
File "ipv8/messaging/serialization.py", line 155, in unpack_from
struct.error: unpack_from requires a buffer of at least 80 bytes
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "ipv8/messaging/serialization.py", line 341, in unpack_to_serializables
File "ipv8/messaging/serialization.py", line 321, in unpack_multiple
ipv8.messaging.serialization.PackError: Could not pack item 1: varlenHx20
error: unpack_from requires a buffer of at least 80 bytes
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "ipv8/community.py", line 331, in on_packet
File "ipv8/lazy_community.py", line 33, in wrapper
File "ipv8/messaging/serialization.py", line 360, in ez_unpack_serializables
File "ipv8/messaging/serialization.py", line 344, in unpack_to_serializables
ipv8.messaging.serialization.PackError: Failed to unserialize SimilarityResponsePayload
PackError: Could not pack item 1: varlenHx20
error: unpack_from requires a buffer of at least 80 bytes
[SNIP]
Traceback (most recent call last):
File "/home/tribler/jenkins/workspace/Build-Tribler_release/Async-Py3-Build-Ubuntu64/tribler/src/tribler-core/tribler_core/modules/libtorrent/torrentdef.py", line 75, in __init__
RuntimeError: incorrect number of piece hashes in torrent file
[SNIP]
Traceback (most recent call last):
File "/home/tribler/jenkins/workspace/Build-Tribler_release/Async-Py3-Build-Ubuntu64/tribler/src/tribler-core/tribler_core/modules/libtorrent/download_manager.py", line 760, in load_checkpoint
File "/home/tribler/jenkins/workspace/Build-Tribler_release/Async-Py3-Build-Ubuntu64/tribler/src/tribler-core/tribler_core/modules/libtorrent/torrentdef.py", line 136, in load_from_dict
File "/home/tribler/jenkins/workspace/Build-Tribler_release/Async-Py3-Build-Ubuntu64/tribler/src/tribler-core/tribler_core/modules/libtorrent/torrentdef.py", line 77, in __init__
ValueError: incorrect number of piece hashes in torrent file
ERROR:Session:bitcoinlib library cannot be loaded: No module named 'bitcoinlib'
@synctext Most of these errors are because someone is sending you bad packets (see https://github.com/Tribler/py-ipv8/issues/701). Not sure about the Trustchain error.
No refresh after adding dozens of new files to "My Channel" (you have to wait :-)
@synctext The `ipv8.messaging.serialization.PackError: Failed to unserialize SimilarityResponsePayload` error is from the Kotlin implementation; we already informed @MattSkala 😉
> No refresh after adding dozens of new files to "My Channel" (you have to wait :-)
Channel size counters are only updated on channel commits, which happen automatically (unless you disable this feature in Settings). The added entries will be visible in the channel immediately, though.
I am at a loss dealing with this. Naturally, it is caused by something in the Tunnels or Libtorrent. However, sometimes it will not trigger at all, and there are no real debug messages.
Stable for 3+ hours already, but not doing anon downloads. Normal downloads are OK.
@ichorid @egbertbouman @devos50 Please coordinate and try to get something like the application stress tester operational, to get a good reproducible crash of this. It might need some repeated downloading. RC1 just crashed on me after 18 hours. No anonymous downloads, just 1 normal download active.
INFO:CreditMiningManager:0 active download(s), 0.000 MB uploaded, 0.000 MB downloaded
INFO:DownloadManager:Failed to retrieve metainfo for 0508c7bb87f531cbfc7ecac1f0ac608eb98d7e96
INFO:DownloadManager:Failed to retrieve metainfo for b03104cc6c5d73ccab0b92b27854d5ec011e7147
INFO:DownloadManager:Failed to retrieve metainfo for b4f0da9cafc13949a9e3ebdd53b8973a772b21fd
INFO:GigaChannelManager:Downloading new channel version 9b10b5fb1391f9d0da0bdbab2a4c82fd50b9f402c9cbdbcdc0552dd8ce3910005ea7de7282851fa932d939197ec88ddc463acbaf73b46574cf4afa432964c7f7 ver 1562671495055->1562671495058
INFO:GigaChannelManager:Downloading new channel version 02ef6751e809bc49c92540962fd8fa54cf89b0004c61bfa1d31b5095ee20127a0bda166852057e15cdf57e3c315b88e67fddaa4d4e41b34b9415163d6062ff00 ver 1562497274819->1583447000002
INFO:TriblerTunnelCommunity:Want 0 data circuits of length 1
INFO:DownloadManager:Trying to fetch metainfo for 0508c7bb87f531cbfc7ecac1f0ac608eb98d7e96
INFO:DownloadManager:Trying to fetch metainfo for b03104cc6c5d73ccab0b92b27854d5ec011e7147
INFO:PopularityCommunity:Received torrent health information for 5 random torrents and 5 checked torrents
ERROR:root:Traceback (most recent call last):
File "/home/tribler/jenkins/workspace/Build-Tribler_release/Async-Py3-Build-Ubuntu64/tribler/src/tribler-gui/tribler_gui/event_request_manager.py", line 114, in on_read_data
RuntimeError: Task was destroyed but it is pending!
INFO:TorrentChecker:Selected 2 new torrents to check on tracker: http://tracker.zelka.org/announce.php
INFO:Session:Tribler shutdown state notification:Shutting down Credit Mining...
INFO:CreditMiningManager:Shutting down CreditMiningManager
INFO:Session:Tribler shutdown state notification:Shutting down Torrent Checker...
INFO:Session:Tribler shutdown state notification:Shutting down Gigachannel Manager...
INFO:DownloadManager:Failed to retrieve metainfo for 0508c7bb87f531cbfc7ecac1f0ac608eb98d7e96
INFO:DownloadManager:Failed to retrieve metainfo for b03104cc6c5d73ccab0b92b27854d5ec011e7147
INFO:Session:Tribler shutdown state notification:Shutting down Version Checker...
INFO:Session:Tribler shutdown state notification:Shutting down Resource Monitor...
INFO:Session:Tribler shutdown state notification:Unloading Tunnel Community...
INFO:Socks5Connection:Closing session, reason stopping
INFO:Socks5Connection:Closing session, reason unspecified
INFO:TriblerTunnelCommunity:Destroy_exit_socket 2609793190 ('81.171.8.17', 35125)
INFO:TriblerTunnelCommunity:Removing DATA circuit 992103458 unload
INFO:TriblerTunnelCommunity:destroy_circuit 992103458 ('5.79.71.187', 35035)
INFO:TriblerTunnelCommunity:Removing DATA circuit 4170702521 unload
INFO:TriblerTunnelCommunity:destroy_circuit 4170702521 ('5.79.71.187', 35035)
INFO:TriblerTunnelCommunity:Removing DATA circuit 245290361 unload
INFO:TriblerTunnelCommunity:destroy_circuit 245290361 ('81.171.8.19', 35020)
INFO:TriblerTunnelCommunity:Removing DATA circuit 372189226 unload
INFO:TriblerTunnelCommunity:destroy_circuit 372189226 ('81.171.8.17', 35005)
INFO:Session:Tribler shutdown state notification:Shutting down TrustChain Community...
INFO:Session:Tribler shutdown state notification:Shutting down IPv8...
INFO:Session:Tribler shutdown state notification:Saving configuration...
INFO:Session:Tribler shutdown state notification:Checkpointing Downloads...
ERROR:Download:Resume data failed to save: <class 'tribler_core.exceptions.SaveResumeDataError'>: torrent has no metadata
ERROR:Download:Resume data failed to save: <class 'tribler_core.exceptions.SaveResumeDataError'>: torrent has no metadata
INFO:Session:Tribler shutdown state notification:Shutting down Downloads...
INFO:Session:Tribler shutdown state notification:Shutting down Libtorrent Manager...
INFO:Session:Tribler shutdown state notification:Waiting for Libtorrent to finish...
INFO:Session:Tribler shutdown state notification:Shutting down Metadata Store...
INFO:Session:Tribler shutdown state notification:Shutting down API Manager...
INFO:root:Shutting down Tribler
@synctext, I spent the last three days debugging this. A stress tester would not help with debugging, but it could have helped with catching this earlier in development. This is the worst kind of bug: there is almost no info on what causes it, and the crash is entirely stochastic.
Ouch. What is your next-steps strategy? Could we enhance our application tester to output more debug info and give us some hints?
btw, crashed again just after a call to `PopularityCommunity`, for a second time.
INFO:PopularityCommunity:Received torrent health information for 5 random torrents and 5 checked torrents
INFO:CreditMiningManager:Downloading: 0, Uploading: 0, Stopped: 0
INFO:CreditMiningManager:0 active download(s), 0.000 MB uploaded, 0.000 MB downloaded
INFO:TriblerTunnelCommunity:We joined circuit 3007315007 with neighbour ('81.171.27.193', 35030)
INFO:PopularityCommunity:Received torrent health information for 5 random torrents and 5 checked torrents
ERROR:root:Traceback (most recent call last):
File "/home/tribler/jenkins/workspace/Build-Tribler_release/Async-Py3-Build-Ubuntu64/tribler/src/tribler-gui/tribler_gui/event_request_manager.py", line 114, in on_read_data
RuntimeError: Task was destroyed but it is pending!
If you enable asyncio debug, we should get more information about this unfinished task..
> If you enable asyncio debug, we should get more information about this unfinished task..
Unfortunately, we will not. I got, like, three dozen crashes with asyncio debug enabled. Nothing suspicious. :disappointed:
INFO:TriblerTunnelCommunity:Adding first hop 81.171.8.17:35180 to circuit 1231071863
INFO:CreditMiningManager:Downloading: 0, Uploading: 0, Stopped: 0
INFO:CreditMiningManager:0 active download(s), 0.000 MB uploaded, 0.000 MB downloaded
ERROR:root:Traceback (most recent call last):
File "/home/tribler/jenkins/workspace/Build-Tribler_release/Async-Py3-Build-Ubuntu64/tribler/src/tribler-gui/tribler_gui/event_request_manager.py", line 114, in on_read_data
RuntimeError: Task was destroyed but it is pending!
Printing full error context yields something useful (formatted manually for readability):
({'message': 'Task was destroyed but it is pending!',
'task':
<Task pending coro=<TunnelCommunity.on_create()
running at /tribler_src/src/pyipv8/ipv8/messaging/anonymization/community.py:725>
wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f9ca46d4590>()]
created at /tribler_src/src/tribler-core/tribler_core/modules/tunnel/community/triblertunnel_community.py: 206>
cb=[TaskManager.register_task.<locals>.done_cb() at /tribler_src/src/pyipv8/ipv8/taskmanager.py:105]
created at /tribler_src/src/pyipv8/ipv8/messaging/anonymization/community.py:675>,
'source_traceback': [
<FrameSummary file /tribler_src/src/run_tribler.py, line 117 in <module>>,
<FrameSummary file /tribler_src//src/run_tribler.py, line 101 in start_tribler_core>,
<FrameSummary file /home/vader/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py, line 534 in run_forever>,
<FrameSummary file /home/vader/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py, line 1763 in _run_once>,
<FrameSummary file /home/vader/.pyenv/versions/3.7.4/lib/python3.7/asyncio/events.py, line 88 in _run>,
<FrameSummary file /home/vader/.pyenv/versions/3.7.4/lib/python3.7/asyncio/selector_events.py, line 965 in _read_ready>,
<FrameSummary file /tribler_src/src/pyipv8/ipv8/messaging/interfaces/udp/endpoint.py, line 28 in datagram_received>,
<FrameSummary file /tribler_src/src/pyipv8/ipv8/messaging/interfaces/endpoint.py, line 85 in notify_listeners>,
<FrameSummary file /tribler_src/src/pyipv8/ipv8/messaging/interfaces/endpoint.py, line 73 in _deliver_later>,
<FrameSummary file /tribler_src/src/pyipv8/ipv8/community.py, line 331 in on_packet>,
<FrameSummary file /tribler_src/src/pyipv8/ipv8/messaging/anonymization/community.py, line 660 in on_cell>,
<FrameSummary file /tribler_src/src/pyipv8/ipv8/messaging/anonymization/community.py, line 675 in on_packet_from_circuit>]}')
Flip-flopping between subclasses in the stacktrace, it seems that this future:
is added to this request cache:
but is also returned:
This probably leads to a conflict somewhere.
Fun fact: `should_join_circuit` is `async` in the base AnonymizationCommunity, though it is not necessary there. But `TriblerTunnelCommunity` overrides it with a non-async version.
So, on a `CreatePayload`, `on_create` gets fired, which then runs `should_join_circuit` and `await`s the `balance_future` created there. Also, `on_create` is a `Task` managed by the `TaskManager` aspect of `TriblerTunnelCommunity`.
What probably happens then is that something cancels the Task while it is still waiting on the Future.
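For what it's worth, this failure mode can be reproduced in isolation. A minimal sketch with illustrative names (not the actual Tribler code): a task awaits a bare `Future` that is never resolved, and once the still-pending task becomes garbage, its destructor logs the dreaded message.

```python
import asyncio
import gc

async def on_create_like():
    # Stand-in for on_create() awaiting the balance_future from the cache.
    balance_future = asyncio.get_running_loop().create_future()
    await balance_future  # never set and never cancelled -> pending forever

async def main():
    task = asyncio.ensure_future(on_create_like())
    await asyncio.sleep(0)  # let the task start and block on the future
    del task                # drop the only external reference while pending
    gc.collect()            # collect the task<->future reference cycle
    # asyncio now logs: "Task was destroyed but it is pending!"

asyncio.run(main())
```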
@ichorid Yes, `should_join_circuit` in the base class is async; in TriblerTunnel it's not and returns a Future. However, in both cases it is called using `await should_join_circuit`. I don't really see the problem.
I think the issue that's causing the stacktrace is that `BalanceRequestCache.on_timeout` should be setting the result of the Future.
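In code, that suggestion would look roughly like this. A sketch only: the constructor signature follows the py-ipv8 `NumberCache` pattern, and resolving with `False` (deny the join) is an assumed default, not the actual Tribler implementation:

```python
from ipv8.requestcache import NumberCache

class BalanceRequestCache(NumberCache):
    def __init__(self, community, circuit_id, balance_future):
        super().__init__(community.request_cache, u"balance-request", circuit_id)
        self.balance_future = balance_future

    def on_timeout(self):
        # Resolve the future on timeout, so the task awaiting it in on_create
        # is released instead of being destroyed while still pending.
        if not self.balance_future.done():
            self.balance_future.set_result(False)
```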
@egbertbouman, but is there a guarantee that the `BalanceRequestCache` will time out before the anonymous Task serving `on_create` does? If it does not, the same problem will reappear.
If I insta-pop the RequestCache, I get a different task-destroyed error, but this one is caught and logged (no crash!):
[PID:5282] 2020-03-21 17:00:08,818 - ERROR <base_events:1604> asyncio.default_exception_handler(): Unhandled error in exception handler
context:
{'message': 'Task was destroyed but it is pending!',
'task':
<Task pending coro=<TunnelCommunity.on_create() done,
defined at /home/quinten/Documents/tribler/src/pyipv8/ipv8/messaging/anonymization/community.py:711>
wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f41081cc490>()]>
cb=[TaskManager.register_task.<locals>.done_cb() at /home/quinten/Documents/tribler/src/pyipv8/ipv8/taskmanager.py:105]>}
Traceback (most recent call last):
File "/usr/lib/python3.7/asyncio/base_events.py", line 1639, in call_exception_handler
self._exception_handler(self, context)
File "/home/quinten/Documents/tribler/src/tribler-core/tribler_core/session.py", line 379, in unhandled_error_observer
self.api_manager.get_endpoint('events').on_tribler_exception(text_long)
File "/home/quinten/Documents/tribler/src/tribler-core/tribler_core/restapi/events_endpoint.py", line 115, in on_tribler_exception
self.write_data({"type": NTFY.TRIBLER_EXCEPTION.value, "event": {"text": exception_text}})
File "/home/quinten/Documents/tribler/src/pyipv8/ipv8/taskmanager.py", line 35, in wrapper
ensure_future(func(self, *args, **kwargs)),
File "/usr/lib/python3.7/asyncio/tasks.py", line 607, in ensure_future
loop = events.get_event_loop()
File "/usr/lib/python3.7/asyncio/events.py", line 644, in get_event_loop
% threading.current_thread().name)
RuntimeError: There is no current event loop in thread 'ThreadPoolExecutor-0_1'.
/usr/lib/python3.7/asyncio/base_events.py:1655: RuntimeWarning: coroutine 'EventsEndpoint.write_data' was never awaited
exc_info=True)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
@ichorid The anonymous Task serving `on_create` will never time out.
@qstokkink You removed the RequestCache, which results in the Future destructor getting called, which results in this crash. Again, I think you just need to set the Future's result before it gets destroyed.
@egbertbouman sure, but I want to confirm we're fixing the right thing. We have no way to reproduce it now.
This makes a hard-crash happen very quickly for me:
diff --git a/src/pyipv8 b/src/pyipv8
index d57f610d6..775b5a40b 160000
--- a/src/pyipv8
+++ b/src/pyipv8
@@ -1 +1 @@
-Subproject commit d57f610d60a45f30535659a2741d2862881a875a
+Subproject commit 775b5a40b8d7b4315203430b9569b74731dfd7a8
diff --git a/src/tribler-core/tribler_core/modules/tunnel/community/triblertunnel_community.py b/src/tribler-core/tribler_core/modules/tunnel/community/triblertunnel_community.py
index 6e4d2f650..442f2eadc 100644
--- a/src/tribler-core/tribler_core/modules/tunnel/community/triblertunnel_community.py
+++ b/src/tribler-core/tribler_core/modules/tunnel/community/triblertunnel_community.py
@@ -190,17 +190,17 @@ class TriblerTunnelCommunity(HiddenTunnelCommunity):
"""
Check whether we should join a circuit. Returns a future that fires with a boolean.
"""
- if self.settings.max_joined_circuits <= len(self.relay_from_to) + len(self.exit_sockets):
+ """if self.settings.max_joined_circuits <= len(self.relay_from_to) + len(self.exit_sockets):
self.logger.warning("too many relays (%d)", (len(self.relay_from_to) + len(self.exit_sockets)))
- return succeed(False)
+ return succeed(False)"""
# Check whether we have a random open slot, if so, allocate this to this request.
circuit_id = create_payload.circuit_id
- for index, slot in enumerate(self.random_slots):
+ """for index, slot in enumerate(self.random_slots):
if not slot:
self.random_slots[index] = circuit_id
- return succeed(True)
-
+ return succeed(True)"""
+ print("*"*50)
# No random slots but this user might be allocated a competing slot.
# Next, we request the token balance of the circuit initiator.
balance_future = Future()
@@ -217,6 +217,8 @@ class TriblerTunnelCommunity(HiddenTunnelCommunity):
self.directions.pop(circuit_id, None)
self.relay_session_keys.pop(circuit_id, None)
+ self.request_cache.pop(u"balance-request", circuit_id)
+
return balance_future
async def on_payout_block(self, source_address, data):
@egbertbouman I believe my patch file does indeed confirm that you are right and the `Future` destructor is the issue.
@qstokkink, it crashes for me too, almost instantly. This is indeed a good way to (hopefully) test the thing.
This is exactly the kind of situation `trio` and cancellation scopes were designed to solve... :man_facepalming:
To prod at an open wound... I guess in a more ideal Tribler world, we would push code and see that it broke the overnight application testers. First, a manual revert would be issued. Second, a test would be created that crashes it quicker, by magic. Third, we would have a clear idea of which PR broke stability. Finally, the fix is committed. Do we want to start defining and enforcing this more? Or do we keep these seemingly frustrating days of bug hunting in order to maximize roadmap progress?
@synctext, this bug was introduced during the transition from Twisted to asyncio, due to subtle differences in the frameworks' semantics. Also, overnight testers would probably not catch this, as it requires some time to trigger and does not trigger at all on most machines. Of course, we would like to get more effective deployment testing. However, this kind of bug is really, really hard to hunt down unless you collect full telemetry from users.
Proper deployment testing is no silver bullet. Some problems can only be kept in check by adopting a better-suited programming style (think threads vs reactor, spaghetti vs structured etc.), doing frequent experimental releases and producing more useful stack reports.
The proper way to handle this is to add a callback that fires the `balance_future` associated with the `BalanceRequestCache` on cancellation of the associated Task, and not just to set it in `on_timeout` as @egbertbouman suggested. This will guarantee that whatever the reason for cancelling the task (e.g. Tribler shutdown), `balance_future` will be triggered.
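In plain asyncio terms, that idea would look roughly like the sketch below (hypothetical wiring and names; as noted next, py-ipv8's `NumberCache` does not currently expose the pieces needed to register such a callback):

```python
import asyncio

def guard_balance_future(task: asyncio.Task, balance_future: asyncio.Future):
    """Resolve the pending future whenever the awaiting task finishes,
    including on cancellation (e.g. at Tribler shutdown)."""
    def _release(_finished_task):
        if not balance_future.done():
            balance_future.set_result(False)  # assumed default: deny the join
    task.add_done_callback(_release)
```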
However, with the current implementation of `NumberCache`, that is impossible. `NumberCache` does not inherit from `asyncio.Task` and instead is just a regular object. The proper way to handle cache-like objects is to refactor `NumberCache` to inherit from `asyncio.Task`, implementing its `cancel()` method. That's what the `cancel` method is for. :see_no_evil:
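Without literally inheriting from `asyncio.Task`, the idea could be approximated by giving the cache entry a `cancel()` that releases its waiters. An illustrative sketch, not the py-ipv8 API:

```python
import asyncio

class CancellableCache:
    """A NumberCache-like object whose cancel() resolves its pending future,
    so whoever awaits it is released instead of dying mid-await."""

    def __init__(self, loop=None):
        loop = loop or asyncio.get_event_loop()
        self.future = loop.create_future()

    def cancel(self, default=False):
        if not self.future.done():
            self.future.set_result(default)
```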
That would be a pretty big refactoring. For 7.5, I'll try to stick to @egbertbouman's solution and hope it works.
I've applied @egbertbouman's fix and am now testing the thing. If I don't see anything suspicious in a day or so, I'll file a PR.
Our RC1 crashes on Linux within an hour. I think it should not be possible for this to go undetected in our process.
Anyways, #4999 does not need to be done in a hurry.
> Our RC1 crashes on Linux within an hour. I think it should not be possible for this to go undetected in our process.
I completely agree. However, the funny thing is that all this time no one paid any attention to this particular crash, because Tribler was crashing constantly due to the Libtorrent 1.2.x transition, the Python 3 transition, the asyncio transition, GUI refactorings... well, you get it. One can only check for regressions on a thing that is already stable: "garbage in, garbage out", as they say. Otherwise, tests become merely a nuisance that developers tend to ignore.
We're actively working on getting Tribler into a shape where it will actually become testable.
Being patient for 14 years and 10 months is not always easy, but experimental release 4 and RC1 have solid performance and responsiveness! btw, did we switch the anon relaying protocol? I'm not earning anything anymore, and I'm not part of any relay circuit.
This happens both on Linux (witnessed myself many times) and on Windows (reported in 7.5.0-rc1). Just run Tribler, do nothing (or do anything), and after about 5 minutes it crashes.
Observations so far:
- It is not the `WeakValueDictionary`: when I change it back to `dict`, it still happens.
- `DEBUG` level of logging makes it go away (or I was just really lucky to not trigger it in several dozen runs with debug enabled).
- ... the `run_tribler_headless` script.