Rik44444 closed this issue 4 years ago.
Another detail: when just adding a download, for a brief moment it will show "1(0)" or "30(0)" peers before showing 0 and getting stuck permanently.
I just tried to see if disabling the download dialog also fixes the problem I'm having, but it doesn't. If I don't use force_dht_announce, I consistently get a delay of a few minutes (using 0 hops).
One of the downloads got unstuck after a few minutes, though the download sees only a single peer and is thus very slow. The torrent is a healthy one; it has thousands of peers (I use it regularly for testing).
@arvidn , what could make Libtorrent ignore other peers and only use one?
Here are some session stats (for this run I disabled Libtorrent logging, unfortunately):
EDIT: while I was typing this, another peer was added to the download:
EDIT2: while watching the thing download I noticed something suspicious: all the clients that were connected were either Libtorrent-based, or 0% completed, no speed, no client-identifier string. After the download stopped, the list of peers looked like this:
Also, while downloading, I started a virtual machine (bridged) with another Tribler instance and started the same download. It started normally, lots of peers, etc. The 1st Tribler instance was immediately able to see the 2nd one and started to leech from it full-speed.
EDIT3: After a few hours the peers list still looks strange:
Some more Libtorrent stats after an hour of seeding stuck:
uTP stats look strange to me :thinking:
I started a Wireshark session capturing the exchange of packets between the stuck Tribler instance and the external world. One of the uTP sessions captured:
@arvidn , is it normal for Libtorrent to regularly send to other peers Timestamp Difference of ff ff f3 db
microseconds? If it could help, I can send you the whole capture.
Another conversation. Does it look normal?
Changed anonymity to 1 hop on one of the downloads, it started to download:
The list of peers looks normal too.
So, the problem is either local to the Libtorrent session, or has something to do with NAT traversal.
Kept Tribler running overnight, all downloads finished, newly added torrents download fine. :man_shrugging:
Stats after the night:
is it normal for Libtorrent to regularly send to other peers Timestamp Difference of ff ff f3 db microseconds?
Yes, the timestamp difference is the difference between the sender's and the receiver's clocks. They aren't synchronized so the difference is mostly a non-sensical value. However, it's very useful to monitor changes over time. The assumption in uTP's congestion control is that the clocks won't drift too much, so the change in this diff represents changes in buffering along the network path.
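For intuition, the field from the capture can be decoded by hand. A minimal sketch (plain Python; it assumes only that the field is the 32-bit wrapping microsecond difference defined by BEP 29, and the helper names are mine):

```python
# Hedged sketch: how a uTP timestamp difference like "ff ff f3 db" can be read.
# The field (BEP 29) is a 32-bit unsigned microsecond value, so clock skew
# between unsynchronized hosts makes the absolute number meaningless; only
# the change between samples carries information about queuing delay.

MASK = 0xFFFFFFFF  # the field wraps at 2**32 microseconds

def ts_diff(our_clock_us: int, their_timestamp_us: int) -> int:
    """32-bit wrapping difference, as a uTP receiver would compute it."""
    return (our_clock_us - their_timestamp_us) & MASK

def delay_change(sample_prev: int, sample_next: int) -> int:
    """Signed change between two diff samples (what congestion control watches)."""
    d = (sample_next - sample_prev) & MASK
    return d - (1 << 32) if d >= (1 << 31) else d

# The value from the capture:
observed = 0xFFFFF3DB
print(observed)  # 4294964187 -- huge in absolute terms, but that's fine
# If a later packet carried 0xFFFFF7DB, buffering grew by about 1 ms:
print(delay_change(0xFFFFF3DB, 0xFFFFF7DB))  # 1024 microseconds
```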
Regarding the libtorrent stats, some of those numbers are "counters", meaning they are cumulative and monotonic. Some of them are "gauges" meaning they are instantaneous snapshots of the current state (say, the number of uTP connections in a certain state, right now). The counters are mostly useful when collecting multiple samples and comparing them over time.
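For example, turning two counter samples into a rate could look like this (a sketch; the metric names are illustrative stand-ins and may not match your libtorrent version's stats exactly):

```python
# Hedged sketch: counters are cumulative, so they only mean something as a
# difference over time; gauges are instantaneous and can be read directly.
# The metric names below are illustrative stand-ins.

def counter_rate(samples, key):
    """Per-second rate from (timestamp, {metric: value}) counter samples."""
    (t0, s0), (t1, s1) = samples[0], samples[-1]
    return (s1[key] - s0[key]) / (t1 - t0)

samples = [
    (0.0,  {"utp.utp_packets_out": 1200, "utp.num_utp_connected": 4}),
    (10.0, {"utp.utp_packets_out": 1700, "utp.num_utp_connected": 3}),
]
# Counter: compare samples over time.
print(counter_rate(samples, "utp.utp_packets_out"))  # 50.0 packets/s
# Gauge: the latest snapshot is the answer.
print(samples[-1][1]["utp.num_utp_connected"])       # 3
```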
@arvidn , would you be able to figure out what's happening if I send you a Wireshark capture of Tribler stuck trying to download the torrents? Also, what could have caused the behaviour on the screenshots above, namely Tribler connecting only to the 0% nameless clients?
I could give it a shot. It’s all direct peer connections, not encrypted nor over a proxy, right?
I wrote this tool a while back which might be useful: https://github.com/arvidn/bittorrent-trace
@arvidn , I applied your bittorrent-trace utility to the exchange file of the stuck Tribler instance. Aside from other peers requesting metainfo for test torrents I used to run previously, there is a dir with the relevant (ubuntu) infohash.
Aside from a single exchange, all captures looked like this:
>> 1591212017.149 HANDSHAKE
>> 1591212017.149 RESERVED 0000000000180005
>> 1591212017.149 INFO-HASH ee55335f2acde309fa645fab11c04750d7e45fa1
>> 1591212017.149 PEER-ID 0920dc71b767d3bfe956e469b5a144b712079088 [. .q.g...V.i..D.....]
<< 1591212017.624 FIN
>> 1591212017.624 FIN
The single non-trivial exchange was:
In the non-trivial exchange, you send a "HAVE-ALL" message. Then you get a "lazy bitfield", where it turns out the peer is a seed as well, and the connection is closed.
There's also an 'upload_only': 1 in the handshake, suggesting you aren't interested in downloading anything. If you're not a seed, you can end up in upload-only mode in case a disk write operation fails with ENOSPACE, i.e. if you run out of space, the torrent will be switched into "upload-only" mode. So, it will keep uploading but won't download any more. In this case the torrent is not paused nor set in an error state, but there should be a file_error_alert.
I take it you're not a seed, is that right? Is it possible the torrent was started in seed-mode, and nothing has caused it to check any piece yet, so it keeps thinking it's a seed? I can't think of any other reason why a HAVE-ALL message would be sent. seed-mode would also explain the upload_only in the handshake.
@arvidn , the captured exchange was from the moment when the ubuntu torrent was already downloaded (although very slow), but the two remaining downloads were still stuck. So, upload only is fine.
What's suspicious, though, is that there are no attempts to get metainfo for the remaining two downloads from DHT. Does your protocol parser catch DHT connection attempts?
It doesn't analyze DHT packets at all. But any actual attempt to establish a TCP or uTP connection to a peer should be picked up
I've compared a healthy capture vs the stuck one. One thing is for sure: when stuck, Tribler sends no DHT requests for the torrents that are stuck in "downloading metadata" state. Instead, it sends DHT requests for some other infohashes. I tried looking up those infohashes, but DHT can't find them.
However, while looking at those infohashes I noticed something strange. There are streaks of infohash requests that all start with the same few bits. Here are the screenshots:
These were all going to another instance of Tribler running in the same local network (my seedbox). Though, there too were some DHT requests going to external servers with the same prefix, so it is not specific to local network Tribler->Tribler DHT activity. The infohashes queried look like this:
03d0f1213f07db9ae14f4933b9cb33528db94513
03d0f294815d38d14487762cb2ebea87ca9843c4
03d0f61269d972c2e27dee834431d82193815e01
03d0f2fbfeacb50e8be80d232a8e56026db8cab0
03d0f5ee78afc9fecad396af3e047758a9e878dd
03d0f1d087595241e85755a0f6932e2fe229d973
03d0f54737d9d932bde8af0f380b65dc96b3dca3
03d0f3f87902892fabecdc7b5388e02d7ee65847
@arvidn , I'm no specialist in DHTs, but to me, it looks very suspicious. It looks more like a DHT response than a DHT request.
Ah, I see from the source code! Libtorrent uses DHT obfuscation, it will only start using the real DHT address when it gets closer to the target. So this is not a bug, but a feature :grin:
Still, this leaves us with a problem: the capture contained no queries for infohashes of the stuck torrents. 03d0f
does not correlate with the beginning of any of the stuck torrents' infohashes.
@arvidn , I still can't understand why it searches for the same prefix as the sender's (our own) id?
Still, this leaves us with a problem: the capture contained no queries for infohashes of the stuck torrents. 03d0f does not correlate with the beginning of any of the stuck torrents' infohashes.
Right, assuming your capture is longer than 15 minutes (which is the default DHT announce interval) there should have been at least one DHT lookup for the info-hash.
There are a few reasons a torrent isn't announcing to the DHT. It can be detected whether it is by looking at torrent_status::announcing_to_dht
(docs)
Queuing logic may disable announcing to DHT (docs)
@arvidn , I still can't understand, why it searches for the same prefix as the sender (our own) id?
That's part of normal DHT routing table maintenance, to make sure it discovers new DHT nodes that are close to its own ID.
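To illustrate the mechanism (a hedged sketch of generic Kademlia-style bucket refreshing, not libtorrent's actual code; helper names are mine):

```python
# Hedged illustration of why routing-table refresh lookups share the node's
# own ID prefix. A Kademlia-style table keeps one bucket per shared-prefix
# length; refreshing bucket k means looking up a random ID that shares
# exactly k leading bits with our own ID.

import random

def shared_prefix_bits(a: bytes, b: bytes) -> int:
    """Number of leading bits two 160-bit node IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        d = x ^ y
        if d == 0:
            n += 8
            continue
        return n + (8 - d.bit_length())
    return n

def refresh_target(own_id: bytes, bucket: int, rng: random.Random) -> bytes:
    """Random 160-bit target sharing exactly `bucket` leading bits with own_id."""
    bits = [(own_id[i // 8] >> (7 - i % 8)) & 1 for i in range(160)]
    out = bits[:bucket] + [1 - bits[bucket]]  # flip one bit -> exactly `bucket` shared
    out += [rng.randint(0, 1) for _ in range(160 - len(out))]
    return bytes(
        sum(bit << (7 - j) for j, bit in enumerate(out[i:i + 8]))
        for i in range(0, 160, 8)
    )

# Using one of the captured infohashes as a stand-in for our own node ID:
own = bytes.fromhex("03d0f1213f07db9ae14f4933b9cb33528db94513")
t = refresh_target(own, 20, random.Random(7))
print(t.hex()[:5], shared_prefix_bits(own, t))  # shares the 03d0f prefix (20 bits)
```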
My capture was only 3 minutes long, which makes it useless for analysis purposes...
Queuing logic may disable announcing to DHT
We don't use the "automanage" feature in Tribler. Does the DHT-limiting queuing logic still apply in this case?
Well, I'll try to reproduce it once again (though it depends on pure chance :four_leaf_clover: ). When I do, I'll do a long packet capture.
Just reproduced it again!
To semi-reliably reproduce it, one must start Tribler without any subscribed channels (or with 1-2 channels), subscribe to the top 10-20 channels, and then, while the channels are downloading, start trying to download other highly seeded torrents by infohash. In my case, the 1st Ubuntu torrent started fine, so I removed it and tried the second one. The second got stuck, like last time.
I'll do a long capture of UDP packets to look at DHT activity.
I think dumping torrent_status
might be useful as well
I managed to reproduce it semi-reliably. When the problem arises, it affects all the torrents I try to download. In a normal session, the added torrent will find a lot of peers almost immediately (10-15 seconds max). In a bugged session, the torrent is stuck for 3-6 minutes in "waiting for metadata". After it downloads the metadata, it will connect to no more than 1-2 peers and proceed to leech from them. This can continue for 2-10 minutes. After that, the torrent almost instantly finds a lot of peers and leeches very fast, as normal. Adding other torrents during the same run results in the same scenario, no matter the torrent.
I started to print torrent_status
as suggested. Here is a sample from the moment when the torrent is still stuck waiting for metadata:
And here is the one from the moment where it already leeches from 2 peers:
Note there is "queue_position": 4 for the stuck state. We do not use torrent automanagement, but the same session runs a dozen background DHT checks for Channel torrents, and there are two dozen tiny Channel torrents already downloaded and seeding.
@arvidn , does Libtorrent use some kind of queueing even when automanagement is not used?
We do not touch the auto_managed
flag at all.
Libtorrent uses automanagement by default :face_with_head_bandage:
In [1]: import libtorrent as lt
In [2]: lt.add_torrent_params_flags_t.flag_auto_managed & lt.add_torrent_params_flags_t.default_flags
Out[2]: 32
And we never touch that. @egbertbouman , do we disable automanage explicitly?
EDIT:
I figured out we construct atp flags ourselves and pass them to async_add_torrent. If we do not add a flag for automanagement, the torrent should start without one, right?
Why the queue:4
then?
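For reference, the flag arithmetic in question can be sketched like this (the numeric values mirror the REPL output above; treat them as illustrative constants, not a stable ABI):

```python
# Hedged sketch of the auto_managed flag arithmetic. FLAG_AUTO_MANAGED == 32
# matches the Out[2] above; the other bit in default_flags here is made up.

FLAG_AUTO_MANAGED = 32          # bit 5, per the REPL output above
default_flags = 32 | 16         # illustrative: auto_managed plus some other default

def is_auto_managed(flags: int) -> bool:
    return bool(flags & FLAG_AUTO_MANAGED)

# Flags built from scratch (e.g. 0) leave auto_managed unset:
print(is_auto_managed(0))                                    # False
# Starting from the defaults keeps it set unless explicitly cleared:
print(is_auto_managed(default_flags))                        # True
print(is_auto_managed(default_flags & ~FLAG_AUTO_MANAGED))   # False
```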
I added a bunch of torrents simultaneously and printed queue_position
and flags
for them.
flags
is 0
everywhere. However, queue_position
is -1 for those already seeding and 1 and more for the others.
Either setting the auto_managed flag does not work through Python, or Libtorrent assigns a queue_position even if the torrent is not managed (and that would be very confusing).
@arvidn ?
Having read a little bit through the Libtorrent source, I can say that it indeed uses queue_position internally when ordering/prioritising connections, even when there are no auto-managed torrents. If there is some kind of internal limit for DHT sessions and/or external peer connections, or some rare condition that can result in a bunch of torrents being constantly scheduled to the front of the queue, that would explain the problem. One thing that could be related is that we download a lot of small (~ 1kb size) torrents in the same session (Channels subsystem torrents). That could trigger some kind of "endgame" mode that forces them to get priority over the recently added torrents.
yeah, queue position is always assigned, regardless of whether it's auto managed or not. it's possible that the dht_limit would affect torrents that don't have the auto managed flag set (but it seems like it shouldn't). Still though, the torrent_status you posted suggests that the DHT is still enabled for the torrent, so it might be some other issue, where it's not announced to the DHT for some reason.
I noticed that stuck torrents are announced to DHT after about 300 seconds. auto_scrape_min_interval
is exactly 300 seconds.
If a torrent is not auto-managed, how often is it announced to DHT? I mean, if I add a torrent by infohash, what's the schedule of DHT checks for it? Does adding a non-auto-managed torrent interfere in some way with DHT scrapes schedule of other non-auto-managed torrents?
The convention is to announce a torrent to the DHT every 15 minutes. The auto-managed mechanism only turns things on and off. It doesn't affect the DHT announce frequency.
It seems it has something to do with C-level async stuff. I can semi-reliably trigger it by adding 3-4 torrents in a couple of seconds just after starting Tribler (with Channels torrents restoring in the same session). One of the added torrents will be almost guaranteed to become stuck in "downloading metadata" for 5-??? minutes.
However, if I change the actual command that Tribler uses to add torrents from async_add_torrent
to the plain add_torrent
, I can't reproduce the problem.
@arvidn , could it be that the logic behind prioritize_dht
can't handle bursts of dozens of asynchronously added torrents?
One reason I can think of this would happen is that if you add the torrents while the DHT is still being bootstrapped, it's possible that the DHT announces for the torrents will fail, because the routing table is empty, or virtually empty.
There's (currently) no logic to detect whether a DHT announce fails and retry again soon. The only re-announce mechanism is the periodic one, every 15 minutes. So, if you wait 15 minutes, do the stuck torrents get unstuck?
So, if you wait 15 minutes, do the stuck torrents get unstuck?
Yes, they do. Sometimes it happens after 10 minutes, sometimes it takes more than an hour (i.e. after adding 3 torrents, one torrent can get unstuck in 15 minutes while the remaining two stay stuck for hours). Waiting for hours happens rarely, though.
Yep, it definitely only happens when using async_add_torrent
. The torrents get stuck for a few minutes, then slowly get unstuck.
When using add_torrent
directly though, sometimes a very interesting thing happens: Tribler downloads endpoint becomes completely stuck, all downloads in the "waiting for metadata" state. The console shows that the usual requests to it do not trigger anything. Then, there come HTTP timeouts... Wireshark capture shows normal uTP activity though, for existing torrents.
After 5-10 minutes, everything becomes unstuck. Tribler instantly shows all downloads as having proper metadata, just started downloading.
@arvidn , is there something that could block in add_torrent
for several minutes, though does not affect already existing (seeding) torrents?
does the call itself, to add_torrent()
block during this time? (it would be surprising if it would block for that long, it would suggest that something fishy is going on with the main network thread in libtorrent). It's only supposed to block until the main thread has received the request, added the torrent and produced a torrent_handle
that's then returned. So, the call should take longer than async_add_torrent()
.
On the libtorrent side, the exact same thing happens in both cases, the only difference is that the non-async version waits on the client side, blocking the calling thread.
If you see a difference in behavior between the two, I would think it's something triggered by timing.
Does this problem seem to be caused by torrents not announcing to the DHT? (or at least not announcing in a timely manner).
Does Tribler change any of the dht_
settings?
In order to understand more of what's going on, could you try changing the dht_announce_interval
setting to something very low, to make torrents announce to the DHT more frequently?
By default it's set to (15 * 60) seconds.
This is the function that's called when it's time to announce another torrent to the DHT.
It tries to stagger the DHT announce of all torrents evenly distributed across the 15 minutes, by setting up a timer to fire every (15m / num_torrents).
int delay = std::max(m_settings.get_int(settings_pack::dht_announce_interval)
/ std::max(int(m_torrents.size()), 1), 1);
There's a risk that if this is called when there's only 1 torrent, the timer is set up for 15 minutes, then if you add more torrents, maybe the delay won't be recomputed for another 15 minutes (I don't know that this is the case, but it seems like a risk).
The next torrent to announce to the DHT is determined by an iterator m_next_dht_torrent
, which rotates over all torrents.
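The staggering arithmetic can be replayed in a few lines (a sketch mirroring the C++ above; the constant and function names are mine):

```python
# Hedged simulation of the staggering risk described above: the re-announce
# timer delay is computed from the number of torrents at arming time, so a
# burst of torrents added right afterwards may wait out a stale interval.

DHT_ANNOUNCE_INTERVAL = 15 * 60  # seconds, libtorrent's default

def announce_delay(num_torrents: int, interval: int = DHT_ANNOUNCE_INTERVAL) -> int:
    # Mirrors: std::max(interval / std::max(int(m_torrents.size()), 1), 1)
    return max(interval // max(num_torrents, 1), 1)

# Timer armed while only 1 torrent exists: next announce in 15 minutes.
print(announce_delay(1))    # 900
# With 30 torrents it would have ticked every 30 seconds:
print(announce_delay(30))   # 30
# So torrents added just after the timer was armed for a single torrent
# could wait the full 900 s before the first of them is announced.
```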
Using add_torrent nonetheless results in stuck torrents, although with a lower probability.
For instance, I added 9 highly popular torrents. 6 of them proceeded from "waiting for metadata" to "downloading" almost immediately.
3 got stuck (let's call them A, B, C).
Then, I waited for a couple of minutes just to be sure. Then I stopped the 6 downloading torrents.
The 3 ABC stuck torrents remained stuck.
Then I waited for another couple of minutes.
Then I stopped one of the stuck torrents (A), and started it right away.
It got unstuck immediately.
The 2 remaining stuck torrents (B and C) were still stuck.
After the A finished downloading, B got unstuck and started to download.
After B finished downloading, C remained stuck for more than 40 minutes...
EDIT: C was stuck for a good reason: there are no peers. Checked with another client.
getting unstuck when you stop and start is consistent with the first DHT announce failing. stopping and starting will trigger another DHT announce right away.
Starting immediately as another torrent finishes suggests that there might be some auto-managed aspect, that it was paused and got unpaused.
the torrent_status you posted earlier suggests that the torrent has not found any peers, anywhere:
'list_seeds': 0
'list_peers': 0
'connect_candidates': 0
does the call itself, to add_torrent() block during this time?

I tried to simulate the blocking with a main-thread sleep for both adding and reporting torrents. It didn't resemble the behaviour I described above at all. So, if there was a problem, it was more akin to missing handles or something like that. A mystery, indeed! :ghost: :man_shrugging: :ghost:
Tribler does not alter any dht_
setting of Libtorrent.
To sum it up, for now, the problem seems to be that adding a bunch of torrents in a short period of time results in some torrents being stuck in "waiting for metadata" for 300-600 seconds. Stopping any of the other torrents seems not to affect the stuck torrent. While restarting the stuck torrent sometimes can make it unstuck.
Another trial: I disabled the network connection and started a bunch of torrents, then closed Tribler. Then I enabled the network connection back and opened Tribler. Out of the 7 added torrents only a single one started, the others got stuck in such a bad way, that even stopping-starting them did not help. Then I restarted Tribler again and voila, every torrent got started as normal. I tried to reproduce the same trick with network connectivity, a couple of times, but everything worked fine.
As you say, the problem is highly timing-dependent.
you can see which DHT announces are in-flight by calling torrent_handle::post_dht_stats()
. It will post a dht_stats_alert
which has active_requests
and routing_table
fields. It's not the full table, just counters for number of nodes per bucket.
The active_requests
will show both get_peer
lookups as part of announcing torrents, as well as bootstrap
when starting up, and other get_peer
lookups as part of normal routing table maintenance.
I'll try to get more DHT stats using this calls later. Meanwhile, I traced the faulty DHT exchange by hand. The pattern is as follows: Libtorrent asks a DHT peer for some non-stuck infohash (N), the peer responds with a proper DHT answer packet. After 2-15 seconds Libtorrent asks the same peer for the stuck infohash (S). This time, the peer remains silent.
This happened 8 times during the first 32 seconds of the Libtorrent session.
I waited for a few minutes, restarted the stuck infohash (S) and it was able to instantly get the necessary DHT responses from other peers (e.g. peers not including those first 9 peers that became silent).
It looks like we're hitting some kind of DHT request rate-limiting mechanism employed by other BitTorrent clients. :bricks:
I got a full capture saved, BTW.
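For what it's worth, the observed answer-then-silence pattern is exactly what a simple per-source token bucket on the remote side would produce. This is pure speculation about other clients' policy, sketched here only to make the hypothesis concrete:

```python
# Hypothetical sketch of per-source DHT rate limiting that would produce the
# observed pattern: answer the first query, silently drop a follow-up that
# arrives a few seconds later. Speculation about remote nodes, not libtorrent.

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # query silently dropped

# One query per 30 s allowed, burst of 1:
bucket = TokenBucket(rate=1 / 30, burst=1)
print(bucket.allow(0.0))   # True  -- first infohash (N) gets an answer
print(bucket.allow(5.0))   # False -- the "stuck" infohash (S) gets silence
print(bucket.allow(40.0))  # True  -- after backing off, queries succeed again
```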
@arvidn , is it possible to wait after Libtorrent starts until it gets a certain number of peers? If the problem is really about DHT flooding control, the easiest solution for us would be to just wait a few seconds until we get enough peers.
there is an alert for when DHT bootstrapping completes. It might be a good idea to wait for that before starting any torrent. Maybe the simplest solution would be to pause the session and then resume it once bootstrapping completes.
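A sketch of that waiting step (the helper takes injected callables so it runs standalone; with a real session you would pass ses.pop_alerts and something like `lambda a: isinstance(a, lt.dht_bootstrap_alert)`, assuming those names in your bindings version):

```python
# Hedged sketch: poll the alert queue until the DHT bootstrap alert shows up,
# then start (or resume) torrents. Callables are injected so the demo below
# can run without a live libtorrent session.

import time

def wait_for_dht_bootstrap(pop_alerts, is_bootstrap_alert, timeout=30.0, poll=0.1):
    """Return True once a bootstrap alert is seen, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for alert in pop_alerts():
            if is_bootstrap_alert(alert):
                return True
        time.sleep(poll)
    return False

# Demo with stand-in string alerts instead of a live session:
queue = [["listen_succeeded"], ["dht_bootstrap"]]
pop = lambda: queue.pop(0) if queue else []
print(wait_for_dht_bootstrap(pop, lambda a: a == "dht_bootstrap", timeout=2))  # True
```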
I merged the fix for the DHT flood problem.
@arvidn , thank you very much for your help! Hope you got something out of this too.
Is it time almost for RC6 or more @egbertbouman stuff coming?
We're waiting for Egbert. His stuff is important. He will probably finish today or tomorrow.
@synctext When https://github.com/Tribler/tribler/pull/5357 is merged we should be good to go. More stuff is still coming, but that will have to wait until after RC6.
hi all, I have an issue with Tribler: all my downloads regularly regress to 0 kb/sec. They pick up again after a while. I've tried to create logs for this to share with the developers, but even with the log option checked in the options, my log folder stays empty. What did I miss? I'm on Windows 10, using the latest stable version 7.4.4 from the Tribler.org website. Rik