Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

all downloads regularly go to 0kb/sec #5319

Closed Rik44444 closed 4 years ago

Rik44444 commented 4 years ago

hi all, i have an issue with Tribler: all my downloads regularly regress to 0 kb/sec. They pick up again after a while. I've tried to create logs for this to share with the developers, but even with the log option checked in the options, my log folder stays empty. What did i miss? I'm on Windows 10, using latest stable version 7.4.4 from the Tribler.org website. Rik

devos50 commented 4 years ago

Thanks for the report! Does this issue still persist?

Rik44444 commented 4 years ago

jep still happening. I'll be happy to upload an error log, as soon as you can help me figure out how to make one :P

aaguilarjimenez commented 4 years ago

This has started to happen to me recently too. I was on 7.4.4 and upgraded to 7.5.0-rc5, still same behavior. It stays at 0 Kb/s for a long time, hours even, then maybe after a while it finally resumes downloading, pretty fast. Sometimes closing the program and opening it up again solves it, at least momentarily.

I just upgraded my Ubuntu MATE from 19.10 to 20.04. I think it coincided with the moment in which the problem started to happen, but cannot confirm it at 100%.

Also, I remember to have had this same problem last year, but then stopped. I don't remember how or why it stopped. Possibly due to updating my Ubuntu or Tribler. I was on Ubuntu MATE 18.04 back then, with whichever was Tribler's latest version.

synctext commented 4 years ago

interesting.. Seems dupe of #5307

synctext commented 4 years ago

Seems Windows 10, Mate, and Mac users have seen this performance bug. We should try to catch this in our app stress tester infrastructure.

ichorid commented 4 years ago

Deluge users have reported a similar problem from time to time. Apparently, libtorrent blocking on disk I/O is the cause.

ichorid commented 4 years ago

@aaguilarjimenez , @Rik44444 , please indicate does it happen only with anonymous downloads or with non-anonymous too?

Rik44444 commented 4 years ago

> @aaguilarjimenez , @Rik44444 , please indicate does it happen only with anonymous downloads or with non-anonymous too?

sorry i've always only tried anonymous. I'll try non-anonymous next time

aaguilarjimenez commented 4 years ago

> @aaguilarjimenez , @Rik44444 , please indicate does it happen only with anonymous downloads or with non-anonymous too?

Same as Rik44444, only ever tried with anon downloads (1 hop). I'll try to download something non-anon when my downloads are frozen and come back at you with the result...

devos50 commented 4 years ago

We are trying to capture this nasty bug on our testing machines so hopefully we will have a solution for this issue soon.

egbertbouman commented 4 years ago

I have no problems with 0-hop downloads, but I am also getting regular speed drops with anonymity enabled. While this is expected to some extent, since circuits only have a limited lifetime, I often see circuits getting destroyed well before their traffic limits are met. The cause seems to be that exit nodes are destroying the circuits intentionally, because my balance is too low.

devos50 commented 4 years ago

It could be that there are not enough slots and users are kicked out (and their circuits are thus destroyed). I'm not sure how often this problem occurs specifically but a user is likely to maintain multiple circuits with distinct exit nodes.

Can we return the reason why the circuit got destroyed to the user somehow? If we get multiple circuit destructions within a short time, we can display a warning to the user (something like: the performance of anonymous downloads might be affected by a low token balance). This will not solve the root problem, since users should be able to increase their balance again (which can currently only be done by relaying traffic), but it would at least inform the user.
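A minimal sketch of the warning idea above, not Tribler or IPv8 code. The class name `CircuitDestroyMonitor` and the `max_events` / `window_seconds` thresholds are hypothetical; the sketch only counts circuit destructions inside a sliding time window and reports when a burst is detected, at which point the GUI could show the low-balance warning.

```python
import time
from collections import deque


class CircuitDestroyMonitor:
    """Flag bursts of circuit destructions (hypothetical helper, not Tribler API)."""

    def __init__(self, max_events=3, window_seconds=60.0, clock=time.monotonic):
        self.max_events = max_events          # assumed threshold
        self.window_seconds = window_seconds  # assumed window length
        self.clock = clock
        self._events = deque()

    def record_destroy(self):
        """Record one destruction; return True if a warning should be shown."""
        now = self.clock()
        self._events.append(now)
        # Drop events that have fallen out of the sliding window.
        while self._events and now - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events) >= self.max_events
```

Calling `record_destroy()` from wherever a circuit teardown is observed would be enough to drive a one-line warning in the downloads view.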

qstokkink commented 4 years ago

> Can we return the reason why the circuit got destroyed to the user somehow?

There is actually a backwards-compatible reason in the wire format for this. We should be able to extend it and deliver it to the user using the Tribler notifier in the TriblerTunnel subclass. Reference:

https://github.com/Tribler/py-ipv8/blob/c5118c3929ce2b0ed9aacdea6561b6a81283e79a/ipv8/messaging/anonymization/community.py#L433-L456

egbertbouman commented 4 years ago

From what I can tell by looking at the code, the most likely reason why an exit would destroy a circuit after creation is a low balance. Other reasons are when the exit is unloading the TunnelCommunity or when we've sent more than 50 packets to the outside world without a response. I guess the latter is also a possibility. Of course, it's best not to guess, so using the reason field sounds like a good idea. Currently, the reason always seems to be 0.

I realize there are almost always multiple circuits available, but switching from one circuit to the next is still going to result in a dip in speed. From what I've experienced the speed almost always goes back to 0. Balancing the load over the existing circuits seems difficult since most of the speed is often coming from a few connections.

ichorid commented 4 years ago

A user reported on our forum:

I was using 7.4.4 for win10. I’ve experienced a weird error these days. At first, all of my downloads couldn’t connect any peers at all. I tried to change settings and all of that. But it didn’t work. So I tried reinstalling it. I didn’t have the luck. So I tried 7.5.0-rc5. Now it seems that the downloads found their peers back, even though they don’t really connect with them nor download anything. But the weirdest thing happened. When I put anything to “zero hoops,” it cannot find a single seed or peer at all. Usually, without anonymity, I could download things much faster. But now anonymity mode doesn’t download things, while non-anonymity mode gives me literally nothing, not even a peer.

Currently, download got much smoother. But still, I have trouble with seeding. Once a file finishes downloading, all the seeds and peers disappear. So Tribler doesn’t seed at all after finishing it. (Of course, my setting’s at unlimited seeding.)

About the non-anonymous download, I just added a torrent, but it doesn’t connect to anything. Funny thing is when I change it into anonymous download, it connects seeds/peers and starts the download.

This is very new. Before, I used non-anonymous download because I didn’t like the speed of anonymous mode. But now the non-anonymous mode doesn’t do anything.

egbertbouman commented 4 years ago

@ichorid We're only seeding through e2e circuits, meaning that you can only upload to other Tribler users. Chances are that there are simply no other Tribler users downloading the torrent. So, I'm not surprised about the problems with seeding.

I've never seen a download that only works if hops>0. Not sure what's happening here.

synctext commented 4 years ago

Bug triggered again, now during "disk checking"!

Tribler 7.4.4 did not have this, 7.5rc5 introduces new behavior that triggers this. Not a Mac only issue, also reported by others on Linux Mate and @Rik44444 on Windows 10.

This is the last big issue we need to resolve before we can finally declare things stable and release 7.5 of Tribler. Tribler RC5 stuck on disk checking with 3 concurrent other swarms active (but not actively downloading, no IO at all, just dead swarms, stuck for weeks at same download percentage). This was not simply disk checking, I deleted a file underneath Tribler. Then starting a disk check to ensure it is in sync with 0% and 100% downloaded files. Worked smoothly before on RC5, but now it triggered some sort of deadlock:

Tribler still used resources and blocked smooth VLC playback. It did not want to shut down and got stuck (i.e. stuck while being stuck). This started a cmdline fight with Tribler that ended with a laptop reboot. Obviously, this type of Tribler bug drives our users away permanently. This is the reason why I'm keen on resolving this, at any cost. The `kill -9 18344` did not help; it turned Tribler into a zombie :zombie: (btw closing #5307, this is now the main issue)

ichorid commented 4 years ago

https://github.com/arvidn/libtorrent/issues/4661 looks like our case

ichorid commented 4 years ago

"set_alert_notify may deadlock when called more than once"

we refactored a lot of Libtorrent-related stuff during 7.5. Another detail is that we refactored Libtorrent wrapper after moving to asyncio. That explains why we only started to experience the problem later.

@egbertbouman , could you please remember if you changed stuff related to alerts/thread during the refactoring?

egbertbouman commented 4 years ago

@ichorid The libtorrent wrapper does everything on the same thread. If it's a threading issue it's more likely because we're calling it from somewhere we shouldn't (i.e., not the asyncio thread).

The basics of how we get the alerts from libtorrent and how we process them are all the same.

BTW, note that we are not actually using set_alert_notify, instead we use pop_alerts to get the alerts from libtorrent.
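One way to test the "called from the wrong thread" hypothesis is a small guard that remembers which thread created it and raises when used from any other. This is only a sketch: `ThreadGuard` is a hypothetical helper, not part of Tribler or libtorrent, and where to place the `check()` calls in the wrapper is left open.

```python
import threading


class ThreadGuard:
    """Remember the creating thread and reject calls from any other thread."""

    def __init__(self):
        self._owner = threading.get_ident()

    def check(self, what="libtorrent call"):
        # Raise loudly instead of silently racing with the owner thread.
        if threading.get_ident() != self._owner:
            raise RuntimeError(f"{what} made from a non-owner thread")
```

Creating the guard on the asyncio thread and calling `guard.check("pop_alerts")` at the top of each wrapper method would turn an accidental cross-thread call into an immediate, traceable error.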

ichorid commented 4 years ago

Well, that was a good attempt... Not our case. At least we now know not to use set_alert_notify.

ichorid commented 4 years ago

@synctext did you use external media to store the files that got stuck? Something like a USB flash drive or a slow external HDD?

arvidn commented 4 years ago

ruling out set_alert_notify(), I think there are two separate symptoms left:

  1. download rate dropping to zero for some time, and later recovers.
  2. The whole process freezes and is unkillable

Case 1: This could have lots of different causes. The disk throughput being saturated could definitely be one of them, but also dropping peers (for whatever reason). I think a long term approach to debugging issues like this would be to have a feature where libtorrent stats counters can be enabled and recorded into a log file (i.e. [post_session_stats](https://libtorrent.org/reference-Session.html#post_session_stats())). Those counters could be saved in a binary format to save space, but the default alert::message() function will print them in text format, which can be parsed by tools/parse_session_stats.py (in the libtorrent repo). It would record the size of the disk write queue for instance, as well as the number of bytes being written to disk.

Case 2: I'm pretty confident that this symptom must always be caused by one or more threads being "stuck" in system calls that cannot be interrupted, or in modes where they cannot be interrupted. I would expect one such case to be a write() call to disk on a system that is under heavy memory pressure and cannot easily allocate another dirty page, and maybe has to do a lot of paging activity to find one. Normally you would expect just the disk thread to get stuck though, but there may be non-obvious dependencies that cause other threads to wait for the disk somehow. Or, under such conditions, any paging activity might cause all threads to get stuck in a page fault.

I think the most reasonable way to resolve issues like this would be to sample the process when this happens, which requires debug symbols, or at least enough debug symbols to resolve some of the function names. This is problematic for end-users to do, since tools for this aren't easy to come by (Mac OS and BSD have `sample`, and I think Windows Task Manager can sample stack traces from the context menu).
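The stats-recording idea from Case 1 could look roughly like the sketch below. It assumes the counters arrive as a plain name-to-integer mapping (for example, built from a session stats alert); the function name `append_stats_snapshot` and the JSON-lines file format are illustrative choices, not something libtorrent or Tribler provides. Only counters that changed since the previous snapshot are written, to keep the log small.

```python
import json
import time


def append_stats_snapshot(path, counters, prev=None, now=None):
    """Append one JSON line holding the counters that changed since `prev`.

    `counters` is assumed to be a dict such as {"disk.queued_write_bytes": 0}.
    Returns a copy of `counters` to pass as `prev` on the next call.
    """
    prev = prev or {}
    changed = {k: v for k, v in counters.items() if prev.get(k) != v}
    record = {"ts": time.time() if now is None else now, "counters": changed}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return dict(counters)
```

Called once per stats alert, this would leave a time series in which a growing `disk.queued_write_bytes` or a flat `net.recv_bytes` is easy to spot after the fact.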

ichorid commented 4 years ago

@arvidn , thank you very much for the clarification. :clap:

ichorid commented 4 years ago

So, I tried to download some torrents on a slow USB stick (non-anonymous). There were 5 popular, highly seeded torrents. The first two I added started immediately, but the speed was very low, like 0.5-15 kb/s. But the remaining 3 torrents were stuck in "Checking" for about 5 minutes. After that, they started too, with the same low speed. This is totally reproducible. I don't know what "Checking" status means in this case. After a couple more minutes the speed for every download will jump up to hundreds of kb/s and for some even to 2 mb/s. Then after a minute, it will again drop to a few bytes per second.

@arvidn, is it normal?

ichorid commented 4 years ago

@Rik44444 , @aaguilarjimenez , could you please indicate what kind of media you used while experiencing the bug? Did you download directly to an external USB stick or something like that?

arvidn commented 4 years ago

I'm not sure what "checking" means either. libtorrent has two checking modes. Checking resume data and checking files.

Since it sounds like you didn't have these files already, it probably means checking resume data. That name is a bit misleading since it does things even when you don't have resume data.

It will stat() every file in the torrent at the specified save directory, to see if any file is there. It will also do various kinds of sanity checking of resume data, if it was provided.

If you have a torrent with a lot of files to be stat()ed on a slow drive, I suppose that could take a long time. 5 minutes sounds very long though. It normally takes a fraction of a second. However, if any files were found, it would transition into file-checking mode, where it actually has to read all data from the files and check their hashes.

Whether it's normal or not, I don't know. It probably depends on what the bottleneck is.
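To get a rough feel for how long the stat() pass described above takes on a given drive, something like the following could be run against the save directory. It is only a sketch: `time_stat_calls` is a hypothetical helper and it approximates the described step with `os.stat`; it does not reproduce libtorrent's actual checking code.

```python
import os
import time


def time_stat_calls(save_dir, relative_paths):
    """Return (elapsed_seconds, existing_paths) after stat()-ing each file."""
    existing = []
    start = time.perf_counter()
    for rel in relative_paths:
        try:
            os.stat(os.path.join(save_dir, rel))
        except OSError:
            continue  # file is missing or unreadable
        existing.append(rel)
    return time.perf_counter() - start, existing
```

If the elapsed time is tiny but any paths come back as existing, the long "Checking" phase is more likely the file-checking (hashing) mode than the stat() pass.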

synctext commented 4 years ago

yes, relatively slow external USB. plus: this was a fault when data file checking..

ichorid commented 4 years ago

@synctext , we've seen exactly the same problem last year with your laptop using a slow external USB flash drive. I was able to reproduce it today by using a slow external drive. Checking was basically stuck. Also, as pointed out by @arvidn, "Checking" at 0% means libtorrent is trying to stat() the files at the disk in question. If the disk is very busy with i/o, it will probably block.

Was it a USB Flash drive or an HDD? what model (controller, etc.)? what filesystem does it use? Did you try to use it simultaneously with some other process accessing it, e.g. watching Tribler VLC streaming from it?

ichorid commented 4 years ago

@synctext , also, could it be that your GigaChannel database was on the same USB drive? Was Tribler processing/downloading some channels at that moment (especially, big ones)?

synctext commented 4 years ago

with 7.4.4 this checking never gave issues, plus no 0kbps issues with healthy big swarms. no idea on Channel parsing activity. This should be inside Tribler RC6 debug panel then please, channel inserts in last seconds.

synctext commented 4 years ago

btw channel.db is inside .Tribler dir on fast SSD, not on slower USB used for bulk swarm storage.

Reproduced, coool!

ichorid commented 4 years ago

Reliably reproduced on Windows 10 with both 7.4.4 and 7.5. Note that DHT checking works!

Downloads are stuck for the first few minutes after adding them, then resume.

ichorid commented 4 years ago

I suspect this is not one, but several problems resulting in slightly different, albeit very similar effects. The first one has something to do with DHT - peers connecting. The second one is about libtorrent getting stuck due to slow disk access.

arvidn commented 4 years ago

Is the USB stick formatted with FAT by any chance? i.e. not supporting sparse files?

synctext commented 4 years ago

yes! Dos format, Linux and Mac compatible. But again, it used to work fine.

arvidn commented 4 years ago

is the difference libtorrent-1.1.x and libtorrent-1.2.x?

arvidn commented 4 years ago

I ran a test downloading a large file into a usb stick on linux using client_test (a pretty thin wrapper around libtorrent).

Here are a few observations:

In the past, libtorrent used to have a dedicated thread just to close files (here), specifically for this reason. It doesn't anymore though, and maybe it was a mistake to remove that. However, it was removed such a long time ago that I find it unlikely this is what's affecting you.

A few ways I can think of to mitigate this would be to:

It seems this ticket is mostly about windows, and I don't feel confident assuming windows has the same behavior as I've experienced on Linux and Mac OS, but this might at least help with ideas of what's going on.

ichorid commented 4 years ago

@arvidn , thank you very much for your insights! This thread now seems to be about multiple issues that manifest differently on different platforms, but any of them results in 0kb/s downloads.

ichorid commented 4 years ago

Latest 7.5-rc5, Win10, Libtorrent version 1.2.1. Run Tribler, add 3 highly popular torrents by infohashes. Names resolved fine. Then the torrents got stuck at "Waiting for metainfo" and remained like that for half an hour. Here are the Libtorrent session stats. Note that there are several Channels torrents in the background, already downloaded and seeded, zero connections.

Libtorrent stats

```json
{
  "dht.dht_allocated_observers" : 0,
  "dht.dht_announce_peer_in" : 2,
  "dht.dht_announce_peer_out" : 16,
  "dht.dht_bytes_in" : 112105,
  "dht.dht_bytes_out" : 163770,
  "dht.dht_find_node_in" : 10,
  "dht.dht_find_node_out" : 0,
  "dht.dht_get_in" : 0,
  "dht.dht_get_out" : 0,
  "dht.dht_get_peers_in" : 76,
  "dht.dht_get_peers_out" : 1291,
  "dht.dht_immutable_data" : 0,
  "dht.dht_invalid_announce" : 0,
  "dht.dht_invalid_find_node" : 0,
  "dht.dht_invalid_get" : 0,
  "dht.dht_invalid_get_peers" : 0,
  "dht.dht_invalid_put" : 0,
  "dht.dht_invalid_sample_infohashes" : 0,
  "dht.dht_messages_in" : 485,
  "dht.dht_messages_in_dropped" : 64,
  "dht.dht_messages_out" : 1401,
  "dht.dht_messages_out_dropped" : 0,
  "dht.dht_mutable_data" : 0,
  "dht.dht_node_cache" : 74,
  "dht.dht_nodes" : 186,
  "dht.dht_peers" : 2,
  "dht.dht_ping_in" : 0,
  "dht.dht_ping_out" : 6,
  "dht.dht_put_in" : 0,
  "dht.dht_put_out" : 0,
  "dht.dht_sample_infohashes_in" : 0,
  "dht.dht_sample_infohashes_out" : 0,
  "dht.dht_torrents" : 2,
  "disk.arc_mfu_ghost_size" : 0,
  "disk.arc_mfu_size" : 0,
  "disk.arc_mru_ghost_size" : 0,
  "disk.arc_mru_size" : 489,
  "disk.arc_volatile_size" : 0,
  "disk.arc_write_size" : 0,
  "disk.blocked_disk_jobs" : 0,
  "disk.disk_blocks_in_use" : 1955,
  "disk.disk_hash_time" : 80606,
  "disk.disk_job_time" : 122801,
  "disk.disk_read_time" : 0,
  "disk.disk_write_time" : 42195,
  "disk.num_blocks_cache_hits" : 0,
  "disk.num_blocks_hashed" : 1955,
  "disk.num_blocks_read" : 0,
  "disk.num_blocks_written" : 1955,
  "disk.num_fenced_check_fastresume" : 0,
  "disk.num_fenced_clear_piece" : 0,
  "disk.num_fenced_delete_files" : 0,
  "disk.num_fenced_file_priority" : 0,
  "disk.num_fenced_flush_hashed" : 0,
  "disk.num_fenced_flush_piece" : 0,
  "disk.num_fenced_flush_storage" : 0,
  "disk.num_fenced_hash" : 0,
  "disk.num_fenced_load_torrent" : 0,
  "disk.num_fenced_move_storage" : 0,
  "disk.num_fenced_read" : 0,
  "disk.num_fenced_release_files" : 0,
  "disk.num_fenced_rename_file" : 0,
  "disk.num_fenced_save_resume_data" : 0,
  "disk.num_fenced_stop_torrent" : 0,
  "disk.num_fenced_tick_storage" : 0,
  "disk.num_fenced_trim_cache" : 0,
  "disk.num_fenced_write" : 0,
  "disk.num_jobs" : 0,
  "disk.num_read_back" : 0,
  "disk.num_read_jobs" : 0,
  "disk.num_read_ops" : 0,
  "disk.num_running_disk_jobs" : 0,
  "disk.num_running_threads" : 0,
  "disk.num_write_jobs" : 0,
  "disk.num_write_ops" : 489,
  "disk.num_writing_threads" : 0,
  "disk.pinned_blocks" : 0,
  "disk.queued_disk_jobs" : 0,
  "disk.queued_write_bytes" : 0,
  "disk.read_cache_blocks" : 1955,
  "disk.request_latency" : 0,
  "disk.write_cache_blocks" : 0,
  "net.has_incoming_connections" : 1,
  "net.limiter_down_bytes" : 0,
  "net.limiter_down_queue" : 0,
  "net.limiter_up_bytes" : 0,
  "net.limiter_up_queue" : 0,
  "net.on_accept_counter" : 3,
  "net.on_disk_counter" : 531,
  "net.on_disk_queue_counter" : 0,
  "net.on_lsd_counter" : 0,
  "net.on_lsd_peer_counter" : 11451,
  "net.on_read_counter" : 3402,
  "net.on_tick_counter" : 528,
  "net.on_udp_counter" : 13243,
  "net.on_write_counter" : 10356,
  "net.recv_bytes" : 34188365,
  "net.recv_failed_bytes" : 0,
  "net.recv_ip_overhead_bytes" : 1438800,
  "net.recv_payload_bytes" : 34110239,
  "net.recv_redundant_bytes" : 2081941,
  "net.recv_tracker_bytes" : 0,
  "net.sent_bytes" : 423562,
  "net.sent_ip_overhead_bytes" : 1492448,
  "net.sent_payload_bytes" : 0,
  "net.sent_tracker_bytes" : 0,
  "peer.aborted_peers" : 0,
  "peer.addrinuse_peers" : 0,
  "peer.banned_for_hash_failure" : 0,
  "peer.boost_connection_attempts" : 66,
  "peer.broken_pipe_peers" : 0,
  "peer.buffer_peers" : 0,
  "peer.cancelled_piece_requests" : 0,
  "peer.choked_piece_requests" : 0,
  "peer.connaborted_peers" : 0,
  "peer.connection_attempt_loops" : 23011,
  "peer.connection_attempts" : 626,
  "peer.connect_timeouts" : 321,
  "peer.connrefused_peers" : 4,
  "peer.connreset_peers" : 19,
  "peer.disconnected_peers" : 687,
  "peer.eof_peers" : 33,
  "peer.error_encrypted_peers" : 0,
  "peer.error_incoming_peers" : 0,
  "peer.error_outgoing_peers" : 41,
  "peer.error_peers" : 0,
  "peer.error_rc4_peers" : 0,
  "peer.error_tcp_peers" : 40,
  "peer.error_utp_peers" : 1,
  "peer.incoming_connections" : 1,
  "peer.invalid_arg_peers" : 0,
  "peer.invalid_piece_requests" : 0,
  "peer.max_piece_requests" : 0,
  "peer.missed_connection_attempts" : 0,
  "peer.no_access_peers" : 0,
  "peer.no_memory_peers" : 0,
  "peer.no_peer_connection_attempts" : 3373,
  "peer.notconnected_peers" : 0,
  "peer.num_banned_peers" : 0,
  "peer.num_http_proxy_peers" : 0,
  "peer.num_i2p_peers" : 0,
  "peer.num_peers_connected" : 6,
  "peer.num_peers_down_disk" : 0,
  "peer.num_peers_down_interested" : 0,
  "peer.num_peers_down_requests" : 55,
  "peer.num_peers_down_unchoked" : 0,
  "peer.num_peers_end_game" : 0,
  "peer.num_peers_half_open" : 0,
  "peer.num_peers_up_disk" : 0,
  "peer.num_peers_up_interested" : 0,
  "peer.num_peers_up_requests" : 0,
  "peer.num_peers_up_unchoked" : 0,
  "peer.num_peers_up_unchoked_all" : 0,
  "peer.num_peers_up_unchoked_optimistic" : 0,
  "peer.num_socks5_peers" : 0,
  "peer.num_ssl_http_proxy_peers" : 0,
  "peer.num_ssl_peers" : 0,
  "peer.num_ssl_socks5_peers" : 0,
  "peer.num_ssl_utp_peers" : 0,
  "peer.num_tcp_peers" : 6,
  "peer.num_utp_peers" : 0,
  "peer.perm_peers" : 0,
  "peer.piece_rejects" : 0,
  "peer.piece_requests" : 0,
  "peer.timeout_peers" : 0,
  "peer.too_many_peers" : 0,
  "peer.transport_timeout_peers" : 301,
  "peer.uninteresting_peers" : 14,
  "peer.unreachable_peers" : 0,
  "picker.end_game_piece_picks" : 0,
  "picker.hash_fail_piece_picks" : 0,
  "picker.incoming_piece_picks" : 610,
  "picker.incoming_redundant_piece_picks" : 95,
  "picker.interesting_piece_picks" : 132,
  "picker.piece_picker_busy_loops" : 336,
  "picker.piece_picker_partial_loops" : 9,
  "picker.piece_picker_rand_loops" : 0,
  "picker.piece_picker_rand_start_loops" : 329,
  "picker.piece_picker_rare_loops" : 52,
  "picker.piece_picker_reverse_rare_loops" : 0,
  "picker.piece_picker_sequential_loops" : 0,
  "picker.piece_picker_suggest_loops" : 0,
  "picker.reject_piece_picks" : 0,
  "picker.snubbed_piece_picks" : 0,
  "picker.unchoke_piece_picks" : 77,
  "ses.non_filter_torrents" : 9,
  "ses.num_checking_torrents" : 0,
  "ses.num_downloading_torrents" : 3,
  "ses.num_error_torrents" : 0,
  "ses.num_have_pieces" : 502,
  "ses.num_incoming_allowed_fast" : 0,
  "ses.num_incoming_bitfield" : 0,
  "ses.num_incoming_cancel" : 0,
  "ses.num_incoming_choke" : 5,
  "ses.num_incoming_dht_port" : 75,
  "ses.num_incoming_extended" : 190,
  "ses.num_incoming_ext_handshake" : 101,
  "ses.num_incoming_have" : 0,
  "ses.num_incoming_have_all" : 0,
  "ses.num_incoming_have_none" : 0,
  "ses.num_incoming_interested" : 0,
  "ses.num_incoming_metadata" : 9,
  "ses.num_incoming_not_interested" : 0,
  "ses.num_incoming_pex" : 72,
  "ses.num_incoming_piece" : 2065,
  "ses.num_incoming_reject" : 0,
  "ses.num_incoming_request" : 0,
  "ses.num_incoming_suggest" : 0,
  "ses.num_incoming_unchoke" : 144,
  "ses.num_outgoing_allowed_fast" : 0,
  "ses.num_outgoing_bitfield" : 126,
  "ses.num_outgoing_cancel" : 347,
  "ses.num_outgoing_choke" : 19,
  "ses.num_outgoing_dht_port" : 89,
  "ses.num_outgoing_extended" : 174,
  "ses.num_outgoing_ext_handshake" : 101,
  "ses.num_outgoing_have" : 30394,
  "ses.num_outgoing_have_all" : 14,
  "ses.num_outgoing_have_none" : 31,
  "ses.num_outgoing_interested" : 78,
  "ses.num_outgoing_metadata" : 17,
  "ses.num_outgoing_not_interested" : 78,
  "ses.num_outgoing_pex" : 79,
  "ses.num_outgoing_piece" : 0,
  "ses.num_outgoing_reject" : 0,
  "ses.num_outgoing_request" : 2130,
  "ses.num_outgoing_suggest" : 0,
  "ses.num_outgoing_unchoke" : 42,
  "ses.num_outstanding_accept" : 1,
  "ses.num_piece_failed" : 0,
  "ses.num_piece_passed" : 502,
  "ses.num_queued_download_torrents" : 0,
  "ses.num_queued_seeding_torrents" : 0,
  "ses.num_seeding_torrents" : 6,
  "ses.num_stopped_torrents" : 0,
  "ses.num_total_pieces_added" : 502,
  "ses.num_unchoke_slots" : 8,
  "ses.num_upload_only_torrents" : 0,
  "ses.torrent_evicted_counter" : 0,
  "ses.waste_piece_cancelled" : 0,
  "ses.waste_piece_closing" : 282123,
  "ses.waste_piece_end_game" : 1261568,
  "ses.waste_piece_seed" : 0,
  "ses.waste_piece_timed_out" : 16384,
  "ses.waste_piece_unknown" : 521866,
  "sock_bufs.socket_recv_size10" : 828,
  "sock_bufs.socket_recv_size11" : 379,
  "sock_bufs.socket_recv_size12" : 249,
  "sock_bufs.socket_recv_size13" : 214,
  "sock_bufs.socket_recv_size14" : 395,
  "sock_bufs.socket_recv_size15" : 449,
  "sock_bufs.socket_recv_size16" : 20,
  "sock_bufs.socket_recv_size17" : 0,
  "sock_bufs.socket_recv_size18" : 0,
  "sock_bufs.socket_recv_size19" : 0,
  "sock_bufs.socket_recv_size20" : 0,
  "sock_bufs.socket_recv_size3" : 244,
  "sock_bufs.socket_recv_size4" : 36,
  "sock_bufs.socket_recv_size5" : 46,
  "sock_bufs.socket_recv_size6" : 199,
  "sock_bufs.socket_recv_size7" : 126,
  "sock_bufs.socket_recv_size8" : 155,
  "sock_bufs.socket_recv_size9" : 84,
  "sock_bufs.socket_send_size10" : 0,
  "sock_bufs.socket_send_size11" : 0,
  "sock_bufs.socket_send_size12" : 0,
  "sock_bufs.socket_send_size13" : 0,
  "sock_bufs.socket_send_size14" : 0,
  "sock_bufs.socket_send_size15" : 0,
  "sock_bufs.socket_send_size16" : 0,
  "sock_bufs.socket_send_size17" : 0,
  "sock_bufs.socket_send_size18" : 0,
  "sock_bufs.socket_send_size19" : 0,
  "sock_bufs.socket_send_size20" : 0,
  "sock_bufs.socket_send_size3" : 5934,
  "sock_bufs.socket_send_size4" : 1253,
  "sock_bufs.socket_send_size5" : 1294,
  "sock_bufs.socket_send_size6" : 1100,
  "sock_bufs.socket_send_size7" : 529,
  "sock_bufs.socket_send_size8" : 168,
  "sock_bufs.socket_send_size9" : 78,
  "utp.num_utp_close_wait" : 0,
  "utp.num_utp_connected" : 0,
  "utp.num_utp_deleted" : 0,
  "utp.num_utp_fin_sent" : 0,
  "utp.num_utp_idle" : 0,
  "utp.num_utp_syn_sent" : 0,
  "utp.utp_fast_retransmit" : 0,
  "utp.utp_invalid_pkts_in" : 0,
  "utp.utp_packet_loss" : 0,
  "utp.utp_packet_resend" : 114,
  "utp.utp_packets_in" : 33719,
  "utp.utp_packets_out" : 21163,
  "utp.utp_payload_pkts_in" : 26065,
  "utp.utp_payload_pkts_out" : 7652,
  "utp.utp_redundant_pkts_in" : 96,
  "utp.utp_samples_above_target" : 94,
  "utp.utp_samples_below_target" : 7443,
  "utp.utp_timeout" : 403
}
```

Libtorrent settings

```json
{
  "active_checking" : 1,
  "active_dht_limit" : 88,
  "active_downloads" : 3,
  "active_limit" : 500,
  "active_loaded_limit" : 0,
  "active_lsd_limit" : 60,
  "active_seeds" : 5,
  "active_tracker_limit" : 1600,
  "aio_max" : 300,
  "aio_threads" : 4,
  "alert_mask" : 1,
  "alert_queue_size" : 1000,
  "allowed_enc_level" : 3,
  "allowed_fast_set_size" : 5,
  "allow_i2p_mixed" : false,
  "allow_multiple_connections_per_ip" : 0,
  "allow_partial_disk_writes" : true,
  "always_send_user_agent" : false,
  "announce_crypto_support" : true,
  "announce_double_nat" : false,
  "announce_ip" : "",
  "announce_to_all_tiers" : false,
  "announce_to_all_trackers" : false,
  "anonymous_mode" : false,
  "apply_ip_filter_to_trackers" : true,
  "auto_manage_interval" : 30,
  "auto_manage_prefer_seeds" : false,
  "auto_manage_startup" : 60,
  "auto_scrape_interval" : 1800,
  "auto_scrape_min_interval" : 300,
  "auto_sequential" : true,
  "ban_web_seeds" : true,
  "broadcast_lsd" : true,
  "cache_buffer_chunk_size" : 0,
  "cache_expiry" : 300,
  "cache_size" : 2048,
  "cache_size_volatile" : 256,
  "checking_mem_usage" : 1024,
  "choking_algorithm" : 0,
  "close_file_interval" : 120,
  "close_redundant_connections" : true,
  "coalesce_reads" : true,
  "coalesce_writes" : true,
  "connections_limit" : 200,
  "connection_speed" : 30,
  "connections_slack" : 10,
  "connect_seed_every_n_download" : 10,
  "contiguous_recv_buffer" : true,
  "decrease_est_reciprocation_rate" : 3,
  "default_cache_min_age" : 1,
  "default_est_reciprocation_rate" : 16000,
  "dht_announce_interval" : 900,
  "dht_bootstrap_nodes" : "dht.libtorrent.org:25401",
  "dht_upload_rate_limit" : 4000,
  "disable_hash_checks" : false,
  "disk_io_read_mode" : 0,
  "disk_io_write_mode" : 0,
  "dont_count_slow_torrents" : true,
  "dont_flush_write_cache" : false,
  "download_rate_limit" : 0,
  "enable_dht" : false,
  "enable_incoming_tcp" : true,
  "enable_incoming_utp" : true,
  "enable_ip_notifier" : true,
  "enable_lsd" : false,
  "enable_natpmp" : false,
  "enable_outgoing_tcp" : true,
  "enable_outgoing_utp" : true,
  "enable_upnp" : false,
  "file_checks_delay_per_block" : 0,
  "file_pool_size" : 40,
  "force_proxy" : false,
  "guided_read_cache" : false,
  "half_open_limit" : 0,
  "handshake_client_version" : "Tribler/7.5.0-rc5-14-g7b166f0d7/...",
  "handshake_timeout" : 10,
  "hashing_threads" : 1,
  "i2p_hostname" : "",
  "i2p_port" : 0,
  "ignore_limits_on_local_network" : true,
  "ignore_resume_timestamps" : false,
  "inactive_down_rate" : 2048,
  "inactive_up_rate" : 2048,
  "inactivity_timeout" : 600,
  "incoming_starts_queued_torrents" : false,
  "increase_est_reciprocation_rate" : 20,
  "in_enc_policy" : 1,
  "initial_picker_threshold" : 4,
  "lazy_bitfields" : false,
  "listen_interfaces" : "0.0.0.0:15670",
  "listen_queue_size" : 5,
  "listen_system_port_fallback" : true,
  "local_download_rate_limit" : 0,
  "local_service_announce_interval" : 300,
  "local_upload_rate_limit" : 0,
  "lock_disk_cache" : false,
  "lock_files" : false,
  "low_prio_disk" : true,
  "max_allowed_in_request_queue" : 500,
  "max_failcount" : 3,
  "max_http_recv_buffer_size" : 835584,
  "max_metadata_size" : 31457280,
  "max_out_request_queue" : 500,
  "max_paused_peerlist_size" : 1000,
  "max_peerlist_size" : 3000,
  "max_peer_recv_buffer_size" : 2097152,
  "max_pex_peers" : 50,
  "max_queued_disk_bytes" : 1048576,
  "max_rejects" : 50,
  "max_retry_port_bind" : 10,
  "max_suggest_pieces" : 16,
  "max_web_seed_connections" : 3,
  "min_announce_interval" : 300,
  "min_reconnect_time" : 60,
  "mixed_mode_algorithm" : 1,
  "mmap_cache" : "",
  "network_threads" : 0,
  "no_atime_storage" : true,
  "no_connect_privileged_ports" : false,
  "no_recheck_incomplete_resume" : false,
  "num_optimistic_unchoke_slots" : 0,
  "num_outgoing_ports" : 1,
  "num_want" : 200,
  "optimistic_disk_retry" : 600,
  "optimistic_unchoke_interval" : 30,
  "out_enc_policy" : 1,
  "outgoing_interfaces" : "",
  "outgoing_port" : 0,
  "peer_connect_timeout" : 15,
  "peer_fingerprint" : "...",
  "peer_timeout" : 120,
  "peer_tos" : 32,
  "peer_turnover" : 4,
  "peer_turnover_cutoff" : 90,
  "peer_turnover_interval" : 300,
  "piece_timeout" : 20,
  "predictive_piece_announce" : 0,
  "prefer_rc4" : true,
  "prefer_udp_trackers" : true,
  "prioritize_partial_pieces" : false,
  "proxy_hostname" : "",
  "proxy_hostnames" : true,
  "proxy_password" : "",
  "proxy_peer_connections" : true,
  "proxy_port" : 0,
  "proxy_tracker_connections" : true,
  "proxy_type" : 0,
  "proxy_username" : "",
  "rate_limit_ip_overhead" : true,
  "rate_limit_utp" : true,
  "read_cache_line_size" : 32,
  "recv_socket_buffer_size" : 0,
  "report_redundant_bytes" : true,
  "report_true_downloaded" : false,
  "report_web_seed_downloads" : true,
  "request_queue_time" : 3,
  "request_timeout" : 60,
  "resolver_cache_timeout" : 1200,
  "seed_choking_algorithm" : 0,
  "seeding_outgoing_connections" : true,
  "seeding_piece_quota" : 20,
  "seed_time_limit" : 86400,
  "seed_time_ratio_limit" : 700,
  "send_buffer_low_watermark" : 10240,
  "send_buffer_watermark" : 512000,
  "send_buffer_watermark_factor" : 50,
  "send_redundant_have" : true,
  "send_socket_buffer_size" : 0,
  "share_mode_target" : 3,
  "share_ratio_limit" : 200,
  "smooth_connects" : true,
  "ssl_listen" : 0,
  "stop_tracker_timeout" : 5,
  "strict_end_game_mode" : true,
  "strict_super_seeding" : false,
  "suggest_mode" : 0,
  "support_merkle_torrents" : true,
  "support_share_mode" : true,
  "tick_interval" : 500,
  "torrent_connect_boost" : 30,
  "tracker_backoff" : 250,
  "tracker_completion_timeout" : 30,
  "tracker_maximum_response_length" : 1048576,
  "tracker_receive_timeout" : 10,
  "udp_tracker_token_expiry" : 60,
  "unchoke_interval" : 15,
  "unchoke_slots_limit" : 8,
  "upload_rate_limit" : 0,
  "upnp_ignore_nonrouters" : false,
  "urlseed_max_request_bytes" : 16777216,
  "urlseed_pipeline_size" : 5,
  "urlseed_timeout" : 20,
  "urlseed_wait_retry" : 30,
  "use_dht_as_fallback" : false,
  "use_disk_cache_pool" : false,
  "use_disk_read_ahead" : true,
  "use_parole_mode" : true,
  "user_agent" : "...",
  "use_read_cache" : true,
  "use_write_cache" : true,
  "utp_connect_timeout" : 3000,
  "utp_cwnd_reduce_timer" : 100,
  "utp_delayed_ack" : 0,
  "utp_fin_resends" : 2,
  "utp_gain_factor" : 3000,
  "utp_loss_multiplier" : 50,
  "utp_min_timeout" : 500,
  "utp_num_resends" : 3,
  "utp_syn_resends" : 2,
  "utp_target_delay" : 100,
  "volatile_read_cache" : false,
  "web_seed_name_lookup_retry" : 1800,
  "whole_pieces_threshold" : 20,
  "write_cache_line_size" : 16
}
```

@arvidn , can you make something out of these stats?

ichorid commented 4 years ago

Also, I noticed that I cannot reproduce this bug on my PC unless another instance of Tribler is running on my seedbox, which sits in the same home subnet behind the same NAT. Tribler running on the seedbox -> the bug sometimes triggers on the PC. Tribler not running on the seedbox -> the bug does not trigger on the PC.

@synctext , could it be that another instance of Tribler or another torrent client was running in your home network when you witnessed the freeze?

arvidn commented 4 years ago

those counters suggest that the DHT was properly bootstrapped and connected to the network. They also suggest that there were peers connected, and no peers failing with an error (which I would have expected if there were some issue with the ut_metadata extension, which downloads the .torrent from peers). It's not obvious what's going on. Enabling logging would probably help.

ichorid commented 4 years ago

@arvidn , is it possible to enable Libtorrent logging from Python bindings?

arvidn commented 4 years ago

yes, set the alert_mask setting to include the peer_log_notification flag (documented here)
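A minimal sketch of what that could look like from Python, assuming the bindings expose the flags under `lt.alert.category_t` and that the enum values convert with `int()`. The helper names (`build_alert_mask`, `enable_peer_logging`, `drain_log_alerts`, `demo`) and the extra `error_notification` category are illustrative choices for this sketch, not part of Tribler or libtorrent.

```python
from functools import reduce
from operator import or_

# Alert categories to enable. peer_log_notification is the flag named above;
# error_notification is an extra category picked for this sketch.
LOG_CATEGORIES = ("error_notification", "peer_log_notification")


def build_alert_mask(category_enum, names=LOG_CATEGORIES):
    """OR together the named flags looked up on category_enum."""
    return reduce(or_, (int(getattr(category_enum, name)) for name in names), 0)


def enable_peer_logging(session, category_enum):
    """Apply the mask to a session and return it (hypothetical helper)."""
    mask = build_alert_mask(category_enum)
    session.apply_settings({"alert_mask": mask})
    return mask


def drain_log_alerts(session):
    """Return the message text of every alert currently queued."""
    return [alert.message() for alert in session.pop_alerts()]


def demo():
    """Not called automatically; needs the libtorrent Python bindings."""
    import libtorrent as lt

    ses = lt.session()
    enable_peer_logging(ses, lt.alert.category_t)
    for line in drain_log_alerts(ses):
        print(line)
```

Note that applying `alert_mask` this way replaces the previous mask, so a real client would probably OR these flags into whatever categories it already listens for.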

ichorid commented 4 years ago

BTW, I've checked Libtorrent versions used in our latest releases:

| Tribler version | MacOS | Win64 |
| --- | --- | --- |
| Tribler 7.4.4 | 1.2.2 | 1.2.1 |
| Tribler 7.5-rc5 | 1.2.2 | 1.2.1 |

ichorid commented 4 years ago

:skull: Can't reproduce it anymore :skull: Dozens of restarts, everything runs perfectly on the same PC where yesterday it was triggered every second run :dizzy_face:

I suspect it is only triggered by a very specific race condition. In that case, using stable, deterministic environments (like VMs and the app tester) will not suffice. To catch it for good, we'll probably have to produce a specialized "debug" release with debug symbols enabled, full logging, and a watchdog that automatically sends the last 5 minutes of logs and some kind of Libtorrent stack trace to the Reporter server.
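A rough sketch of the logging half of such a watchdog, using only the Python standard library. Everything here is hypothetical: the class names `RollingLogHandler` and `StallWatchdog`, the 300-second window and the 60-second grace period are assumptions for illustration, and sending the dump to the Reporter server is left out entirely.

```python
import logging
import time
from collections import deque


class RollingLogHandler(logging.Handler):
    """Keep only the formatted records from the last `window` seconds."""

    def __init__(self, window=300.0):
        super().__init__()
        self.window = window
        self._records = deque()  # (timestamp, formatted line) pairs

    def emit(self, record):
        self._records.append((record.created, self.format(record)))
        self._trim(record.created)

    def _trim(self, now):
        # Drop entries that are older than the window.
        while self._records and now - self._records[0][0] > self.window:
            self._records.popleft()

    def dump(self, now=None):
        """Return the lines still inside the window."""
        self._trim(time.time() if now is None else now)
        return [line for _, line in self._records]


class StallWatchdog:
    """Flag a stall once the download rate stays at zero for `grace` seconds."""

    def __init__(self, grace=60.0):
        self.grace = grace
        self._zero_since = None

    def update(self, rate, now):
        if rate > 0:
            self._zero_since = None
            return False
        if self._zero_since is None:
            self._zero_since = now
        return now - self._zero_since >= self.grace
```

The idea would be to attach the handler to the root logger, feed the watchdog the total download rate on every tick, and ship `dump()` when `update()` first returns `True`.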

ichorid commented 4 years ago

Accidentally reproduced on Ubuntu at home (Libtorrent 1.2.2): image

ichorid commented 4 years ago

You know what's funny? When I restarted Tribler, only the OpenSUSE torrent was downloading. Ubuntu was in a "stopped" state. :thinking:

egbertbouman commented 4 years ago

@ichorid Not sure why a download would be "stopped" directly after adding it. Are you sure that you didn't try to stop the download while it was still waiting for the metainfo?

I just did some testing, and it could be that the problem with the download being stuck in "Waiting for metadata" is because we're not receiving any peers from the DHT. To test this I added a call to force_dht_announce at the bottom of Download.on_add_torrent_alert, and the download seems to start a lot faster. Could possibly be due to the fact we just did a DHT request when we opened the download dialog.
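A minimal sketch of that experiment, assuming the alert carries the torrent handle as `alert.handle`. The `Download` class below is a stripped-down stand-in written for illustration, not Tribler's actual class, and the existing body of the callback is omitted.

```python
class Download:
    """Stripped-down stand-in for Tribler's Download class (hypothetical)."""

    def __init__(self):
        self.handle = None

    def on_add_torrent_alert(self, alert):
        # The real callback does more setup; only the handle assignment is shown.
        self.handle = alert.handle

        # Experimental addition at the bottom of the callback: ask libtorrent
        # to announce to the DHT right away instead of waiting for the next
        # scheduled announce.
        if self.handle.is_valid():
            self.handle.force_dht_announce()
```

Whether this belongs in the callback permanently is a separate question; the sketch only mirrors the test described above.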

ichorid commented 4 years ago

@egbertbouman , I'm sure I did not press the "stop" button before closing Tribler. Also, I had disabled the download dialog through the settings menu beforehand to reduce the amount of clicking...