arvidn / libtorrent

an efficient feature complete C++ bittorrent implementation
http://libtorrent.org
Other
5.16k stars 994 forks source link

Websocket trackers stuck updating #7281

Open SilentBot1 opened 1 year ago

SilentBot1 commented 1 year ago

Please provide the following information

libtorrent version (or branch): master@https://github.com/arvidn/libtorrent/commit/cbc4352f30c762740c53b74fef732d756aa313cb

Platform/architecture: Windows 11 x64, Ubuntu 22.04 x64

Compiler and compiler version: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 on Ubuntu 22.04 x64 Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30145 for x64 on Windows 11 x64

Please describe what symptom you see, what you would expect to see instead and how to reproduce it:

When using a compiled version of libtorrent 2.1.0.0 (libtorrent@cbc435) compiled for qBittorrent v4.6.0apha1, in conjunction with bittorrent-tracker (bittorrent-tracker@0b581e) and aquatic-ws (aquatic-ws@5ac8f3), I have noticed that when there are multiple announces at the same time via a websocket tracker, some instances of the tracker get stuck in the updating state.

This issue can be observed if there is a libtorrent client connected to a websocket tracker, when multiple announces are performed at the same time, some torrents have their instance of the websocket tracker stuck in the updating state and no longer provide further announcements, but also never transition out of the updating state without removing/re-adding the tracker, or restarting the client.

This effect gets procedurally worse over time, until all instances are stuck in the updating state, with only websocket ping requests being sent back and fourth between the tracker and client.

Via a packet capture, I have confirmed that all announces that are sent to the websocket tracker have a valid response returned, though some of these still get stuck in the updating state.

In my testing with 10 new, unique, torrents, going to only a local websocket tracker with a 60 second announce interval, my results were:

As a side note to this, after only having a single torrent, this appears to have died after 48 announces / 48 minutes / 480 offers.

Steps to reproduce:

  1. Build libtorrent@https://github.com/arvidn/libtorrent/commit/cbc4352f30c762740c53b74fef732d756aa313cb from main with the -Dwebtorrent flag enabled.
  2. Start a local bittorrent-tracker tracker with the DEBUG=* environment variable set.
  3. Start multiple (10+ used in testing) torrent/magnets which contains the above bittorrent-tracker via libtorrent and start/unpause/restart the client so all announce at the same time. Note: In testing, there were no other peers for any of the torrents.
  4. Verify that certain torrents are no longer announcing to the tracker, and that the torrent tracker state is now stuck as "updating".

If any further information is needed to help in the debugging of this, please don't hesitate to let me know.

SilentBot1 commented 1 year ago

To add some further details to this, after testing this within client_test included in the examples folder, with the --alert_mask=tracker,torrent_log enabled, it appears that after the client becomes unresponsive to the tracker, the following alerts are fired when an offer is received from the tracker:

*** WEBSOCKET_TRACKER_READ
*** RTC signaling processing remote offer
*** RTC signaling creating connection

But there are no further *** RTC signaling generated answer or *** WEBSOCKET_TRACKER_WRITE alerts fired.

Compared to when the tracker is working, all five events are fired in order:

*** WEBSOCKET_TRACKER_READ
*** RTC signaling processing remote offer
*** RTC signaling creating connection
*** RTC signaling generated answer
*** WEBSOCKET_TRACKER_WRITE

Based on this, it appears to go astray between rtc_signaling::process_offer and rtc_signaling::create_connection, before rtc_signaling::on_generated_answer is called.

Sorry to directly ping, but is this something you would possibly know about / be able to assist with @paullouisageneau?

paullouisageneau commented 1 year ago

Thank you for the very detailed report!

I think what's happening here is that libdatachannel encounters an error while creating the connection and generating the answer, but it's silently ignored here. Possible reasons for such a failure are running out of file descriptors or running out of usable UDP ports. Enabling DEBUG_RTC here might give information about what's happening under the hood.

The WebTorrent connection process is extremely greedy in terms of UDP sockets as each offer requires one, therefore each batch creates 10 of them. There might be an issue with cleaning up failed connections resulting in a file descriptor leak or just too many file descriptors (the default limit on Linux is 1024). I'll look into it as soon as I get a moment.

SilentBot1 commented 1 year ago

No worries at all and thank you for looking into this when you have time.

After running the same test as above, now with DEBUG_RTC set to 1, and with the flags --alert_mask=tracker,torrent_log,session_log, I can now see the following debug logs after the issue starts occurring, when an announce comes in after the client stops responding:

*** RTC signaling processing remote offer
*** RTC signaling creating connection
libdatachannel: debug rtc::impl::Certificate::Generate@213: Generating certificate (OpenSSL)
libdatachannel: debug rtc::impl::IceTransport::IceTransport@49: Initializing ICE transport (libjuice)
libdatachannel: info rtc::impl::IceTransport::IceTransport@104: Using STUN server "stun.l.google.com:19302"
libdatachannel: info rtc::impl::PeerConnection::changeSignalingState@1166: Changed signaling state to have-remote-offer
libdatachannel: debug rtc::impl::PeerConnection::processLocalDescription::<lambda_7d5d78206bf58d91eaee80ae3f35c4c1>::operator ()@837: Reciprocating application in local description, mid="0"
libdatachannel: info rtc::impl::PeerConnection::changeSignalingState@1166: Changed signaling state to stable
libdatachannel: info rtc::impl::PeerConnection::changeGatheringState@1153: Changed gathering state to in-progress
libdatachannel: info rtc::impl::IceTransport::LogCallback@342: juice: agent.c:1012: Changing state to gathering
libdatachannel: info rtc::impl::IceTransport::LogCallback@342: juice: agent.c:1012: Changing state to connecting
libdatachannel: info rtc::impl::PeerConnection::changeState@1135: Changed state to connecting
libdatachannel: info rtc::impl::IceTransport::LogCallback@342: juice: agent.c:399: Using STUN server stun.l.google.com:19302

This appears to be missing the further steps, which are logged in a successful offer generation, which are:

libdatachannel: info rtc::impl::IceTransport::LogCallback@342: juice: agent.c:1298: Got STUN mapped address REDACTED:57539 from server
libdatachannel: info rtc::impl::IceTransport::LogCallback@342: juice: agent.c:2277: Candidate gathering done
libdatachannel: info rtc::impl::PeerConnection::changeGatheringState@1153: Changed gathering state to complete
*** RTC signaling generated answer

I'll try and see if I can do a bit of further digging.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ThaUnknown commented 1 year ago

stale, no!

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

paullouisageneau commented 12 months ago

I'll look into it.