RC_2_0 - low download speed compared to RC_1_2

Alexey104 commented 2 years ago

libtorrent version (or branch): 2.0.4.0
platform/architecture: ArchLinux_x64
compiler and compiler version: GCC 11.1.0

Hello! I am using the latest Deluge (2.0.4.dev85) and have just switched from LT 1.2.14 to 2.0.4. I use the following LT settings (the others are defaults):

announce_to_all_tiers = true;
inactive_down_rate = 102400;
inactive_up_rate = 102400;
max_allowed_in_request_queue = 2000;
max_out_request_queue = 2000;
mixed_mode_algorithm = prefer_tcp;
send_buffer_low_watermark = 256 * 1024;
send_buffer_watermark = 4 * 1024 * 1024;

I've been using these settings for a long time, and they have always worked well for me with RC_1_2. My ISP speed is 600 Mb/s, and with the settings above I get 40-65 MB/s download speed on good torrents (with many seeds) when using LT 1.2.14. With LT 2.0.4, my download speed on the same torrents with the same settings is only 10-25 MB/s, despite thousands of seeds.

I've compiled Libtorrent with the following command:

b2 -j $(nproc) crypto=openssl cxxstd=14

Python bindings:

cd ./bindings/python &&
b2 -j $(nproc) python=3.9 address-model=64 toolset=gcc install_module

Yes, I use a slow HDD for torrenting; however, download speed becomes 2-3 times faster if I just switch back to RC_1_2 with the same client and libtorrent settings.

Are there any settings I can try to tune to improve my download speed? Thank you!

arvidn commented 2 years ago

are you using a 64 bit build of libtorrent?

The 64 bit build uses memory mapped files and multi-threaded disk I/O, which is expected to provide better performance than the 32 bit counterpart. The 32 bit build uses fopen()/fclose(), is single-threaded, and opens and closes the file for every access (which normally isn't very fast on windows).

arvidn commented 2 years ago

I see, you're building the binding in 64 bit mode, but not the main library. However, on linux I believe 64 bit builds are the default anyway, so that probably doesn't make a difference.

arvidn commented 2 years ago

Do you know whether you're CPU-, network- or Disk I/O bound?

Alexey104 commented 2 years ago

Thank you for your feedback!

> are you using a 64 bit build of libtorrent?

Yes, it's 64-bit. I've recompiled with address-model=64 just to be sure.

> Do you know whether you're CPU-, network- or Disk I/O bound?

Disk is definitely the bottleneck for me. I use an old, slow 5200rpm 2TB WD Red HDD for torrenting. But below is a comparison of downloading the same torrent using RC_1_2 (upper) and RC_2_0 (lower):

[Two attached images: RC_1_2 (upper), RC_2_0 (lower)]

rafi-d commented 2 years ago

> Disk is definitely the bottleneck for me.

Have you tried with a RAM drive to confirm that? @arvidn: what changed in RC2 vs RC1.x?

arvidn commented 2 years ago

the main (disk related) change is the use of memory mapped files for disk I/O. But since not all systems support memory mapped files, there's a simple, single-threaded fopen()-based fallback for such cases. The mmap-backend also maps whole files, which means it needs a lot of virtual memory, so it is only enabled in 64 bit builds.

I say this just to make sure that it is in fact the mmap-based back-end you end up using. If not, that would be the most obvious explanation for the slow-down.

Other than that, there are session-stats that can be printed to the log (and analysed with a script in libtorrent). Short of doing that though, you could experiment with various disk I/O settings, like the number of I/O threads.

I would expect the kernel to do a decent job of caching disk writes in memory and then writing them to disk in a reasonable order to maximize throughput. libtorrent-1.2.x is not sophisticated in which order it writes blocks to disk. Obviously, Linux won't cache dirty pages indefinitely, and once you reach a steady state you might be limited by the HDD write speed.

Do you have a lot of RAM?

One change I can imagine might make a difference is that libtorrent-1.2.x had an application level disk cache which it would allocate and store blocks in while waiting for the disk. It's possible that the torrents you download all fit in your working set, and can essentially download to RAM, and then flush to disk in the background. In libtorrent-2.0 it's up to the kernel to balance how many dirty pages it keeps for the bittorrent process before it starts throttling it. So it's possible the kernel is more conservative than the "manual" disk cache in 1.2.x.

Do you experience a high download rate at first, to have it drop once some cache threshold is reached?

Alexey104 commented 2 years ago

> Do you have a lot of RAM?

8 Gigs.

> Do you experience a high download rate at first, to have it drop once some cache threshold is reached?

No, download speed is constantly low from beginning to end (compared to RC_1_2).

> One change I can imagine might make a difference is that libtorrent-1.2.x had an application level disk cache which it would allocate and store blocks in while waiting for the disk. It's possible that the torrents you download all fit in your working set, and can essentially download to RAM, and then flush to disk in the background. In libtorrent-2.0 it's up to the kernel to balance how many dirty pages it keeps for the bittorrent process before it starts throttling it. So it's possible the kernel is more conservative than the "manual" disk cache in 1.2.x.

Yes, I can see the difference in that with RC_1_2 RAM is constantly filled with cache_size * 16 KiB blocks of data during the entire session, whereas with RC_2_0 cache_size doesn't matter anymore, and RAM usage changes dynamically depending on the current workload.

> Other than that, there are session-stats that can be printed to the log (and analysed with a script in libtorrent). Short of doing that though, you could experiment with various disk I/O settings, like the number of I/O threads.

Thank you, I will dig into it when I have more time and report back if I make any progress on this.

master255 commented 2 years ago

@Alexey104 To prove that you are right, you can take two programs: Libretorrent (libtorrent 2.0.4) and Media Library (libtorrent 1.2.13). Install both on one Android device and enable video recording. Download the same torrent with each, one after the other. Upload the video to the cloud, for example mega.nz, then post the link here.

peacey commented 2 years ago

I'm also experiencing 3x slower download speeds with libtorrent 2.0.5 compared to 1.2.x (with qbt frontend on Arch Linux). I have a 2.5 Gbps connection and when using 1.2.x I'm able to saturate my connection at a sustained ~300 MiB/s (2500 Mbps) with ~100 seeders (single or multiple torrents). When I switch to libtorrent 2.x, I can only reach 100 MiB/s (~700 Mbps) no matter how many torrents or seeders. That means I am losing 1.5-2 Gbps worth of download speed. That's quite a hit. Switching back to 1.2.x immediately solves the problem.
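For reference, the unit conversions behind figures like these can be sketched as below. This is plain arithmetic, not libtorrent code; the helper names `mbps_to_mib_per_s` and `mib_per_s_to_mbps` are made up for illustration, and the sketch assumes Mbps means 10^6 bits per second and MiB means 2^20 bytes.

```cpp
#include <cassert>

// Convert a link rate in megabits per second (10^6 bits/s)
// to mebibytes per second (2^20 bytes/s).
inline double mbps_to_mib_per_s(double mbps)
{
    assert(mbps >= 0.0);
    return mbps * 1e6 / 8.0 / (1024.0 * 1024.0);
}

// Convert a transfer rate in mebibytes per second (2^20 bytes/s)
// to megabits per second (10^6 bits/s).
inline double mib_per_s_to_mbps(double mib_per_s)
{
    assert(mib_per_s >= 0.0);
    return mib_per_s * 1024.0 * 1024.0 * 8.0 / 1e6;
}
```

Under those assumptions a 2500 Mbps link tops out at roughly 298 MiB/s, and 100 MiB/s corresponds to roughly 839 Mbps, so the numbers quoted above should be read as approximate.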

I have tried playing with the IO threads but no dice. Are there other options I can try to change to replicate the old behaviour and download at a sustained 2.5 Gbps again?

FYI, I'm using an NVMe drive and able to read/write at 2600 MiB/s so drive speed isn't an issue. Also, my system has 8 GB of RAM and a dual-core i5-7360U. But these specs were fine to saturate my speed before with libtorrent 1.2.x.

arvidn commented 2 years ago

> I'm also experiencing 3x slower download speeds with libtorrent 2.0.5 compared to 1.2.x (with qbt frontend on Arch Linux).

"3x slower" means 1/3 the rate, right?

> I have tried playing with the IO threads but no dice. Are there other options I can try to change to replicate the old behaviour and download at a sustained 2.5 Gbps again?

You could try increasing max_queued_disk_bytes, which defaults to 1 MiB.

peacey commented 2 years ago

Thanks @arvidn. Yes, 1/3rd the rate. I tried to increase max_queued_disk_bytes in the source code, but it didn't change anything on my setup.

However, I did find the solution. qBittorrent defaults to a file_pool_size of 5000. When I put this down to 40 (the default in the libtorrent reference), I am able to download at a sustained 2.5 Gbps with 2.x. I set it to 10 because I found that more stable (less sudden drops in throughput). Is there a consequence of setting it that low?

An observation with memory usage (possibly related to #6667 and new mmap structure):

1. With 1.x, I was able to saturate my 2.5G connection with a file pool size of 5000. htop reports VIRT 2475M and MEM 13% for the qbt process.
2. With 2.x, a file pool size of 5000 causes VIRT 2500G+ and MEM 60% for the process. My download speed degrades to 1/3rd of the rate in this case. Not sure if the 60% MEM and 2500G VIRT of mmap-ed memory is causing the decrease in my throughput here?
3. With 2.x, a file pool size of 10-40 results in VIRT 30-200G and MEM 8%, and I'm able to sustain 2.5G download rate again.

arvidn commented 2 years ago

@sledgehammer999 ^^ it sounds like a smaller default file_pool_size makes sense

arvidn commented 2 years ago

> I set it to 10 because I found that more stable (less sudden drops in throughput). Is there a consequence of setting it that low?

The file pool is a cache of open files (and memory maps of those files). The size determines how many files are kept open at most. So, a small cache size means there will be more calls to open(), mmap(), munmap() and close(). Those calls aren't terribly expensive on linux, so more calls probably won't make a very big difference.
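A rough model of what a pool like this does, assuming a plain least-recently-used eviction policy. `FilePool`, `acquire` and the `opens`/`closes` counters are hypothetical names for illustration, not libtorrent's actual classes; real code would call open()/mmap() and munmap()/close() where this sketch only counts.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>

// Hypothetical LRU pool keyed by file index. It only counts how often
// a file would have to be opened (and mapped) or closed (and unmapped).
class FilePool
{
public:
    explicit FilePool(std::size_t capacity) : capacity_(capacity)
    {
        assert(capacity > 0);
    }

    // Mark `file` as used, "opening" it first if it is not in the pool.
    void acquire(int file)
    {
        auto it = index_.find(file);
        if (it != index_.end()) {
            // Hit: move the entry to the front (most recently used).
            lru_.splice(lru_.begin(), lru_, it->second);
            return;
        }
        if (lru_.size() == capacity_) {
            // Pool is full: evict the least recently used file.
            // Real code would munmap() and close() it here.
            index_.erase(lru_.back());
            lru_.pop_back();
            ++closes;
        }
        lru_.push_front(file);
        index_[file] = lru_.begin();
        ++opens; // real code would open() and mmap() here
    }

    std::size_t size() const { return lru_.size(); }

    std::size_t opens = 0;
    std::size_t closes = 0;

private:
    std::size_t capacity_;
    std::list<int> lru_; // front = most recently used
    std::unordered_map<int, std::list<int>::iterator> index_;
};
```

With three files accessed round-robin, a pool of size 2 reopens a file on every access, while a pool of size 3 opens each file exactly once. That is the extra open()/mmap() traffic a small pool trades for a smaller set of live mappings.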

I suspect the main reason this affects your download speed is that the virtual address space, and potentially the working set size, grows roughly in proportion to the number of files you keep open. As you noted in (2), the virtual memory size is large when mapping a lot of large files. The working set being large just means the kernel hasn't found a better use for those pages yet. They should be among the first to be reclaimed when your system needs more RAM.

Using a smaller file pool will cause files to be munmap()ed more often, which is a pretty strong hint to the kernel that it should flush those dirty pages to disk.

It's not clear to me how the larger working set causes the slowdown in download rate though. Perhaps the process is being throttled in allocating new pages, or perhaps just the kernel maintenance of the pages is costly. There's more work to be done tuning this. One of my ideas is to test the new MADV_COLD advice, to really make pages first-in-line to be reclaimed.

rafi-d commented 2 years ago

Is this only Linux related, or could it affect Windows too?

peacey commented 2 years ago

@arvidn How do you plan to use MADV_COLD? I'm not an expert on Linux memory management, but I naively added | MADV_COLD to advise in mmap.c then recompiled libtorrent and qbt, but it didn't seem to make any difference. Using a file_pool_size of 5000 still results in download rate capped at 1/3rd rate (and 60% MEM/3000G VIRT). Setting file_pool_size to 10-40 fixes the download rate again.

Alexey104 commented 2 years ago

I don't use qbt, and file_pool_size is at its default (40). I still get slow downloads.

peacey commented 2 years ago

@Alexey104 Have you tried to set it to 10? Keep in mind I am not disk I/O bound since I'm using an NVMe drive. Even though the problem might be caused by the same change (mmap backend), the solution for your case might be different if you are in fact disk I/O bound.

Alexey104 commented 2 years ago

> Have you tried to set it to 10?

No, I haven't yet. I am sick with corona and am just reading this thread lying in bed with my phone. But I'll try when I feel good enough to move my ass to the desktop machine.

arvidn commented 2 years ago

> How do you plan to use MADV_COLD?

My idea (which I haven't had time to test yet) is to keep a ring-buffer of the most recently completed piece indices, i.e. the moment we have a piece and it passes the hash check, its piece number is inserted into the ring-buffer.

At that point, we have already hashed the piece, so the only reason it might get read again is if a peer asks for it. This is likely to happen since we download the rarest pieces first, so once we complete a piece, the largest number of other peers will be interested in it. For this reason, it should still be considered "hot" (or at least not cold) for a while. If we receive a request from a peer for this piece, it's removed from the ring buffer (so, I suppose it needs an index too). If the piece index makes it all the way through the ring buffer, it means it hasn't been requested for a while, and we can then mark it as cold (with MADV_COLD), making it a candidate for eviction.
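The bookkeeping described above could look roughly like this. It is only a sketch of the idea, not code from libtorrent: `RecentPieces`, `on_piece_completed`, `on_piece_requested` and `no_piece` are hypothetical names, and the actual madvise(MADV_COLD) call is left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <unordered_map>

constexpr int no_piece = -1; // sentinel: nothing fell out of the buffer

// Hypothetical fixed-size buffer of recently completed piece indices,
// with an index so a requested piece can be removed in O(1).
class RecentPieces
{
public:
    explicit RecentPieces(std::size_t capacity) : capacity_(capacity)
    {
        assert(capacity > 0);
    }

    // Call when a piece passes its hash check. Returns the piece that
    // fell out of the buffer without being requested (a candidate for
    // MADV_COLD), or no_piece if the buffer was not yet full.
    int on_piece_completed(int piece)
    {
        on_piece_requested(piece); // make sure there is no duplicate entry
        int cold = no_piece;
        if (order_.size() == capacity_) {
            cold = order_.front();
            index_.erase(cold);
            order_.pop_front();
        }
        order_.push_back(piece);
        index_[piece] = std::prev(order_.end());
        return cold;
    }

    // Call when a peer requests the piece: it is in use, so it is
    // dropped from the buffer and will not be reported as cold.
    void on_piece_requested(int piece)
    {
        auto it = index_.find(piece);
        if (it == index_.end()) return;
        order_.erase(it->second);
        index_.erase(it);
    }

    std::size_t size() const { return order_.size(); }

private:
    std::size_t capacity_;
    std::list<int> order_; // front = oldest completed piece
    std::unordered_map<int, std::list<int>::iterator> index_;
};
```

The caller would map a returned piece index to its byte range in the mapping and advise the kernel on that range; how large the buffer should be is an open tuning question.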

> I naively added | MADV_COLD to advise in mmap.c then recompiled libtorrent and qbt, but it didn't seem to make any difference.

I wouldn't be surprised if this advice is removed the moment you access the memory, since it's no longer cold then, by definition. If you set this on all pages when you first map the file, every single page will become hot before you want them to be evicted. I don't think that approach would make a difference. You could try to advise MADV_COLD on every block that has been written. Basically, to tell the kernel to assume nobody will be interested in the data again. But, presumably, if a peer gets to requesting it before the kernel evicts it, it won't be considered cold anymore. Perhaps this would be sufficient, over my more complex algorithm above.

Alexey104 commented 2 years ago

Lowering file_pool_size and increasing aio_threads didn't make any difference for me at all.

tacohitbox commented 2 years ago

Any updates on this?

arvidn commented 2 years ago

please give this patch a try: https://github.com/arvidn/libtorrent/pull/6703

Alexey104 commented 2 years ago

@arvidn, thank you for your hard work, but in my case, unfortunately, this patch didn't make any difference. Still get at most 1/2 of my usual download rate.

arvidn commented 2 years ago

@Alexey104 that patch only makes a difference (by default) on windows. I haven't received many reports of this being a problem on linux. After applying that patch, you also need to enable the option disk_io_write_mode = settings_pack::write_through.

That will use the MADV_PAGEOUT hint, which is new in Linux 5.4. If you have an older kernel, that won't make a difference either.

peacey commented 2 years ago

Hi @arvidn,

EDIT: See post below. These results are not correct because I compiled RC_2_0 with the debug flag and 0 optimizations. Everything worked fine after I compiled with Release flag (which adds -O3).

So I'm seeing vastly different throughput performance in v2.0.5 vs. RC_2_0 branch. I tried your patch too and set disk_io_write_mode = settings_pack::write_through in the qbittorrent source, but it still did not help. This is on Arch Linux (kernel 5.16.2), not Windows.

~~Here is a comparison of the three cases with throughput graphs (measured over 5 minutes). All cases had 16 I/O threads, 4 hashing threads, and 10 file pool size.~~

~~1. Tag v2.0.5: Able to maintain 2.5 Gbps throughput. CPU usage for qbt process is 100-200% while downloading, but QBT web UI is still responsive while downloading at max throughput.~~

~~2. Branch RC_2_0: NOT able to maintain or reach 2.5 Gbps. Throughput is not stable like v2.0.5. CPU usage is 100-200% while downloading, but QBT web UI is NOT responsive (have to delete download files so it stops downloading them then I can access web GUI again).~~

~~3. With Patch: NOT able to maintain or reach 2.5 Gbps. Throughput is again not stable. CPU usage is 100-200% while downloading, but QBT web UI is NOT responsive. This is with disk_io_write_mode = settings_pack::write_through. Doesn't seem to make any difference compared to RC_2_0.~~

As you can see, there seems to be some problem introduced between v2.0.5 and current RC_2_0 that severely degrades the throughput, and the patch does not seem to help. The QBT web UI being unresponsive in RC_2_0 might be related to this throughput issue too.

~~Do you have any idea what could have caused this change since v2.0.5?~~

peacey commented 2 years ago

Sorry @arvidn, please disregard my previous post. I made a mistake and compiled RC_2_0 with debug flags and no optimizations, while 2.0.5 was compiled with optimizations and no debug flags. The lack of optimizations, together with the debug build, made my RC_2_0 throughput suffer.

After I recompiled RC_2_0 with the CMake flag -DCMAKE_BUILD_TYPE=Release, which applies -O3 optimizations, I no longer have any throughput issues with RC_2_0 and can reach and maintain full throughput without any drops, as can be seen from this throughput graph.

Sorry for the confusion.

arvidn commented 2 years ago

heh, I was just about to suggest reviewing how the different versions were built :)

FaceCrap commented 2 years ago

> @sledgehammer999 ^^ it sounds like a smaller default file_pool_size makes sense

I was hoping that would be the same for me. Sadly, with it set to 10, I keep getting the same slow DL speed as with the default value. I also only have on average 1 or 2 torrents actively downloading.

arvidn commented 2 years ago

I think the file pool size indirectly affects the likelihood of having many dirty pages. Presumably, a more reliable approach is to explicitly page out dirty pages, like this patch does: https://github.com/arvidn/libtorrent/pull/6703

xavier2k6 commented 2 years ago

> please give this patch a try: #6703

> @Alexey104 that patch only makes a difference (by default) on windows. I haven't received many reports of this being a problem on linux. Once applying that patch, you also need to enable the option disk_io_write_mode = settings_pack::write_through.

> That will use the MADV_PAGEOUT hint, which is new in Linux 5.4. If you have an older kernel, that won't make a difference either.

Does enabling disk_io_write_mode = settings_pack::write_through have any effect on Windows/macOS or is it specific to just Linux?

Rootax commented 2 years ago

Windows 10 x64 here.

On my side, top speed is the same (around 2gb/sec, the max of my internet line), but rc1.2 can maintain that all the way, while 2.0 can't. Up and down, up and down, like it's slowing down whenever it feels the need to write to the disk, while rc 1.2 can do both with no problem...

(EDIT: on rc 1.2 I'm using a cache of only 32 MB. No such setting on rc 2.0. I don't hit any memory limit, and I've the same behaviour with or without OS cache enabled.)

arvidn commented 2 years ago

> Does enabling disk_io_write_mode = settings_pack::write_through have any effect on Windows/macOS or is it specific to just Linux?

It has an effect on all major OSes. On windows it uses FlushViewOfFile() and on posix systems it uses msync()
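On the POSIX side, the general pattern can be sketched roughly as follows: write through a shared mapping, then call msync() so the dirty pages are pushed to the file. This is a standalone illustration, not libtorrent's implementation; `write_and_flush` is a hypothetical helper, the choice of MS_SYNC is an assumption made for the example, and error handling is minimal.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Write `len` bytes at offset 0 of `path` through a shared memory
// mapping and flush them with msync(). Returns true on success.
bool write_and_flush(char const* path, char const* data, std::size_t len)
{
    assert(len > 0);
    int const fd = ::open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return false;

    // The file must be at least as large as the range we map.
    if (::ftruncate(fd, static_cast<off_t>(len)) != 0) {
        ::close(fd);
        return false;
    }

    void* map = ::mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        ::close(fd);
        return false;
    }

    std::memcpy(map, data, len); // dirties the mapped page(s)

    // Ask the kernel to write the dirty pages back and wait for it.
    bool const ok = ::msync(map, len, MS_SYNC) == 0;

    ::munmap(map, len);
    ::close(fd);
    return ok;
}
```

Without the msync() call the data would still reach the file eventually; the call only controls when the write-back happens, which is the point of a write-through mode.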

arvidn commented 2 years ago

@Rootax it sounds like maybe the disk write buffer needs to be a bit larger in libtorrent 2.0.x. The 32 MiB in libtorrent 1.2 may still be exceeded while flushing to disk, so it's possible to end up using more than that in practice.

Right now, libtorrent defaults max_queued_disk_bytes to just 1 MiB. Maybe that really should be closer to 32 MiB. Would you mind trying that?
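As a rough illustration of why this limit matters: as documented for this setting, once the bytes waiting for the disk thread reach max_queued_disk_bytes, peer connections stop reading from their sockets until the disk catches up. The toy model below only mimics that accounting; `WriteQueue` and its member names are hypothetical and not part of libtorrent.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of a bounded disk write queue (not libtorrent code).
class WriteQueue
{
public:
    explicit WriteQueue(std::size_t max_queued_bytes)
        : limit_(max_queued_bytes)
    {
        assert(max_queued_bytes > 0);
    }

    // A block arrived from the network and is queued for writing.
    void on_block_received(std::size_t bytes) { queued_ += bytes; }

    // The disk thread finished writing `bytes`.
    void on_block_written(std::size_t bytes)
    {
        assert(bytes <= queued_);
        queued_ -= bytes;
    }

    // Peers are only read from while the queue is below the limit.
    bool can_read_from_peers() const { return queued_ < limit_; }

    std::size_t queued() const { return queued_; }

private:
    std::size_t limit_;
    std::size_t queued_ = 0;
};
```

With 16 KiB blocks, a 1 MiB limit is reached after 64 outstanding blocks, whereas a 32 MiB limit allows 2048 before downloads are paused.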

Rootax commented 2 years ago

> @Rootax it sounds like maybe the disk write buffer needs to be a bit larger in libtorrent 2.0.x. The 32 MiB in libtorrent 1.2 may still exceed that while flushing to disk, so it's possible end up using more than that in practice.

> Right now, libtorrent defaults max_queued_disk_bytes to just 1 MiB. Maybe that really should be closer to 32 MiB. Would you mind trying that?

Already pushed it to 32 MB, even tried 64 MB. It helps at first (I'm stable for a little longer), but at some point the yo-yo behaviour is back.

For the record, the final storage is not the problem (I tried on an NVMe SSD and a RAM drive).

xavier2k6 commented 2 years ago

> Does enabling disk_io_write_mode = settings_pack::write_through have any effect on Windows/macOS or is it specific to just Linux?

> It has an effect on all major OSes. On windows it uses FlushViewOfFile() and on posix systems it uses msync()

Seems in qBittorrent we still use settingsPack.set_int(lt::settings_pack::disk_io_write_mode, mode);

So perhaps we need to do something like below:

#ifdef QBT_USES_LIBTORRENT2
    settingsPack.set_int(lt::settings_pack::disk_io_write_mode, lt::settings_pack::write_through);
#else
    settingsPack.set_int(lt::settings_pack::disk_io_write_mode, mode);
#endif

Rootax commented 2 years ago

Maybe not related, but I noticed another weird behaviour on my system (Win 10 x64, 32 GB of RAM, latest qbt/libt).

When qbt decides to write the data to the disk, the disk is reading data too. Like, read 200 MB/s, write 200 MB/s, at the same time, until everything is flushed. I didn't see this behaviour with lib 1.2. Did the data checking change in lib 2.0 so that, I don't know, data is read again before writing or something like that?

Thx.

Alexey104 commented 2 years ago

> but rc1.2 can maintain that all the way, while 2.0 can't. Up and down, up and down, like it's slowing down when it feel the need to write to the disk, while rc 1.2 can do both no problem...

I also observe this behavior. But max download speed also suffers significantly in my case. Cannot reach even half of my bandwidth with 2.0.

Rootax commented 2 years ago

When the speed is dying, I see that QBT shows that the write cache is overloaded (whatever this is measuring with lib 2.x, since the cache is different). Doesn't happen with lib 1.2.x.

Rootax commented 2 years ago

> Maybe not related, but I noticed another weird behaviour on my system (Win 10 x64, 32 GB of RAM, latest qbt/libt).

> When qbt decides to write the data to the disk, the disk is reading data too. Like, read 200 MB/s, write 200 MB/s, at the same time, until everything is flushed. I didn't see this behaviour with lib 1.2. Did the data checking change in lib 2.0 so that, I don't know, data is read again before writing or something like that?

> Thx.

Maybe this is related to this :

https://stackoverflow.com/questions/68953829/writing-to-a-memory-mapped-file-shows-read-accesses-in-htop

"The read access occurs because as the pages are accessed for the first time they need to be read in from disk. The OS is not clairvoyant and doesn't know that the reads will be thrown out.

To avoid the issue, don't use mmap(). Build the blocks in buffer and write them out the old fashioned way."

Maybe I'm all wrong. But if this is the normal behaviour, it doesn't help performance, that's for sure...

arvidn commented 2 years ago

I recently landed this which might impact this issue: https://github.com/arvidn/libtorrent/pull/6861

apart from defaulting hashing_threads to 1, it also relaxes the hashing operations of pieces as they are being downloaded to be run on any aio_thread. The main reason to use 1 hashing thread for full file checking is for them to read sequentially from the drive.

During download, however, the pieces being checked are most likely still in the cache, and random access should be fine.

Alexey104 commented 2 years ago

@Arvid, Thank you! I will try it and report back about the results.

Alexey104 commented 2 years ago

Unfortunately, I don't see any significant improvements :( Although it seems to be a little bit better with #6861(lower), it is still far from being as good as RC_1_2(upper):

[Two attached images: RC_1_2 (upper), RC_2_0 (lower)]

Rootax commented 2 years ago

> Unfortunately, I don't see any significant improvements :( Although it seems to be a little bit better with #6861(lower), it is still far from being as good as RC_1_2(upper): ...

Is download speed still tanking when pieces are written to disk ?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

rmawatson commented 1 year ago

I was also getting significantly slower download speeds with qBittorrent 4.5.2 using libtorrent 2. Switched back to the same build of qBittorrent with libtorrent 1.2 and speeds went from ~11 MB/s to 70-90 MB/s.

stalkerok commented 1 year ago

Everyone seems to be trying to forget or not notice problems with speed, but the problem has not gone away. People who have an Internet connection above 100 Mbps suffer from this to a greater extent. Using lt2.0 greatly affects the download and upload speed, as well as many, many related problems. https://github.com/qbittorrent/qBittorrent/issues/16043

Rootax commented 1 year ago

Here I didn't forget, I just gave up on 2.x a while ago. I build what I need with 1.2 and it's working great. Great speed, no weird i/o behaviour, etc....

On Wed, 21 Jun 2023 at 13:08, stalkerok wrote:

> Everyone seems to be trying to forget or not notice problems with speed, but the problem has not gone away. People who have an Internet connection above 100 Mbps suffer from this to a greater extent. Using lt2.0 greatly affects the download and upload speed, as well as many, many related problems. https://github.com/qbittorrent/qBittorrent/issues/16043


stalkerok commented 1 year ago

I also stopped using lt2.0, but seeing that the lt1.2 version is almost not developed, I'm worried that it will soon die altogether. It has already gotten to the point where qbittorrent is dropping i2p support, since i2p is dead in lt1.2 (https://github.com/qbittorrent/qBittorrent/pull/19207). This is in addition to the fact that lt1.2 does not contain a huge number of changes and corrections besides i2p.

Pentaphon commented 1 year ago

Is this still an issue with libtorrent 2.0.9?

master255 commented 1 year ago

I'll tell you more. The speed of libtorrent 1.2 on tcp\udp torrent protocol is less than on https protocol. Use my torrent client. It's the only one in the world which supports distribution of torrents via https protocol.

arvidn / libtorrent

RC_2_0 - low download speed compared to RC_1_2 #6561