anacrolix / torrent

Full-featured BitTorrent client package and utilities
Mozilla Public License 2.0

Optimize memory usage by avoiding intermediate buffer in message serialization #928

Closed · Zerolzj closed this 2 months ago

Zerolzj commented 2 months ago

This commit replaces the intermediate buffer used during message serialization with a direct write-to-buffer approach. The original implementation called MustMarshalBinary(), which allocated an intermediate buffer and incurred an extra memory copy before the data reached the final writeBuffer, leading to high memory consumption for large messages. The new WriteTo function writes message data directly into the writeBuffer, significantly reducing memory overhead and the CPU time spent on garbage collection.

Zerolzj commented 2 months ago

Optimization Documentation Summary

Experimental Environment:

CPU Usage Comparison Table

| Component | Pre-Optimization CPU Usage | Post-Optimization CPU Usage |
| --- | --- | --- |
| peerRequestDataReader | 18.2% | 33.5% |
| messageWriterRunner | 20.7% | 12.2% |
| gc (garbage collection) | 51.4% | 37.6% |
| mcall | 7.4% | 15.1% |

Memory Allocation and Transfer Rate Comparison Table

| Metric | Pre-Optimization | Post-Optimization |
| --- | --- | --- |
| Memory Allocated | 160GB | 88GB |
| Measurement Interval | 25 seconds | 25 seconds |
| Peak Transfer Rate | 120MB/s | 185MB/s |

Based on the experimental environment and test results above, the following conclusions can be drawn:

  1. The increased CPU share of peerRequestDataReader reflects more efficient data handling: CPU time freed from garbage collection can now be spent reading data instead.

  2. The reduced CPU consumption of messageWriterRunner indicates lower resource overhead during message serialization.

  3. The sharp drop in garbage-collection CPU usage demonstrates improved memory allocation and reclamation efficiency.

  4. Total memory allocated over the 25-second interval fell from 160GB to 88GB, a more effective use of memory that lowers overall system load.

  5. The peak transfer rate rose from 120MB/s to 185MB/s, a substantial improvement in overall data transfer performance.

In summary, the experimental results indicate that the optimization reduces memory consumption while boosting data transmission performance, a significant gain for the BitTorrent client. Future work will focus on further increasing speed while reducing resource consumption.

Zerolzj commented 2 months ago

Attachment: 

Performance Analysis and Upload Throughput Image Comparison

| Metric | Before Optimization | After Optimization |
| --- | --- | --- |
| CPU pprof image | Before_Optimization_CPU_pprof_Image | After_Optimization_CPU_pprof_Image |
| MEM pprof image | Before_Optimization_MEM_pprof_Image | After_Optimization_MEM_pprof_Image |
| Upload Bps image | Before_Optimization_Upload_bps_Image | After_Optimization_Upload_bps_Image |

anacrolix commented 2 months ago

Could you include a benchmark for peerConnMsgWriter.write?

I suspect that precomputing the message length and growing the buffer manually aren't necessary. I would expect the majority of the gains to come just from writing directly into the buffer, which should retain its size and underlying storage every time the buffers are swapped in the writer routine.

Zerolzj commented 2 months ago

> Could you include a benchmark for peerConnMsgWriter.write?
>
> I suspect that precomputing the message length, and growing the buffer manually aren't necessary. I would expect that the majority of the gains would come just from writing directly into the buffer, which should retain its size and underlying storage every time the buffers are swapped in the writer routine.

Thanks for your feedback. I'll update the community when I have some benchmark results to share, and conduct a detailed analysis to assess the benefits of precomputing the message length.

anacrolix commented 2 months ago

Thanks!

Zerolzj commented 2 months ago

Benchmarking Report

Benchmarks were performed on macOS with an Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz (amd64).

Benchmark Results

Below is a tabulated summary showing how each function performs with various piece lengths:

| PieceLength | Function | Iterations | Time (ns/op) | Memory (B/op) | Allocations (allocs/op) |
| --- | --- | --- | --- | --- | --- |
| 1M | BenchmarkWriteToBuffer1M | 4323 | 263913 | 988 | 4 |
| 1M | BenchmarkMarshalBinaryWrite1M | 1980 | 598797 | 2114204 | 7 |
| 4M | BenchmarkWriteToBuffer4M | 4534 | 276714 | 942 | 4 |
| 4M | BenchmarkMarshalBinaryWrite4M | 529 | 2208230 | 8413078 | 7 |
| 8M | BenchmarkWriteToBuffer8M | 4130 | 274931 | 1033 | 4 |
| 8M | BenchmarkMarshalBinaryWrite8M | 240 | 4268217 | 16828753 | 7 |

Zerolzj commented 2 months ago

> Could you include a benchmark for peerConnMsgWriter.write?
>
> I suspect that precomputing the message length, and growing the buffer manually aren't necessary. I would expect that the majority of the gains would come just from writing directly into the buffer, which should retain its size and underlying storage every time the buffers are swapped in the writer routine.

The new commit has removed the code for manually growing the buffer. The benefit of handling the buffer growth manually appears to be marginal.

anacrolix commented 2 months ago

Thanks, I'll check it out soon.

anacrolix commented 2 months ago

I made some tweaks to the benchmarks to make them more consistent. In particular, one form was always using the 4M length. I also added the most common size actually in use (16KiB, the default block size).

I think a few types have gone missing in the changes, I'll fix those up too and likely merge shortly.

anacrolix commented 2 months ago

Starting and stopping the timer actually caused issues with the timing; the work excluded by the reset is too small to have any effect on the benchmark.

anacrolix commented 2 months ago

I reduced the changes a bit to reuse the old payload writing code, since HashRequest was missing.

The MarshalBinary performance is doubled, but the write to buffer stuff is not quite as fast (but still 16x faster than it was).

I'll merge what I have so far, but if you want to make tweaks to reduce the allocations further to get the extra 50% back, I think the code base is in a better position to take those changes, and it will be clearer what is being done to get that.

I should add the allocated space is massively reduced. Sorry if the changes were pretty heavy handed, but I think restructuring the write to buffer, and then doing smaller cleanups in a separate change is less risky.

Zerolzj commented 2 months ago

> I reduced the changes a bit to reuse the old payload writing code, since HashRequest was missing.
>
> The MarshalBinary performance is doubled, but the write to buffer stuff is not quite as fast (but still 16x faster than it was).
>
> I'll merge what I have so far, but if you want to make tweaks to reduce the allocations further to get the extra 50% back, I think the code base is in a better position to take those changes, and it will be clearer what is being done to get that.
>
> I should add the allocated space is massively reduced. Sorry if the changes were pretty heavy handed, but I think restructuring the write to buffer, and then doing smaller cleanups in a separate change is less risky.

I apologize for the oversight with the HashRequest; it was unintentional and I regret any inconvenience caused. Also, I'm grateful for your thorough work on the updates. Collaborating with you has been incredibly enlightening, and I look forward to doing so again.