anacrolix / torrent

Full-featured BitTorrent client package and utilities
Mozilla Public License 2.0

Optimize memory usage by avoiding intermediate buffer in message serialization #928

Closed · Zerolzj closed this 2 months ago

Zerolzj commented 2 months ago

This commit replaces the intermediate buffer used during message serialization with a direct write-to-buffer approach. The original implementation called MustMarshalBinary(), which allocated an intermediate buffer and incurred an extra memory copy before the data reached the final writeBuffer, leading to high memory consumption for large messages. The new WriteTo function writes message data directly into the writeBuffer, significantly reducing memory overhead and the CPU time spent on garbage collection.

Zerolzj commented 2 months ago

Optimization Documentation Summary

Experimental Environment:

CPU Usage Comparison Table

| Component | Pre-Optimization CPU Usage | Post-Optimization CPU Usage |
| --- | --- | --- |
| peerRequestDataReader | 18.2% | 33.5% |
| messageWriterRunner | 20.7% | 12.2% |
| gc (garbage collection) | 51.4% | 37.6% |
| mcall | 7.4% | 15.1% |

Memory Allocation and Transfer Rate Comparison Table

| Metric | Pre-Optimization | Post-Optimization |
| --- | --- | --- |
| Memory Allocated | 160GB | 88GB |
| Measurement Interval | 25 seconds | 25 seconds |
| Peak Transfer Rate | 120MB/s | 185MB/s |

Based on the experimental environment and test results above, the following conclusions can be drawn:

  1. The increased CPU share of peerRequestDataReader reflects more efficient data handling: CPU time freed from garbage collection can now be spent reading data instead.

  2. The reduced CPU consumption of messageWriterRunner indicates lower resource overhead during message serialization.

  3. The sharp drop in garbage-collection CPU usage demonstrates improved memory allocation and reclamation efficiency.

  4. Total memory allocated over the 25-second interval fell from 160GB to 88GB, a more effective use of memory that lowers overall system load.

  5. The peak transfer rate rose from 120MB/s to 185MB/s, a substantial improvement in overall data transfer performance.

In summary, the experimental results indicate that the optimization reduces memory consumption while boosting data transmission performance, a significant gain for the BitTorrent client. Future work will focus on further increasing speed while reducing resource consumption.

Zerolzj commented 2 months ago

Attachment: 

Performance Analysis and Upload Throughput Image Comparison

| Metric | Before Optimization | After Optimization |
| --- | --- | --- |
| CPU pprof image | Before_Optimization_CPU_pprof_Image | After_Optimization_CPU_pprof_Image |
| MEM pprof image | Before_Optimization_MEM_pprof_Image | After_Optimization_MEM_pprof_Image |
| Upload Bps image | Before_Optimization_Upload_bps_Image | After_Optimization_Upload_bps_Image |

anacrolix commented 2 months ago

Could you include a benchmark for peerConnMsgWriter.write?

I suspect that precomputing the message length and growing the buffer manually aren't necessary. I would expect the majority of the gains to come just from writing directly into the buffer, which should retain its size and underlying storage every time the buffers are swapped in the writer routine.

Zerolzj commented 2 months ago

> Could you include a benchmark for peerConnMsgWriter.write?
>
> I suspect that precomputing the message length, and growing the buffer manually aren't necessary. I would expect that the majority of the gains would come just from writing directly into the buffer, which should retain its size and underlying storage every time the buffers are swapped in the writer routine.

Thanks for your feedback. I'll update the community when I have some benchmark results to share, and conduct a detailed analysis to assess the benefits of precomputing the message length.

anacrolix commented 2 months ago

Thanks!

Zerolzj commented 2 months ago

Benchmarking Report

Benchmarks were performed on macOS with an Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz (amd64).

Benchmark Results

Below is a tabulated summary showing how each function performs with various piece lengths:

| PieceLength | Function | Iterations | Time (ns/op) | Memory (B/op) | Allocations (allocs/op) |
| --- | --- | --- | --- | --- | --- |
| 1M | BenchmarkWriteToBuffer1M | 4323 | 263913 | 988 | 4 |
| 1M | BenchmarkMarshalBinaryWrite1M | 1980 | 598797 | 2114204 | 7 |
| 4M | BenchmarkWriteToBuffer4M | 4534 | 276714 | 942 | 4 |
| 4M | BenchmarkMarshalBinaryWrite4M | 529 | 2208230 | 8413078 | 7 |
| 8M | BenchmarkWriteToBuffer8M | 4130 | 274931 | 1033 | 4 |
| 8M | BenchmarkMarshalBinaryWrite8M | 240 | 4268217 | 16828753 | 7 |

Zerolzj commented 2 months ago

> Could you include a benchmark for peerConnMsgWriter.write?
>
> I suspect that precomputing the message length, and growing the buffer manually aren't necessary. I would expect that the majority of the gains would come just from writing directly into the buffer, which should retain its size and underlying storage every time the buffers are swapped in the writer routine.

The new commit has removed the code for manually growing the buffer. The benefit of handling the buffer growth manually appears to be marginal.

anacrolix commented 2 months ago

Thanks, I'll check it out soon.

anacrolix commented 2 months ago

I made some tweaks to the benchmarks to make them more consistent. In particular, one form was always using the 4M length. I also added the most common size actually in use (16KiB, the default block size).

I think a few types have gone missing in the changes, I'll fix those up too and likely merge shortly.

anacrolix commented 2 months ago

Starting and stopping the timer actually caused issues with the timing; the work excluded by the reset is too small to have any effect on the benchmark.

anacrolix commented 2 months ago

I reduced the changes a bit to reuse the old payload writing code, since HashRequest was missing.

The MarshalBinary performance is doubled, but the write to buffer stuff is not quite as fast (but still 16x faster than it was).

I'll merge what I have so far, but if you want to make tweaks to reduce the allocations further to get the extra 50% back, I think the code base is in a better position to take those changes, and it will be clearer what is being done to get that.

I should add the allocated space is massively reduced. Sorry if the changes were pretty heavy handed, but I think restructuring the write to buffer, and then doing smaller cleanups in a separate change is less risky.

Zerolzj commented 2 months ago

> I reduced the changes a bit to reuse the old payload writing code, since HashRequest was missing.
>
> The MarshalBinary performance is doubled, but the write to buffer stuff is not quite as fast (but still 16x faster than it was).
>
> I'll merge what I have so far, but if you want to make tweaks to reduce the allocations further to get the extra 50% back, I think the code base is in a better position to take those changes, and it will be clearer what is being done to get that.
>
> I should add the allocated space is massively reduced. Sorry if the changes were pretty heavy handed, but I think restructuring the write to buffer, and then doing smaller cleanups in a separate change is less risky.

I apologize for the oversight with the HashRequest; it was unintentional and I regret any inconvenience caused. Also, I'm grateful for your thorough work on the updates. Collaborating with you has been incredibly enlightening, and I look forward to doing so again.