Saiv46 opened 3 years ago
There are compression libraries that focus on compression/decompression speed:
Do you have any benchmarks on how well these compression algorithms work on various kinds of workloads? It would be interesting to have some idea of how helpful they would be.
Do you have any recommendations on how a client would choose whether to use compression at all? (presumably, some files just have too much entropy for it to be possible to compress them). If all compression algorithms are tried for every block, and the best one is chosen, it might become quite costly.
16 KiB is quite a small buffer to compress. I would expect the overhead of the compression dictionary to make it difficult to gain much. Have you considered separating the dictionary from the block buffers, so that the same dictionary can be reused for multiple blocks?
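To make the shared-dictionary idea concrete, here is a minimal sketch using the python `zstandard` bindings; the library choice, input file, dictionary size, and block size are placeholders of mine, not anything from the proposal:

```python
import zstandard as zstd

BLOCK = 16 * 1024
data = open("sample.bin", "rb").read()                      # placeholder input
blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

# Dictionary training wants plenty of sample blocks, so this assumes a
# reasonably large input. The dictionary would be shipped once, out of band.
shared_dict = zstd.train_dictionary(64 * 1024, blocks)

cctx = zstd.ZstdCompressor(level=3, dict_data=shared_dict)
dctx = zstd.ZstdDecompressor(dict_data=shared_dict)

# Each block is still compressed independently, but all of them reuse the
# same dictionary instead of paying its cost per block.
compressed = [cctx.compress(b) for b in blocks]
assert all(dctx.decompress(c) == b for c, b in zip(compressed, blocks))
```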
Now people still share files in .zip archives (or even in .rar, yikes!), which take additional space and need to be decompressed somewhere.
I think in many cases that's because some clients don't handle many small files well. Or because the source data they share already comes as an archive. I.e. compression is not necessarily the motivating factor.
The only time it would be valuable to compress the contents is when you're dealing with large amounts of textual data: huge SQL/JSON/CSV dumps, Wikipedia archives and so on. But those are often already compressed at the source too.
Instead of baking compression into the torrent format, I think a better option would be to add an extension message with a compressed payload that can be sent in place of several PIECE messages. If the data compresses well (2-3x), the client can cram several pieces into the common message size limit. And perhaps another message to send precomputed dictionaries for particular ranges. This could all be precomputed by the client as auxiliary data without having to change the metadata format. It would be an opt-in wire-protocol feature instead of a backwards-incompatible change.
Edit: Looks like I misunderstood and you already had on-the-wire compression in mind, I thought this was about doing it in the torrent file format similar to archive formats.
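For what it's worth, here is a hypothetical sketch of what such an extension message could look like on the wire: one BEP 10 extended message carrying several blocks of a piece as a single zstd-compressed payload. The payload layout, the extended message id, and the choice of zstd are illustrative assumptions, not part of any spec.

```python
import struct
import zstandard as zstd

EXT_MSG_ID = 9  # whatever id was negotiated for this message via the "m" dictionary

def build_compressed_blocks(piece_index, begin, blocks):
    payload = b"".join(blocks)                      # several contiguous 16 KiB blocks
    frame = zstd.ZstdCompressor(level=3).compress(payload)
    # body: <piece index><begin offset><uncompressed length><zstd frame>
    body = struct.pack(">III", piece_index, begin, len(payload)) + frame
    # BEP 10 framing: <length prefix><id=20 (extended)><extended message id><body>
    return struct.pack(">IBB", len(body) + 2, 20, EXT_MSG_ID) + body
```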
Also, allowing more than one compression algorithm has multiple drawbacks for implementations:
- it needs multiple dependencies
- different clients may end up with non-overlapping subsets of supported compressors
- an uploading client may have to precompute the compressed data in multiple formats due to different support levels by different downloaders
If we add compression at all then imo it should only be zstd. It's tunable and supports external dictionaries.
Thanks for your feedback!
Do you have any recommendations on how a client would choose whether to use compression at all? (presumably, some files just have too much entropy for it to be possible to compress them).
It's implementation-defined behaviour (some clients may want to try several algorithms, while others will just use the default one). As in the draft:
Some algorithms can't compress unreasonably small pieces efficiently, so the client should lower the algorithm's priority during the handshake, depending on the expected compression ratio for the defined piece size (down to zero in the worst case).
Also, allowing more than one compression algorithm has multiple drawbacks for implementations:
needs multiple dependencies
Well, an early-stage implementation will mostly need one compression algorithm and several decompression algorithms.
different clients may end up with non-overlapping subsets of supported compressors
The compression algorithm is selected by taking the highest-priority entry from the intersection of the algorithms supported by both peers; if there is no suitable compression algorithm, compression is disabled.
an uploading client may have to precompute the compressed data in multiple formats due to different support levels by different downloaders
If a client caches seeded/received pieces in memory, the pieces should be stored compressed and decompressed only when saving to disk or when sending to a peer that does not support that compression. To avoid re-compressing pieces when sending, a client can specify the ``cp`` field during the handshake.
A client can compress data in a single format and then specify it as the preferred algorithm. If the other side does not support it, the data will be sent uncompressed.
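A minimal sketch of that selection rule, assuming each peer advertises an {algorithm: priority} map during the extension handshake; the key names and the idea of summing the two priorities are my interpretation, not wording from the draft:

```python
def select_algorithm(local, remote):
    """Return the common algorithm with the highest priority, or None to disable compression."""
    common = set(local) & set(remote)
    # Priority 0 means "supported but don't use" on either side.
    common = {a for a in common if local[a] > 0 and remote[a] > 0}
    if not common:
        return None
    # Highest combined priority wins; ties broken by name for determinism.
    return max(common, key=lambda a: (local[a] + remote[a], a))

# Example: a seeder that already caches zstd-compressed pieces can signal that
# preference (the ``cp`` idea above) by giving zstd the top priority.
print(select_algorithm({"zstd": 10, "lz4": 5}, {"lz4": 7, "zstd": 3}))  # -> zstd
```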
If we add compression at all then imo it should only be zstd. It's tunable and supports external dictionaries.
Same, IMO. I think it's going to be the first format required by this BEP.
Before writing a BEP I would suggest surveying existing torrents and trying to piece-wise compress the data to see how much compression could actually be achieved. I doubt there's much to gain, even with optimized dictionaries.
Maybe there's some niche use-case, but that should be demonstrated.
I split files into pieces and then compressed them "naively" (no shared dictionary); here's what I got:
When I wrote this BEP, I worried about compression speed. But when testing things out, the bottleneck was I/O (most people seed from HDDs).
File | Type | Filename | Piece size |
---|---|---|---|
1 | Text | ftp.ic.rnd.mvd.ru_file_list.txt | 2 MiB |
2 | Audio | Песни сибирской каторги.flac | 128 KiB |
3 | Audio | Песни сибирской каторги.flac | 2 MiB |
4 | Executable | EVERYTHING IS GOING TO BE OK.exe | 2 MiB |
Compressor | Identifier | Command |
---|---|---|
0 | (none) | |
1 | lz4 | lz4 -1 |
2 | lz4_hc | lz4 -9 |
3 | d_chameleon | sharc -c1 |
4 | d_cheetah | sharc -c2 |
5 | d_lion | sharc -c3 |
File / Compressor | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
1 | 28.3M | 8.6M | 6.9M | 17M | 9.2M | 7.4M |
2 | 151M | 145M | 144M | ERR | ERR | ERR |
3 | 151M | 142M | 142M | 147M | 148M | 150M |
4 | 729M | 688M | 685M | 703M | 711M | 721M |
Yeah, it's not very effective when the file is already compressed. Some compression software also seems to fail on small piece sizes (<256 KiB).
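For anyone who wants to repeat this kind of survey, here is a short sketch of the piece-wise approach: split a file into piece-sized chunks and compress each chunk on its own (no shared dictionary). zstd is used here purely as an example codec rather than the lz4/sharc commands from the tables, and the path and piece size are placeholders.

```python
import zstandard as zstd

def piecewise_compressed_size(path, piece_size=2 * 1024 * 1024, level=3):
    cctx = zstd.ZstdCompressor(level=level)
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(piece_size):
            total += len(cctx.compress(chunk))   # each piece compressed independently
    return total

print(piecewise_compressed_size("some_torrent_payload.bin"))
```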
The use case for this is to allow people to get content without decompressing it first, as well as to avoid having to compress files themselves.
A long time ago I was wondering why the BT protocol never incorporated compression. Another discussion recently brought this thought up again, and I found this proposal.
The use case for this is to allow people to get content without decompressing it first, as well as to avoid having to compress files themselves.
Exactly. Currently the common way is to share an already-compressed version of the data. For A/V this works fine, because you can share the files directly; for any other content it varies greatly, and most of the time it is packed into iso/zip/rar/7z/tarballs, which require the user to download the whole thing even if they are interested in only a specific file from the archive.
Before writing a BEP I would suggest surveying existing torrents and trying to piece-wise compress the data to see how much compression could actually be achieved. I doubt there's much to gain, even with optimized dictionaries.
This could be done... but the results would be biased, simply because currently the BT protocol doesn't do any compression, so users do it themselves. No one will share, say, multiple GB of textual data if the compressed size is something like 200 MB... Adding compression to the protocol could change users' habits of how torrents are used in those cases.
Also, allowing more than one compression algorithm has multiple drawbacks for implementations:
Fully agree. It should be one algorithm, with the possibility to extend or change it in the future. https://github.com/facebook/zstd is currently a good candidate. There could be a handshake at the protocol level, as suggested, leaving it to the clients to decide. But to improve compatibility, portability, and interoperability between clients, there should be one main algorithm mandated by the protocol; otherwise we are at the mercy of different implementations as to whether compression works across the swarm.
A few thoughts about the topic:
OK, these are just loose thoughts. For a more in-depth discussion I would need to get up to speed with the protocol and the implementations...
My main takeaway right now is that compression at the protocol level would be very useful to allow more efficient sharing of "loose" data. Currently, in most cases the data is just compressed into an archive and shared that way. While this works, it requires the user to decompress the data, which means more wasted disk space and, more importantly, shorter seed time, because the user will likely decompress the data and immediately remove the source archive. If compression were handled at the protocol level, users would be more likely to keep seeding, because it is the same data they actually use and it is immediately available after download. More importantly, sharing the same files archived in different ways partitions the swarm; without the original archive it is not possible to join the swarm even if you already have the uncompressed files. v2 per-file hashes will also work better when the actual data is shared rather than some archive of it. I guess I'm saying obvious things, but I want to spark discussion.
Data is transferred in 16 KiB blocks; that would be the natural level to apply compression at. Considering you may have some peers that don't support the extension, you want to be able to download different parts of the same piece from both kinds of peers.
I would think the most reasonable survey would be compression at the 16 KiB block level. I would also imagine zstd would be on the short list of algorithms to choose from.
This could be done... but the results would be biased, simply because currently the BT protocol doesn't do any compression, so users do it themselves.
The file types used and their compression ratios could be taken into account too. Also note that, for example, tar.xz will generally provide much better compression than piece-wise compression.
A few months ago there was an 18 TB NFT torrent, which made a bunch of news. But it turned out to be mostly zeros.
That seems to have been trolling? I don't think optimizing a protocol around that is worthwhile.
It would be beneficial for v2, because files significantly smaller than the piece size introduce overhead from padding up to the piece size.
Clients aren't transferring the padding; they can request smaller-than-block-size byte ranges.
Data is transferred in 16 KiB blocks; that would be the natural level to apply compression at.
Small block sizes aren't good for compression algorithms, though. Possible alternatives are compressing whole pieces or compressing the stream.
Updated #125:
I think this needs to be prototyped first in an existing client before making it a BEP for others to adopt.
I'm not sure how compression algorithms interact with streaming and message boundaries. I suspect the capabilities of libraries vary there. E.g. if one operates zstd in streaming mode without explicit flushing then how much extra data would one have to send in the worst case before the receiving side could decode the first message?
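To make the flushing question concrete, here is a small sketch with the python `zstandard` bindings (the library and flush strategy are my assumptions): explicitly flushing a block at each message boundary guarantees the receiver can decode everything sent so far, at some cost in ratio.

```python
import zstandard as zstd

cctx = zstd.ZstdCompressor(level=3)
cobj = cctx.compressobj()

msg = b"payload of one wire message " * 512        # placeholder message
wire = cobj.compress(msg)
wire += cobj.flush(zstd.COMPRESSOBJ_FLUSH_BLOCK)   # push the message boundary onto the wire

# Without the explicit flush, part of `msg` could still sit in the compressor's
# internal buffer, and the receiver could not yet decode the whole message.
dobj = zstd.ZstdDecompressor().decompressobj()
assert dobj.decompress(wire) == msg
```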
And a downside of TCP-level compression is that a client has to redo the compression for every stream rather than serving compressed blocks that could be cached somewhere. It's a tradeoff and one would have to test what the CPU overhead would be in different scenarios.
Another issue is that BEP 10 says:
Subsequent handshake messages can be used to enable/disable extensions without restarting the connection. If a peer supports changing extensions at run time, it should note that the m dictionary is additive. It's enough that it contains the actual CHANGES to the extension list. To disable the support for LT_metadata at run-time, without affecting any other extensions, this message should be sent: d11:LT_metadatai0ee. As specified above, the value 0 is used to turn off an extension.
So the semantics of changing the compression mid-flight have to be defined.
Hi there!
It has been a year and a half since this proposal was made, and there is still an open question about compression modes and their trade-offs:
Piece compression:
- Pros: CPU-efficient (compress once); memory-efficient (smaller cache)
- Cons: worse compression ratio

Stream compression:
- Pros: better compression ratio; simpler implementation
- Cons: CPU-intensive
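A rough way to measure that trade-off on real payloads, as a sketch assuming zstd, a placeholder input file, and 16 KiB blocks (none of which are mandated by the proposal):

```python
import zstandard as zstd

BLOCK = 16 * 1024
data = open("sample.bin", "rb").read()        # placeholder payload

cctx = zstd.ZstdCompressor(level=3)

# Piece/block compression: every block is an independent frame, so it can be
# compressed once, cached, and served to any peer, but the ratio suffers.
per_block = sum(len(cctx.compress(data[i:i + BLOCK]))
                for i in range(0, len(data), BLOCK))

# Stream compression: one frame over the whole payload gives a better ratio,
# but has to be redone for every connection (or every requested range).
whole_stream = len(cctx.compress(data))

print(f"per-block: {per_block} bytes, whole-stream: {whole_stream} bytes")
```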
What would you think is more preferable?
How would a client that is seeding to peers that both do and don't support compression behave with regard to compressed data that has already hit the wire? Should clients cache the compressed data to avoid compressing the same data again? Or, if both sides support the same algorithms, would they prefer to send the same compressed data?
If streams are preferred, shouldn't the seeder ask the client for a list of the pieces it will request next, so that the stream can be optimized? What happens when there are multiple errors in data transmission? Would clients give up on compression and start sending data as-is?
What would you think is more preferable?
Optional. Software can choose dynamically depending on conditions. Big torrents with a big piece size (32 MB) can be compressed well using piece compression; for a small piece size (<4 MB), the client side can request stream compression and the server side can accept it.
We need this. Why so slow?
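A tiny sketch of that dynamic choice, with the thresholds taken from the comment above (the function name and return values are made up for illustration):

```python
def pick_compression_mode(piece_size: int) -> str:
    if piece_size >= 32 * 1024 * 1024:   # big pieces (32 MB+) compress well on their own
        return "piece"
    if piece_size < 4 * 1024 * 1024:     # small pieces: request stream compression instead
        return "stream"
    return "either"                      # middle ground: leave it to the client
```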
Updated #125:
Optional. Software can choose dynamically depending on conditions. Big torrents with a big piece size (32 MB) can be compressed well using piece compression; for a small piece size (<4 MB), the client side can request stream compression and the server side can accept it. We need this. Why so slow?
@axet Because I suck at writing specifications
When BitTorrent was created, compression algorithms were slow and expensive, so users had to share uncompressed files or compress them manually into ZIP/RAR/etc.
Now people still share files in .zip archives (or even in .rar, yikes!), which take additional space and need to be decompressed somewhere. Instead, why shouldn't we compress torrent pieces on the fly?
We have fast compression algorithms like LZO, LZ4, Snappy and Zstandard, which allow torrenting uncompressed files directly without sacrificing the upload speed you would get from pre-compressed data.