bittorrent / bittorrent.org


[Proposal] BEP56: Data compression extension #124

Open Saiv46 opened 2 years ago

Saiv46 commented 2 years ago

When BitTorrent was created, compression algorithms were slow and expensive, so users had to share uncompressed files or compress them manually into ZIP/RAR/etc.

Now people still share files in .zip archives (or even in .rar, yikes!), which take additional space and need to be decompressed somewhere. Instead, why shouldn't we compress torrent pieces on the fly?

We have fast compression algorithms like LZO, LZ4, Snappy and Zstandard, which allow torrenting uncompressed files directly without sacrificing the upload speed you would get if the content were pre-compressed.

Saiv46 commented 2 years ago

There are compression libraries focused on compression/decompression speed:

arvidn commented 2 years ago

Do you have any benchmarks on how well these compression algorithms work on various kinds of work-loads? It would be interesting to have some idea of how helpful these compression algorithms would be.

Do you have any recommendations on how a client would choose whether to use compression at all? (presumably, some files just have too much entropy for it to be possible to compress them). If all compression algorithms are tried for every block, and the best one is chosen, it might become quite costly.

16 kiB is quite a small buffer to compress. I would expect the size of the compression dictionary would make it difficult to save much space. Have you considered separating the dictionary from the block buffers, to allow the same dictionary to be reused for multiple blocks?
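
For what it's worth, zstd already supports exactly that split: a dictionary trained once and referenced by every block. Below is a minimal sketch using the python-zstandard bindings; the file name and sample count are made up for illustration, and dictionary training needs a reasonably large, varied set of samples to succeed.

```python
import zstandard as zstd

BLOCK = 16 * 1024  # the 16 KiB transfer-block size discussed above

def blocks(path):
    with open(path, "rb") as f:
        while chunk := f.read(BLOCK):
            yield chunk

# Train one dictionary on a sample of blocks, then reuse it for every block
# instead of paying the dictionary cost inside each tiny 16 KiB buffer.
samples = list(blocks("payload.bin"))[:512]            # hypothetical input file
dictionary = zstd.train_dictionary(64 * 1024, samples)

cctx = zstd.ZstdCompressor(level=3, dict_data=dictionary)
dctx = zstd.ZstdDecompressor(dict_data=dictionary)

for block in blocks("payload.bin"):
    wire = cctx.compress(block)          # per-block frame, shared dictionary
    assert dctx.decompress(wire) == block
```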

the8472 commented 2 years ago

Now people still share files in .zip archives (or even in .rar, yikes!), which takes additional space and needs to be uncompressed to somewhere.

I think in many cases that's because some clients don't handle many small files well. Or because the source data they share already comes as an archive. I.e. compression is not necessarily the motivating factor.

The only time where it would be valuable to compress the contents is when you're dealing with large amounts of textual data. Huge SQL/JSON/CSV dumps etc., wikipedia archives and so on. But those often already are compressed at the source stage too.

Instead of baking compression into the torrent format, I think a better option would be to add an extension message that compresses its payload and can be sent in place of several PIECE messages. If the data compresses well (2-3x), the client can cram several pieces into the common message size limit. Perhaps another message could send precomputed dictionaries for particular ranges. All of this could be precomputed by the client as auxiliary data without having to change the metadata format. It would be an opt-in wire protocol feature instead of a backwards-incompatible change.

Edit: Looks like I misunderstood and you already had on-the-wire compression in mind, I thought this was about doing it in the torrent file format similar to archive formats.
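
A rough sketch of what such a message could look like on the wire, assuming a hypothetical extension ID negotiated via BEP 10; none of the payload layout below is specified anywhere, it only illustrates cramming several contiguous blocks into one compressed payload.

```python
import struct
import zstandard as zstd

# Hypothetical extended-message ID for a "compressed pieces" message,
# assumed to have been negotiated through the BEP 10 "m" dictionary.
COMPRESSED_PIECES_ID = 5

def compressed_pieces_message(piece_index: int, begin: int, blocks: list[bytes]) -> bytes:
    """Pack several contiguous blocks into one compressed extension message."""
    header = struct.pack(">II", piece_index, begin)        # illustrative payload header
    payload = header + zstd.ZstdCompressor(level=3).compress(b"".join(blocks))
    body = struct.pack(">BB", 20, COMPRESSED_PIECES_ID) + payload  # 20 = BEP 10 "extended"
    return struct.pack(">I", len(body)) + body             # standard 4-byte length prefix
```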

the8472 commented 2 years ago

Also, allowing more than one compression algorithm has multiple drawbacks for implementations:

- needs multiple dependencies
- different clients may end up with non-overlapping subsets of supported compressors
- an uploading client may have to precompute the compressed data in multiple formats due to different support levels by different downloaders

If we add compression at all then imo it should only be zstd. It's tunable and supports external dictionaries.

Saiv46 commented 2 years ago

Thanks for your feedback!

Do you have any recommendations on how a client would choose whether to use compression at all? (presumably, some files just have too much entropy for it to be possible to compress them).

It's implementation-defined behaviour (some clients may want to try several algorithms, while others will just use the default one). As in the draft:

Some algorithms can't compress unreasonably small pieces efficiently, so
client should lower algorithm's priority during handshake, depending on
expected compression ratio for defined piece size (down to zero in worst case).

Also, allowing more than one compression algorithm has multiple drawbacks for implementations:

needs multiple dependencies

Well, an early-stage implementation will mostly need to support one compression algorithm and many decompression algorithms.

different clients may end up with non-overlapping subsets of supported compressors

The compression algorithm is selected by taking the dictionary item with the highest
priority from the intersection of the items supported by both peers; if there
isn't any suitable compression algorithm, compression will be disabled.
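
A minimal sketch of that negotiation, assuming each peer advertises a hypothetical name-to-priority map in its handshake; the draft excerpt above doesn't say how the two peers' priorities are combined, so this sketch simply sums them.

```python
def pick_algorithm(ours: dict[str, int], theirs: dict[str, int]) -> str | None:
    """Pick the common algorithm with the highest priority, or None to disable compression."""
    # Priority 0 means "disabled for this piece size", per the draft excerpt above.
    common = [name for name in set(ours) & set(theirs) if ours[name] > 0 and theirs[name] > 0]
    if not common:
        return None
    return max(common, key=lambda name: (ours[name] + theirs[name], name))

# Example: lz4 is the only algorithm both peers support, so it wins.
assert pick_algorithm({"zstd": 10, "lz4": 5}, {"lz4": 8, "brotli": 3}) == "lz4"
```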

an uploading client may have to precompute the compressed data in multiple formats due to different support levels by different downloaders

If the client is caching seeded/received pieces in memory, then pieces should
be stored compressed and decompressed only when saving to disk or sending to a peer
which does not support that compression. To avoid re-compressing pieces when
sending, the client can specify the ``cp`` field during the handshake.

The client can compress data in a single format and then specify it as the preferred algorithm. If the other side does not support it, the data will be sent uncompressed.

If we add compression at all then imo it should only be zstd. It's tunable and supports external dictionaries.

Same, IMO. I think it's going to be the first format required by this BEP.

the8472 commented 2 years ago

Before writing a BEP I would suggest surveying existing torrents and trying to piece-wise compress the data to see how much compression could actually be achieved. I doubt there's much to gain, even with optimized dictionaries.

Maybe there's some niche use-case, but that should be demonstrated.

Saiv46 commented 2 years ago

I split files into pieces and then compressed them "naively" (no shared dictionary); here's what I got:

When I wrote this BEP, I worried about compression speed. But when testing things out, the bottleneck was I/O (most people seeding from HDD).

Files:

| # | Type | Filename | Piece size |
|---|------|----------|------------|
| 1 | Text | ftp.ic.rnd.mvd.ru_file_list.txt | 2 MiB |
| 2 | Audio | Песни сибирской каторги.flac | 128 KiB |
| 3 | Audio | Песни сибирской каторги.flac | 2 MiB |
| 4 | Executable | EVERYTHING IS GOING TO BE OK.exe | 2 MiB |

Compression algorithms:

| # | Identifier | Command |
|---|------------|---------|
| 0 | (none) | |
| 1 | lz4 | lz4 -1 |
| 2 | lz4_hc | lz4 -9 |
| 3 | d_chameleon | sharc -c1 |
| 4 | d_cheetah | sharc -c2 |
| 5 | d_lion | sharc -c3 |

Results

| File # | 0 | 1 | 2 | 3 | 4 | 5 |
|--------|---|---|---|---|---|---|
| 1 | 28.3M | 8.6M | 6.9M | 17M | 9.2M | 7.4M |
| 2 | 151M | 145M | 144M | ERR | ERR | ERR |
| 3 | 151M | 142M | 142M | 147M | 148M | 150M |
| 4 | 729M | 688M | 685M | 703M | 711M | 721M |

Yeah, it's not very effective when the file is already compressed. Some compression software seems to fail on small piece sizes (<256 KiB).
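
For anyone who wants to repeat this kind of survey, a naive piece-wise pass is only a few lines; the sketch below uses zstd rather than the exact lz4/sharc commands from the table above, purely as an illustration.

```python
import sys
import zstandard as zstd

def piecewise_ratio(path: str, piece_size: int = 2 * 1024 * 1024, level: int = 3) -> float:
    """Compress each piece independently (no shared dictionary) and report the size ratio."""
    cctx = zstd.ZstdCompressor(level=level)
    raw = packed = 0
    with open(path, "rb") as f:
        while piece := f.read(piece_size):
            raw += len(piece)
            packed += len(cctx.compress(piece))
    return packed / raw

if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(f"{name}: {piecewise_ratio(name):.1%} of original size")
```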

The use case for this is to let people get content without decompressing it first, and to spare them compressing the files themselves.

kasper93 commented 2 years ago

I was wondering a long time ago why the BT protocol never incorporated compression. Now another discussion brought up this thought again and I found this proposal.

The usecase for that is to allow people to get content without decompressing it first, as well as to not compress files themself.

Exactly. Currently the common way is to share an already-compressed version of the data. For A/V it works fine, because you can share the files directly; for any other content it varies greatly, and most of the time it is packed into iso/zip/rar/7z/tarballs, which require the user to download the whole thing even if they are interested only in a specific file from the archive.

Before writing a BEP I would suggest surveying existing torrents and and trying to piece-wise compress the data and see how much compression could actually be done. I doubt there's much to gain, even with optimized dictionaries.

This could be done... but the results would be biased, simply because the BT protocol currently doesn't do any compression, so users do it themselves. No one will share multiple GB of, say, textual data if the compressed size is around 200 MB... Adding compression to the protocol could change users' habits in those cases.

Also, allowing more than one compression algorithm has multiple drawbacks for implementations:

Fully agree. It should be one algorithm, with the possibility to extend/change it in the future. https://github.com/facebook/zstd is currently a good candidate. There could be a handshake on the protocol level, as suggested, leaving the decision to the clients. But to improve compatibility, portability and interoperability between clients, there should be one main algorithm enforced by the protocol, or else whether compression works across the swarm is at the mercy of the different implementations.

A few thoughts about the topic:

OK, but I'm just giving loose thoughts. For a more in-depth discussion, I would need to get up to speed with the protocol and its implementations...

My main takeaway right now is that compression on the protocol level would be very useful for more efficient sharing of "loose" data. Currently, in most cases, data is compressed into an archive and shared that way. While this works, it requires the user to decompress the data, which means more wasted disk space and, more importantly, shorter seed time, because the user will likely decompress the data and immediately remove the source archive. If it were handled on the protocol level, users would be more likely to keep seeding, because what they seed is the same data they use, and the data itself is immediately available after download.

More importantly, sharing the same files archived in different ways partitions the swarm; without the original archive it is not possible to join the swarm even if you already have the uncompressed files. v2 per-file hashes also work better when the actual data is shared rather than some archive of it. I guess I'm saying obvious things, but I want to spark discussion.

arvidn commented 2 years ago

Data is transferred in 16 kiB blocks; that would be the natural level to apply compression to. Considering that some peers may not support the extension, you want to be able to download different parts of the same piece from both kinds of peers.

I would think the most reasonable survey would be compression at the 16 kiB block level. Also, I would imagine zstd would be on the short list of algorithms to choose from.

the8472 commented 2 years ago

This could be done... but the results would be bias, simply because currently BT protocol doesn't do any compression, so users do it themselves.

The file types used and their compression ratios could be taken into account too. Also note that for example tar.xz will generally provide much better compression than piece-wise compression.

A few months ago there was an 18 TB NFT torrent, which made a bunch of news. But it turned out to be mostly zeros.

That seems to be trolling? I don't think optimizing a protocol around that is worthwhile.

It would be beneficial for v2, because files significantly smaller than piece size introduce overhead with padding to piece size.

Clients aren't transferring the padding; they can request smaller-than-blocksize byte ranges.
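
That already falls out of the standard BEP 3 request message, which carries an explicit length field; a tiny sketch with made-up numbers:

```python
import struct

def request_message(piece_index: int, begin: int, length: int) -> bytes:
    """Standard BEP 3 request: <len=13><id=6><index><begin><length>."""
    return struct.pack(">IBIII", 13, 6, piece_index, begin, length)

# Hypothetical v2 torrent: the last piece holds only 10 000 real bytes before the
# padding, so the client requests exactly those bytes and never transfers the pad.
msg = request_message(41, 0, 10_000)
```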

data is transferred in 16kiB blocks, that would be the natural level to apply compression to.

Small block sizes aren't good for compression algorithms, though. Possible alternatives:

Saiv46 commented 2 years ago

Updated #125:

the8472 commented 2 years ago

I think this needs to be prototyped first in an existing client before making it a BEP for others to adopt.

I'm not sure how compression algorithms interact with streaming and message boundaries. I suspect the capabilities of libraries vary there. E.g. if one operates zstd in streaming mode without explicit flushing then how much extra data would one have to send in the worst case before the receiving side could decode the first message?
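
With python-zstandard, streaming with an explicit flush per message looks roughly like the sketch below; the flush guarantees the receiver can decode each message immediately, at the cost of a few extra bytes per flush. The message contents here are obviously placeholders.

```python
import zstandard as zstd

# One long-lived stream per connection; flush at every message boundary so the
# receiver never has to wait for later data to decode the current message.
compressor = zstd.ZstdCompressor(level=3).compressobj()
decompressor = zstd.ZstdDecompressor().decompressobj()

for message in [b"PIECE 0 " * 512, b"PIECE 1 " * 512]:   # placeholder wire messages
    on_the_wire = compressor.compress(message) + compressor.flush(zstd.COMPRESSOBJ_FLUSH_BLOCK)
    assert decompressor.decompress(on_the_wire) == message  # decodable right away
```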

And a downside of TCP-level compression is that a client has to redo the compression for every stream rather than serving compressed blocks that could be cached somewhere. It's a tradeoff and one would have to test what the CPU overhead would be in different scenarios.

Another issue is that BEP 10 says:

Subsequent handshake messages can be used to enable/disable extensions without restarting the connection. If a peer supports changing extensions at run time, it should note that the m dictionary is additive. It's enough that it contains the actual CHANGES to the extension list. To disable the support for LT_metadata at run-time, without affecting any other extensions, this message should be sent: d11:LT_metadatai0ee. As specified above, the value 0 is used to turn off an extension.

So the semantics of changing the compression mid-flight have to be defined.
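
For concreteness, this is what the additive BEP 10 handshakes would look like for a hypothetical compression extension named ut_compress (the name is made up); whether data already queued keeps its old compression setting is exactly the part the BEP would have to pin down.

```python
# BEP 10 extended handshakes with an additive "m" dictionary: map the hypothetical
# "ut_compress" extension to local message ID 3, then later turn it off again
# (0 disables an extension, per BEP 10).
enable  = b"d1:md11:ut_compressi3eee"
disable = b"d1:md11:ut_compressi0eee"
```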

Saiv46 commented 1 year ago

Hi there!

It's been one and a half years since this proposal was made, and there's still an open question about the compression modes and their trade-offs:

1. Piece compression
   - Pro: CPU-efficient (compress once)
   - Pro: memory-efficient (smaller cache)
   - Con: worse compression ratio

2. Stream compression
   - Pro: better compression ratio
   - Pro: simpler implementation
   - Con: CPU-intensive

What would you think is more preferable?

braiam commented 9 months ago

How would a client that is seeding to peers that both do and do not support compression behave with respect to compressed data that has already hit the wire? Should clients cache the compressed data to avoid compressing the same data twice? Or, if both peers support the same algorithms, would they prefer to send the same data?

If streams are preferred, shouldn't the seeder ask the client for a list of the pieces it will request next, so the stream can be optimized? What happens when there are multiple errors in data transmission? Would clients stop trying to compress and start sending data as-is?

axet commented 5 months ago

What would you think is more preferable?

Optional. Software can choose dynamically depending on conditions. Big torrents with a big piece size (32 MB) can be compressed well using piece compression; for a small piece size (<4 MB), the client side can request stream compression, to be accepted by the seeding side.

We need this. Why so slow?

Saiv46 commented 5 months ago

Updated #125:

Saiv46 commented 5 months ago

Optional. Software can choose dynamically depending on conditions. Big torrents with a big piece size (32 MB) can be compressed well using piece compression; for a small piece size (<4 MB), the client side can request stream compression, to be accepted by the seeding side. We need this. Why so slow?

@axet Because I suck at writing specifications