Improving BEP 38 - Githubissues

JohnDoee commented 7 years ago

I don’t like how BEP 38 solves the problem of finding local data and have a suggestion on how I think it could be improved.

My problem is that it is not easy to index all your data and compare it with torrents without actually having all the torrents readily available. It is also not possible to index files when using piece hashes as hints due to alignment.

Add a new key to the torrent called “filehints” with a list of dictionaries similar to the “files” key from the info dictionary. The list contains dictionaries with the following keys:

filename - Name of the file, full path is not really needed as a hit
length - Size of file
start hash - sha1 hash of the first 2048 bytes
end hash - sha1 hash of the last 2048 bytes

If the file is less than 4096 bytes, the end hash is not necessary. The idea with using both start and end hash is to discover files potentially changed in size or changed at the beginning. Also prevents problems with files potentially starting with the same bytes. I’m not sure if this should be extended to contain hashes of bytes from the middle of the file.

With this it is possible to create a file containing a collection of different filehints and compare them with indexed data, fast. Then the user can find the torrents they can easily seed and files they can reuse. My hope with standardizing is to get sites to build databases of their filehints.

The focus is a bit different but I honestly think it's better overall than BEP 38 even for its current use-case. Partly due to my suggestion being more stateless and an exact and easy-to-implement solution.

the8472 commented 7 years ago

There's BEP 47 for per-file hashes

JohnDoee commented 7 years ago

BEP 47 seems to serve torrent-to-torrent matching more than torrent-to-data matching. Indexing data would also mean reading all the data you got stored to get hashes, also no partial matching possible.

Additionally, you can't attach it later because the hashes are in the info dict. BEP 47 really seems to solve a very different problem overall.

What I want is an efficient data <-> torrent matching.

the8472 commented 7 years ago

The available hashes are sufficient for matching, you just need to apply them intelligently. Take a look at my torrent-reindex application. And that doesn't use BEP47 yet, although it could benefit from it in some edge-cases.

BEP 47 seems to serve torrent-to-torrent matching

no, not really.

JohnDoee commented 7 years ago

I built a very similar tool - the problem I'm trying to solve here is discovery of those potentially matching torrents without having to download e.g. 100,000 torrents and instead download a smaller file with discovery information for those 100,000 torrents.

I'm not sure I can change your mind in any way though, you seem very set on how BEP 47 solves it. The efficiency and encroachment of BEP 47 on the info dict makes the solution unusable for me. The data NEEDS to be optional to remove and add the same way announcement URLs are. I can't see a reason why it should not be.

the8472 commented 7 years ago

Ah, that's a very specific use-case for which I think neither BEP38 nor BEP47 have been designed and I don't think most users care about because they do not have access to 100k torrents in the first place.

Still, someone has to do the hashing anyway, which means someone has to have access to the payload data. Which means they can just run the matching logic anyway and calculate a torrent-contains-same-data index.

the8472 commented 7 years ago

The data NEEDS to be optional to remove and add the same way announcement URLs are.

Hashes don't change. So once it is inserted, why would you want to remove it?

JohnDoee commented 7 years ago

Ah, that's a very specific use-case for which I think neither BEP38 nor BEP47 have been designed and I don't think most users care about because they do not have access to 100k torrents in the first place.

For me it's just as much about efficiency in general and the ability to find partially matching files.

Hashes don't change. So once it is inserted, why would you want to remove it?

From a purist point of view attr and symlink path should be in the info dict because it modifies the output. sha1 does not modify or verify the output, i.e. passive and optional data for the actual operation and result of the torrent. I'm not sure if that's the actual "rule" though.

From an adoption point of view, you'll need to redo all your torrents to make them compatible with it. Then everyone who currently seeds the old torrent need to replace it with the new one.

With my suggestion anyone can glue on the hints and distribute the same torrents with hints. In the same way they can add new trackers to get peers from a new places.

the8472 commented 7 years ago

For me it's just as much about efficiency in general and the ability to find partially matching files.

Are partially matching files a common occurrence? And is it a practically useful to attempt to reuse them?

Also, have you considered that headers/tails can get modified by metadata-editing things or entirely rewritten by tools like mkvmerge, even if the interior bitstream stays the same? So your 2k prefix/suffix scheme would attempt to match those parts that are most likely to be modified.

You would need a different scheme, such as hashing of content-based slices to match arbitrary chunks of partially modified data. rsync and ipfs do this.

With my suggestion anyone can glue on the hints and distribute the same torrents with hints.

Yes, but that does not really match the observed lifecycle of torrents. They are created once and very rarely modified. Even less often modified by end-users who already have the data (which would be needed to do the hashing).

Could you describe an end-to-end process how the whole scheme is supposed to work?

From a purist point of view attr and symlink path should be in the info dict because it modifies the output. sha1 does not modify or verify the output, i.e. passive and optional data for the actual operation and result of the torrent. I'm not sure if that's the actual "rule" though.

It's not that simple. The major disadvantage of placing something outside the info dict is that it will not be distributed by BEP 9.

From an adoption point of view, you'll need to redo all your torrents to make them compatible with it. Then everyone who currently seeds the old torrent need to replace it with the new one.

Well, the goal is to simply have hashes included in all new torrents. Sites which rely on dedup-behavior could mandate their presence for uploaded torrents. That would also accelerate adoption.

JohnDoee commented 7 years ago

Are partially matching files a common occurrence? And is it a practically useful to attempt to reuse them?

I added support for this in my AutoTorrent script, it'll try to match files from the head and tail using pieces. I think it reads 10% from each end of the file and need a match bigger than some percent. It can then rewrite the file if it matches from tail but has the wrong size. This feature was added by request although I’m not sure how used it is.

The salvaging of data is also part of BEP 38.

You would need a different scheme, such as hashing of content-based slices to match arbitrary chunks of partially modified data. rsync and ipfs do this.

The finalized idea would probably contain a list of hashes similar to the pieces key where half are generated from the head of the file and half are generated from the tail of the file.

E.g. Given a 10GB file, the hashes list could be generated something like: Read from 0 bytes, 1 MB, 10MB, 100MB, 1GB, 3GB - this is done relative to both ends of the file. That way the size would not influence matching directly. Each hash is calculated from 2048 bytes read at the seek points (or some other small amount of bytes).

The important point here is that it has all the value of the sha1 key from BEP 47 with no downsides.

Yes, but that does not really match the observed lifecycle of torrents.

That’s where I think we are trying to solve a problem for different groups of people. The question is if it’s possible to reach an unified solution that’ll make everyone happy.

The two benefits I want to see is the ability to load up a torrent and the data locally stored shows up and being able to distribute a bundle with hints.

Could you describe an end-to-end process how the whole scheme is supposed to work?

I think I am solving the same problem as BEP 47 but with more features.

It's not that simple. The major disadvantage of placing something outside the info dict is that it will not be distributed by BEP 9.

BEP 9 didn’t even cross my mind, the question is if it could be optional where you are be able to put it because then it’d solve to both our problems. My biggest issue with putting it in the info dict is the lack of backward compatibility and the inability for third-parties to generate it.

Anyways, let me take another stab at this suggestion. I’m not sure how likely it is to do away with the sha1 key though.

Replace the sha1 key from BEP 47 with an “sha1 list” key. The new key contains hashes for small segments of the file, the location of the segments is derived from file-size. Like the sha1 key, they’re just hints and the client must ensure the data conforms to the actual pieces.

Allow the information to be placed outside the info dict for backward compatibility, although it is preferred for new torrents to contain it within the info dict.

Would something like that be doable? It does not take away any functionality, only extends and improves. This would depend on how widely used BEP 47 (and BEP 49, I guess) is.

Benefits over the sha1 key being: faster local data indexing, possible to find partial files and backward compatibility at the cost of modifying two BEPs slightly.

the8472 commented 7 years ago

I still fail to see the big picture here. How would this be used?

Of course I can conjure up some hypothetical scenarios where such data can be put to use, but they seem impractical for most users. File reuse in itself already is not that common, i.e. for most users downloaded data is 100% original content.

Deduplication, merging bulk torrents and mutable torrents also generally seem to be something that can be achieved by available means. So your solution would only be helpful in cases that go beyond that.

You mention sites building file hint databases, but that seems kinda fuzzy to me. Could you clarify how the whole thing is supposed to work? From torrent creator to end user, including intermediate steps.

Only if we have a good description of the use-case can we assess whether the technical solution fits well.

Now to the individual points:

This feature was added by request although I’m not sure how used it is.

Well, for local reuse this seems entirely plausible. But locally you already have the torrents with piece-hashes, so no extensions would be needed for that purpose.

Read from 0 bytes, 1 MB, 10MB, 100MB, 1GB, 3GB - this is done relative to both ends of the file. That way the size would not influence matching directly. Each hash is calculated from 2048 bytes read at the seek points (or some other small amount of bytes).

This does not work reliably because it is highly sensitive to alignment. Alignment changes when metadata is rewritten or multiple files get concatenated or put into an uncompressed archive. That's why I mentioned content-based slicing, which is the canonical solution to this problem.

Additionally, most of this information can already be synthesized from piece hashes. You can essentially boil a torrent down to hints in the form of "if N bytes at offset X in file of size Y matches hash H then it probably belongs to torrent T".

Replace the sha1 key from BEP 47 with an “sha1 list” key.

Existing clients already add sha1 key. BEP47 basically documents existing behavior. So just changing the BEP is not a backwards-compatible thing.

Allow the information to be placed outside the info dict for backward compatibility

That's something that could be added, but as mentioned above, I'd still like to know how people would actually use it.

Benefits over the sha1 key being: faster local data indexing

SHA1s only have to be calculated once and could be kept in a database. So if performance is a concern there already is a big optimization potential that does not require any standards changes.

JohnDoee commented 7 years ago

The specific situation I want to improve upon is the fragmentation in the private torrent tracker community. Many users are on multiple sites and the sites often serve almost the same content. It is a hassle for the end-user to find the correct torrent files matching their local content.

Adaption and usage should go somewhat like this: Users of the sites start creating hints and gathering them, this’ll allow new users the ability to give back without much effort on their part. The collection of hints would contain a mapping to retrieval information for the correct torrent. Hints take up a lot less space than the torrents themselves.

Creators of new torrents can add the hints, sites can embrace the hints and add the hints to their RSS feed to increase the number of seeders faster. Sites can distribute a collection of hints of torrents currently in need of seeders.

Adding the information to the torrent is mostly a convenient way to transfer it.

Using hashes instead of just filename+filesize is a lot more reliable and easier to implement.

Additionally, most of this information can already be synthesized from piece hashes.

Yes, sadly alignment is not very common.

SHA1s only have to be calculated once and could be kept in a database. So if performance is a concern there already is a big optimization potential that does not require any standards changes.

The solution you’re hinting at is using the sha1 as it currently is. My problem is that I don’t agree with you on it being performant.

In real world scenarios, the only time I’d use an SHA1 for the whole file for performance would be if I have an array of files with the same size and file names unusable for comparing where the resulting combinations with torrents surpasses 1000 i.e. the amount of data read to compare with piece hashes is bigger than the total size of these files. And then, only on demand.

This makes the hashes a second-class citizen in my eyes and doesn’t remove complexity as I was also hoping for.

A quick test shows it to be more than 100 times faster to just sample the file compared to a full-file hash.

To cover a not-so-related point: Content-based slice hashing feels like something that would require a small average window size to actually be useful, at least significantly smaller than piece size. Although I did not stumble upon any real-world numbers.

That seems to be something beyond the scope of what I want to achieve here and I used the example more as added benefits than actual goal.

Moving forward, by now it basically boils down to: “Can I use a different hash function ?” - I do understand reluctance to add more hash functions serving almost equal purpose but I hope I argued my distaste for the full-file hash.

the8472 commented 7 years ago

Users of the sites start creating hints and gathering them

This is still very handwavy. How would that work?

How would users use it? Just download the whole collection of hints?

JohnDoee commented 7 years ago

Yes, just download all the hints or a large subset of them. They should be small enough to allow for that.

I have experimented a bit with building a specific endpoint to submit and fetch them but it feels quite implementation-specific. Torrent files doesn't have a defined way to be moved around either.

the8472 commented 7 years ago

For that approach I don't think you need any modifications at all. I would do it approximately like that:

acquire torrents
look for the largest file(s) inside the torrent, they're generally the most interesting ones
find the first piece hash that completely sits inside that file
store that information as a (infohash, filesize, piece offset, piece length, piece hash) tuple. (optional) add file sha1 if it's in the torrent
create a collection of such tuples

For 1 million torrents this would be a 56MB hint file, assuming an optimal binary layout.

That way you only need to download the first few pieces of a few large files and then you can do your dedup-scan based on that.

bittorrent / bittorrent.org

Improving BEP 38 #52