Torrent search index format

lmatteis commented 7 years ago

With updatable torrents and torrent feeds we're slowly moving towards more decentralized ways for publishers to distribute large quantities of updatable content and have subscribers learn about updates.

One problem that consumers of such content face is that they usually first need to download the entire dataset, and therefore cannot search their way through it and prioritize the download of specific pieces based on their interest.

A solution to this would be to implement a B+tree structure, where nodes in the tree are the same size of the pieces of the torrent. The .torrent file would only hold the "root" of the tree.

Searching

A search would start from the root node (provided in the torrent file). Each pointer in the root node links to a piece number. Based on the search, a specific piece is download, and the process continues until the leaf node holding the data is found.

Inserting and deleting

Since we want to reuse as much data as possible and not let subscribes re-download the entire index every time some data is changed, a simple approach would be to use an append-only b+tree, also called a copy-write b+tree. The idea is that the tree is immutable and data is always appended. There's also ways to compact the tree once old portions aren't needed anymore. CouchDB also uses this structure: http://guide.couchdb.org/draft/btree.html

Use case

Imagine you're an entity such as Web Archive and want to distribute a large dump (~40gb) of torrent files. You want your users to be able to achieve keyword search to quickly find relevant content. Instead of creating a single torrent of these torrent files, you create a single "torrent search index" file which is the B+tree explained earlier. The keys in the b+tree are the words you want your users to search for, which can be extracted from the torrent files using available stemmers and analyzers.

You publish this resulting "torrent search index" as a mutable torrent (BEP46) making sure the torrent contains the root node.

Users consuming your mutable torrent will always have the latest index you publish. If they want to search for something they navigate the tree and only download pieces where the content is. Hopefully, with healthy swarms, a keyword search should provide relevant results in only a few seconds.

I was wondering whether something like this has been proposed before and what you think about it?

the8472 commented 7 years ago

The .torrent file would only hold the "root" of the tree.

And where would the rest be?

lmatteis commented 7 years ago

@the8472 other nodes of the tree would simply pieces of the torrent.

Taking for example this "flattened tree":

flattened-btree-page-structure

"root 13" is stored in the torrent. But all other nodes to its left would be pieces.

the8472 commented 7 years ago

in general pieces have to correspond to entries in the files list (or possibly a tree in the future).

lmatteis commented 7 years ago

To navigate from the root, you always request pieces from the network. So "root 13" contains two pointers to "piece 1 and piece 4". You request them from network and continue.

the8472 commented 7 years ago

I'm not asking how a tree works, that's not rocket science. My question is how the tree's nodes correspond to the file layout of a torrent. Each piece has to be part of one or more files, or if we're generous at least it must be covered by an entry in the files list which usually describes a file or file-like object.

In other words, if you intend to put some data into the piece address space (your btree index's nodes) then you must also define how those pieces are described in the file list.

ssiloti commented 7 years ago

I don't think an append-only b-tree is what you really want for this. The main advantage of append-only data structures is fast writes with rotational storage, but that's not the main concern here. What we really care about is minimizing the number of pieces modified by an update, which regular b-trees already do a good job of.

Shameless plug: You can achieve something like this by building an sqlite database of keyword-> infohash mappings and accessing it using sqltorrent to enable running queries against the partially downloaded database file.

lmatteis commented 7 years ago

@ssiloti did not know of sqltorrent. That looks very interesting. How do you handle this reuse of data across updates with an sqlite database?

lmatteis commented 7 years ago

@the8472 not sure I understand. I'm not putting any data in the piece address space.

the8472 commented 7 years ago

But you do? The pointers in a tree need to be stored somewhere. And you said

other nodes of the tree would simply pieces of the torrent.

Which means the pointers need to be stored in the piece space

ssiloti commented 7 years ago

@lmatteis You just re-check the database file against the new torrent. Sqlite already tries hard to minimize the number of sectors/pieces it has to touch during an update because it's good for performance in general. Thus most of the pieces will be the same and will not need to be re-downloaded.

lmatteis commented 7 years ago

The pointers are stored in the root. That's separate from the pieces list, given as an extra piece of info with the torrent.

When you download a piece, being a node, that contains pointers to other pieces.

the8472 commented 7 years ago

A tree contains interior pointers in non-root nodes.

bittorrent / bittorrent.org