Redesign of the Search/Channels feature

ichorid commented 6 years ago

Use cases.

We strive to provide the same kind of service that is provided by:

Highly organized repositories of torrents (trackers)
Scientific articles repositories.
Music playing services, like Spotify.
Video playing services, like Youtube.
Scanned books archives, like Google Books.
Internet libraries

On fuzzy searches.

We do not try to become a distributed Google. Google is good at finding the knowledge that is relevant to words. It is used when the user does not know what he or she is looking for. Our user always knows what he or she wants. Therefore, we should disable metadata (or torrent contents) search by default, and only search by the torrent name. On the other hand, the user should be able to use metadata to refine the search results. Like, "sort by date of creation" or "sort by tag/channel", etc. This model already works perfectly for all use cases indicated above (except Youtube).

Data structurization.

We must be able to replicate the basic information organization structures that are used by our example use cases:

- Tracker forum: tree-like structure, traversable both up and down. When the user finds a torrent/channel, he or she should be able to see its immediate parent/children nodes, and the "root" of the hierarchy. Thus, nested channels.
- Scientific articles database: a single record in the database should be able to hold an arbitrary amount of metadata fields, and each field must be able to be used as a base for a hierarchy. For example, top-level owner/channel "ScientificArticles" could provide "pseudochannels" for "authors", "journals", "years", "tags", etc. These are essentially selecting the metadata field that would be used to build the structure "just in time". Advanced users would be able to select several criteria at once, to find the intersection of metadata sets. Thus nested channels can fully exploit the features of the underlying relational database.
- Music search: should be fast and simple. This can only be achieved by implementing a distributed cloud-based search, with some data cached on the user's device. We can't make a new user download XX gigabytes of search database, especially for mobile usage (which is basically everything for music playing). One search query should take 5-30 seconds, and its results should include related info (e.g., the search for a single song should return that song and reveal the link to the whole album). To solve this problem, at some point we could introduce search mining: fast hosts producing highly relevant search results get credit.
- Libraries and book archives fall somewhere between scientific articles and tracker forum.

It is important to note that a single instance of the system is not required to serve all of these features at once. Instead, it should provide building blocks, or modules, to compose the use cases as necessary. For an average user a single database could be used to serve every type of content, but, e.g., users interested in building a huge scientific database should be able to set up a separate database instance optimized for scientific articles.

Constraints on design choices

The design should consist of several completely independent parts:

- Networking subsystem
  - Data transfer protocol
  - Trustworthy gossip protocol
- Storage subsystem
  - Database/store
  - Cache management / mining algorithm
- Search subsystem
  - Distributed search protocol
  - Distributed search algorithm
  - Data distribution algorithm
- Metadata format

It is very important that none of these parts depends on a specific implementation of another. We must design the system so that it is trivial to exchange one implementation of a part for another.

User experience requirements:

Search results should appear faster than the user can process them.

Implementation requirements:

Scalability (should only sync neighbors, and not the whole network)

synctext commented 6 years ago

Mostly duplicate of #2455

ichorid commented 6 years ago

The frequency distribution of words in all natural languages follows Zipf's law. Speakers generally optimize the term (word, phrase ...) that they assign to the denotatum (the physical object/idea ...) for efficiency of communication. I.e., when I ask someone where I can find "John Doe", the combination of words "John"+"Doe" uniquely identifies the person by that name, at least in the context of the conversation. On the other hand, if I ask where I can find "blue table", this could mean either "a blue table" or "the blue table".

"A blue table" is a general request, and the person would probably point me to a furniture store, or try to sell me a table (if he is in the furniture business). In search terms, "a blue table" is a general search query, Google-style.
"The blue table" is an exact request, and the person would probably be confused if there is no special thing in her mind that is denoted as "the blue table". Otherwise, the person would point me to it directly. In search terms, "the blue table" is a search for the exact infohash (or some special search token), or a search in local context.

So, the best thing about natural languages is that humans use them to name things, like movies, games, music, torrents... This means that when someone does a search query for a thing he already knows exists, he addresses it by the terms that identify that thing uniquely enough in the database of human knowledge. And when someone just wants something in some genre, he names it by the name of the genre. Everything in between ("that movie with this guy with cool beard") we leave to Google. This means we could use the same search mechanism for both exact queries ("Bla-bla-avenger-name 2035") and for genre search ("western 2035").

ichorid commented 6 years ago

FASD: A Fault-tolerant, Adaptive, Scalable, Distributed Search Engine (2003 master's thesis) - almost exactly what I came up with. No works by that author afterward, though :disappointed:

Associative search in peer to peer networks: Harnessing latent semantics (2006)

A nice survey on the topic of decentralized search: "Survey of Research towards Robust Peer-to-Peer Networks: Search Methods"

Hmmm... This stuff seems to have been beaten to death in the first decade of the 2000s. Still, no one truly succeeded.

ichorid commented 6 years ago

Design details for the new Channels/AllChannel/Search subsystem

Class design

All 3 communities inherit from base Metadata Community, that implements ORM and basic metadata query messages:

| Community | Messages | Message handling |
| --- | --- | --- |
| :european_castle: Metadata ("Chant") | (1): remote query for metadata based on some constraints; (2): answer to (1) with metadata from the local store | Can optionally include filters to indicate what not to send. |
| :snake: AllChannel ("Snakes") | (1),(2) | Will ask for metadata elements with high popularity score and/or top-level metadata channels. |
| :bear: Channels ("Bears") | (1),(2) | Will ask for recently added metadata elements with specific tags. |
| :bird: Search ("Birds") | (1),(2); (3): answer to (1) by introducing some other node | Could result in "long walks" according to some algorithm. |

The database interface is implemented as a separate Pony ORM-based Tribler module, MetadataStore. The base Metadata Community class implements basic messages for asking a remote peer for metadata elements based on some criteria. Child communities differ mostly in how they react to metadata queries, thus creating different search domains. The channel data is downloaded in the form of a torrent stuffed with metadata in Tribler Serializer (Python Struct) format. A metadata query is sent in the form of a simple logical expression based on tags and ranges. (TBD)

Metadata

Metadata entries are encapsulated in signed payloads and stored in a serialized format according to the following scheme:

- Metadata type ID
- The metadata author's public key
- The pointer into this metadata's author's TrustChain entry documenting its creation
- Metadata ...
- ...
- ...
- The gossip author's signature (signs everything above)

Basically, the signature and the author's public key form a trustworthy gossip container. Anything could be put in there, but we use it for packing various formats of metadata.
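The container layout described above can be sketched in a few lines. This is only an illustration under loud assumptions: the field widths (2-byte type ID, 32-byte key, 8-byte TrustChain pointer) are made up, and HMAC-SHA256 with a shared secret stands in for the real public-key signature, which the standard library does not provide. The names `pack_gossip` and `unpack_gossip` are hypothetical.

```python
import hashlib
import hmac
import struct

# Assumed layout: type ID (2 bytes), author key (32 bytes),
# TrustChain pointer (8 bytes), metadata body, then the signature.
HEADER = struct.Struct(">H32sQ")
SIG_LEN = 32


def sign(secret: bytes, data: bytes) -> bytes:
    # Placeholder only: a real container would use a public-key signature.
    return hmac.new(secret, data, hashlib.sha256).digest()


def pack_gossip(type_id: int, author_key: bytes, tc_pointer: int,
                body: bytes, secret: bytes) -> bytes:
    # The signature covers everything that precedes it.
    signed_part = HEADER.pack(type_id, author_key, tc_pointer) + body
    return signed_part + sign(secret, signed_part)


def unpack_gossip(blob: bytes, secret: bytes):
    signed_part, signature = blob[:-SIG_LEN], blob[-SIG_LEN:]
    if not hmac.compare_digest(signature, sign(secret, signed_part)):
        raise ValueError("bad signature, entry discarded")
    type_id, author_key, tc_pointer = HEADER.unpack_from(signed_part)
    return type_id, author_key, tc_pointer, signed_part[HEADER.size:]
```

Because the body is opaque bytes, any metadata format can be carried without changing the container code.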

The primary metadata format for a torrent is described below:

| Field | Type | Required |
| --- | --- | --- |
| ID | pointer | Yes |
| Torrent infohash | infohash | Yes |
| Torrent size | long int | Yes |
| Torrent date | date | Yes |
| Torrent title | string | No |
| Tags | string | No |
| Tracker info | string | No |

Notes on format:

"Tags" are search tokens that can include, for example, the hash of upper-level channels, tags for content type, etc.
The "Tags" format is a "."-separated string that can include machine tags, Flickr-style.
"Author's signature" digitally signs the metadata entry. The entry is discarded if it is not properly signed. The author's public key is simultaneously used as a channel ID. (The signature uniquely identifies the metadata entry, and is used as a primary key for SQL storage - TBD when Pony ORM fixes python 2.7 buffer hashing problem in release 0.7.7)

Metadata size

For each metadata format, there are two versions: full and short.

Short version's serialized and signed size should not exceed 1000 bytes (for future compatibility with BEP44), so it could always be sent in a single IPv8 UDP packet.
Full version has no size restrictions. It could only be distributed in metadata torrent collections, and MUST have short version distributed alongside it, for use in IPv8 messages.
The reason for these requirements is that some content types (e.g. scientific articles) can have very large metadata, which must still be searchable and queryable remotely (e.g. by the Birds Community).

:snake: AllChannel protocol :snake:

It's like a "snakes and ladders" game, with channel tags used as "snakes" or "ladders" to other channels. This requires nested channels support. When a node receives some channel metadata from another node, the receiver checks if the received metadata contains a previously unseen tag leading to a higher-level channel in the channels hierarchy and queries the peer for this channel. Therefore, "the universal concepts" are exchanged with the highest priority.

:bear: Channels protocol :bear:

"TasteBuddies" are established according to seeded torrents, seeded metadata directories (channels), search queries (potential privacy data leak?). Special walker regularly queries buddies for metadata updates.

:bird: Search protocol :bird:

Search engine will use the [Summary prefix tree](https://sci-hub.tw/10.1109/NCA.2017.8171372). The queries could be naturally done in parallel. All the returned metadata entries are saved to the local store.

Metadata torrents

A metadata torrent (channel torrent) is formed from concatenated serialized Metadata entries (.mdblob files). When updates are published to the channel, they first appear in their publisher's local metadata store (LMS) and are published with the Channels protocol. When enough new torrents have been added to a channel, the new metadata entries are concatenated into a new .mdblob file, which is then added to the new version of the metadata torrent. This process is append-only, so users don't have to re-download the whole torrent. The metadata torrent is split into "file-chunks" of a fixed size (1 MB), for efficiency of processing.

Content deletion

When the user wants to delete a torrent from their channel, a special DELETE entry is added to the channel contents. This entry follows the basic Metadata serialized message format (signatures, timestamps, etc.), and adds a single delete_signature entry that contains the signature of the original metadata entry that should be deleted.

ichorid commented 6 years ago

Summary prefix tree: An over DHT indexing data structure for efficient superset search (Nov 2017) - this is almost exactly what I came up with. The algorithm can efficiently look up queries consisting of supersets of keywords. It's based on a cross between Bloom filters and DHTs. The problem is, the authors' writing style is not too good... It's only 5 pages, but it badly hurt my brain :dizzy_face:

However, the idea is relatively simple: text -> tokens -> bloom filter -> prefix tree (trie) -> DHT. And the authors demonstrate it works in experiment, and scales at least to 10^6 indexed items.
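To make the text -> tokens -> bloom filter -> prefix tree chain more concrete, here is a minimal sketch of the first three steps and of the superset test they enable. It is not the paper's algorithm: the filter width, the number of hash functions and all function names are assumptions, and the DHT part is left out entirely.

```python
import hashlib

BLOOM_BITS = 64   # assumed filter width; the paper's parameters may differ
NUM_HASHES = 2    # assumed number of hash functions per token


def tokenize(text: str) -> set:
    return set(text.lower().split())


def bloom(tokens) -> int:
    """Fold a token set into a BLOOM_BITS-wide bit mask."""
    mask = 0
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        for i in range(NUM_HASHES):
            mask |= 1 << (digest[i] % BLOOM_BITS)
    return mask


def trie_key(mask: int) -> str:
    """Bit string that could serve as the path in a prefix tree."""
    return format(mask, "0{}b".format(BLOOM_BITS))


def may_match(query_mask: int, item_mask: int) -> bool:
    """An item can contain all query tokens only if it sets every query bit."""
    return (query_mask & item_mask) == query_mask
```

An item whose title contains every query keyword always passes `may_match`. False positives are possible, as with any Bloom filter, so candidates still need a final check against the real token sets.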

xoriole commented 6 years ago

- level 1: local search
- level 2: neighbor search, based on "trust magic"
- level 3: DHT long-tail content

architecture is based on 2 components and their connection:

- tag of fixed length text-field
- info-hashes
- moderators or endorsers

first release features (strictly limited):

- create channel
- add torrent
- remove torrent
- vote for channel

ichorid commented 6 years ago

Update procedure

An update is always requested and never pushed.
Channel update request = channel_public_key + last_update_time + local_channel_version

Initially, metadata torrent is created by concatenating the list of metadata entries in a serialized form. Updates are stored Git-style: each change is represented as a separate file.

To add a new MD entry to an existing MD torrent, a new file NNN.mdblob is added to the torrent. The file contains all MD entries from this update.
If at some point, there are too many deleted entries in the MD torrent, the author can produce a "defragmented" version of the torrent and publish it.
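A rough sketch of the update request and the append-only NNN.mdblob layout described above. The three request fields come from the text; the zero-padded file naming, the `append_update` and `blobs_newer_than` helpers, and the idea that the channel version equals the highest blob number are assumptions made for illustration.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ChannelUpdateRequest:
    channel_public_key: bytes
    last_update_time: int       # assumed: seconds since the epoch
    local_channel_version: int  # assumed: highest blob number already held


def _numbered_blobs(channel_dir: Path) -> list:
    return sorted(channel_dir.glob("*.mdblob"), key=lambda p: int(p.stem))


def append_update(channel_dir: Path, entries: list) -> Path:
    """Write one update as the next NNN.mdblob file; older files stay untouched."""
    existing = _numbered_blobs(channel_dir)
    next_number = int(existing[-1].stem) + 1 if existing else 1
    blob_path = channel_dir / "{:03d}.mdblob".format(next_number)
    blob_path.write_bytes(b"".join(entries))
    return blob_path


def blobs_newer_than(channel_dir: Path, request: ChannelUpdateRequest) -> list:
    """Files that the requesting peer still has to fetch."""
    return [p for p in _numbered_blobs(channel_dir)
            if int(p.stem) > request.local_channel_version]


with tempfile.TemporaryDirectory() as tmp:
    channel_dir = Path(tmp)
    first = append_update(channel_dir, [b"entry-1", b"entry-2"]).name
    second = append_update(channel_dir, [b"entry-3"]).name
    request = ChannelUpdateRequest(b"pk", 0, 1)
    missing = [p.name for p in blobs_newer_than(channel_dir, request)]
```

Since existing files are never rewritten, a peer that already holds version 1 only needs the files numbered above it.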

synctext commented 6 years ago

All of Allchannel can be removed if we add to each channel the ability to push 1k content items + channel votes. Votes are simply signed +1s by users. Votes are the responsibility of the channel. It is incentive compatible to let channels work for their own promotion. Votes for channels are future proof: they can grow towards votes for tags per torrent, torrent duplicates, torrent bundles, etc.

Cool?

ichorid commented 6 years ago

@synctext, the initial idea was exactly that: remove AllChannels completely. The method of dissemination of information about new channels is the Snakes gossip protocol:

1. I connect to my friend and tell him: "Give me some popular content that you've got since moment X (our last data exchange)"
2. My friend sends me some fresh metadata (containing both channel torrents and regular torrents).
3. I analyze it for new tags and/or Public Keys (PKs).
   3.1 If there are some torrents with PKs and/or tags that I have not seen before, I send another request to the same friend, asking for the channel torrent with the corresponding PK/tag.
   3.2 If there is a metadata entry containing a new channel torrent, I add it to my database.
4. I download/update all new channel torrents and add everything from them to my local database.

It is important to note that this algorithm eventually leads to spreading root torrent collections, since tags point to the parent channel.
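The exchange above can be modelled with two in-memory peers. This is a toy sketch, not the Snakes protocol itself: `Entry`, `Peer` and `gossip_round` are hypothetical names, tags are ignored (only PKs are checked), the actual torrent download is skipped, and storing regular torrent entries alongside channel entries is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    public_key: str          # author PK, doubling as the channel ID
    title: str
    added_at: int            # logical timestamp of insertion
    is_channel: bool = False


@dataclass
class Peer:
    store: set = field(default_factory=set)

    def fresh_since(self, moment: int) -> list:
        return [e for e in self.store if e.added_at > moment]

    def channel_for(self, public_key: str):
        return next((e for e in self.store
                     if e.is_channel and e.public_key == public_key), None)

    def known_keys(self) -> set:
        return {e.public_key for e in self.store}


def gossip_round(me: Peer, friend: Peer, last_exchange: int) -> None:
    seen_before = me.known_keys()
    for entry in friend.fresh_since(last_exchange):
        # Unknown PK on a regular torrent: ask the same friend for its channel.
        if entry.public_key not in seen_before and not entry.is_channel:
            channel = friend.channel_for(entry.public_key)
            if channel is not None:
                me.store.add(channel)
        # New channel entries (and, by assumption, regular ones) are stored.
        me.store.add(entry)
```

Even when only a fresh torrent is gossiped, the receiver ends up with the channel entry that owns it, which is how root collections spread.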

Opportunistic update of metadata

Depending on my relationships with the peer who sent me this MD, if he sent me the outdated MD, I could push to him the newer version of this MD's parent channel. On the other hand, if I receive some MD that refers to a newer version of the channel I already subscribed to, I could decide to do a preemptive update check on that channel, from the same friend.

ichorid commented 6 years ago

@synctext, the question is, where and how do we store the vote info? It is very easy to store the votes on a person's TrustChain, but that is a completely decentralized store, it does not sum votes. To count the votes on an item globally, the whole network should periodically run something like a prefix sum algorithm. We definitely don't want that.

Of course, vote info could slowly travel the network in the form of a gossip. Trust algorithms ensure that no one would cheat on it.

synctext commented 6 years ago

Yes, ideas seem to converge. Allchannel now has global broadcast of votes. Gossip of votes and filtering based on trust would be a clear required leap forward.

ichorid commented 6 years ago

About basics of gossip protocol: from a personal standpoint, there is no difference between distrust and distaste.

If a sybil region votes for the things I like, and this vote helps me filter info - I trust this vote. If a proven real person I know votes for things I don't like - I distrust the vote. One good example is the "paid bloggers" phenomenon: these guys maintain high-profile blogs, but sometimes inject paid material. The people who are OK with it form the blogger's audience. Those who shun it just don't listen to the guy. The same is especially true for political preferences, etc.

We can't create the perfectly objective view of reality for everyone, but we can give a person the tools to make her own view of reality consistent.

synctext commented 6 years ago

If-Modified-Since primitive for content discovery makes a lot of sense.

1. I connect to my friend and tell him: "Give me some popular content that
you've got since moment X (our last data exchange)"
2. My friend sends me some fresh metadata (containing both channel torrents
 and regular torrents). I analyze it for new tags and/or Public Keys (PKs).
3.1 If there are some torrents with unknown PK+tag that I have not seen before, I send
another request to the same friend, asking for channel torrent with the corresponding PK+tag.
3.2 If there is a metadata entry containing new channel torrent, I add it to my database.
4. I download/update all new channel torrents and add everything from them to my local database.

The only information you need to download a channel, sample torrents, and check votes is an info_hash. Libtorrent and Flatbuffers can do all the hard work. Every channel community is redundant. Allchannel is redundant. You only need to gossip top-5 most voted upon channels, last seen random 5 channels. When talking about a channel, you provide the "genesis hash and the current-version hash". That's it I believe. Most-minimal-design-probably.

Search community could initiate "metadata insertion promo walks" to facilitate clusterization of metadata. Unsolicited metadata pushes (spam) are mitigated by accepting metadata only from peers with high trust score.

Sounds complex... In the past 12 years that we have been live, we have gotten lots of spam. Features have been removed from production because they were not spam-resilient. Our trust and reputation functions are still very simplistic. The above idea does not sound like something that is ready for testing on forum.tribler.org by the next release deadline of 1 August.

A search query is sent in the form of a simple logical expression based on tags and ranges.

Those are perfect attack vectors for a denial-of-service attack, right?

devos50 commented 6 years ago

We had a pop-up developer meeting where we briefly discussed the basics and future of the existing AllChannel 2.0 mechanism (see https://github.com/Tribler/tribler/pull/3489). This comment briefly describes what we came up with. Main goal: simplicity

First, we make a distinction between a channel producer (creator) and consumer (a user that browses through the channel).

The channel API (for channel consumers):

subscribe(infohash): subscribe to a channel and start downloading its content using libtorrent. This method takes an infohash that contains the channel information (metadata, magnet links etc).
unsubscribe(infohash): unsubscribe from a channel. This will remove the download from the libtorrent engine.
notify_channel_updated(new_infohash): callback when a channel has been updated by a channel producer. This will update the download in the libtorrent engine.
sample_channel(infohash): update the piece priority of a channel. This can be used to implement something like the preview channels like we already have in Tribler.
get_channel_votes(infohash): returns a list of all cast votes on a specific channel.
check_channel_votes(infohash): filter the list, returned by get_channel_votes, and remove invalid (Sybil) votes.

Voting on a channel

Users can cast their votes on a channel by creating a half-block in TrustChain (without any counter signature). Note that this is not technically possible at the moment in TrustChain and such blocks will be marked as invalid by the block validation logic. Votes that are present in the blockchains of users can be collected by the channel producer and included in the torrent content.

ichorid commented 6 years ago

Full Text Search

We will use FTS4/5 from SQLite. To take word morphology into account we will use the Snowball stemmer. We will produce tokens for FTS on our own (because we don't want to develop a wrapper and a binary version of Snowball for SQLite). sqlitefts-python could be used as a wrapper to connect external stemmers to SQLite.

For the first versions of Chant we will still use Porter stemmer embedded in SQLite.
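A minimal sketch of what the embedded Porter stemmer gives us, using only Python's `sqlite3` module. The table name `torrents` and the helper names are made up for this example, and the fallback assumes that at least one of FTS5 or FTS4 is compiled into the local SQLite build.

```python
import sqlite3


def build_index(titles):
    """Create an in-memory full-text index that uses SQLite's Porter stemmer."""
    db = sqlite3.connect(":memory:")
    try:
        db.execute(
            "CREATE VIRTUAL TABLE torrents USING fts5(title, tokenize='porter')")
    except sqlite3.OperationalError:
        # Fall back to FTS4 when this SQLite build lacks FTS5.
        db.execute(
            "CREATE VIRTUAL TABLE torrents USING fts4(title, tokenize=porter)")
    db.executemany("INSERT INTO torrents(title) VALUES (?)",
                   [(t,) for t in titles])
    return db


def search(db, query):
    rows = db.execute(
        "SELECT title FROM torrents WHERE torrents MATCH ?", (query,))
    return [title for (title,) in rows]
```

With stemming enabled, a query for "article" also finds titles that contain "articles", because both words reduce to the same stem.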

Tags and title terms share the same search space. When added to the local database, the terms obtained from stemming the words of the torrent metadata title are checked for duplicates with the tags.

synctext commented 6 years ago

Significant overlap with nearly finished code https://github.com/Tribler/tribler/pull/3660

ichorid commented 6 years ago

Tagging system

Fighting tags spam.

"Describers" produce more value for the emergent semantics of a folksonomy, than strict "categorizers".

The Complex Dynamics of Collaborative Tagging

synctext commented 6 years ago

Mental note... This sprint aims to make a big leap forward for metadata, similar to how IPv8 collapsed the code base dramatically. By trying out a new approach for metadata, we make a solid step forward. Sprint deadline is 1 August 2018, functional release on forum.tribler.org

ichorid commented 6 years ago

Nested channels

Channels form a hierarchy, like a directory tree structure. At the top level, there is no common root (the Forest of Trees structure).
Each channel is tied to a public key (PK), which functions as a permanent ID (and could be used, for example, for voting on TrustChain or distributing updates a la BEP46).
Each metadata entry (MD) has a single parent field that points to the MD's parent channel in the hierarchy (by providing the parent's PK).
A leaf MD's parent is always equal to the MD creator's PK.
If the user gets a new MD entry with previously unseen parent PK, Chant automatically fetches the parent channel's MD from the same source that was used to obtain the child's MD. Then, Chant checks if that channel really has the original MD entry. If there is none, the source is considered untrustworthy.
The process continues recursively until it meets the top-level channel MD ("the Root of the Tree in the Forest").
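The recursive walk to the root could look roughly like this. It is a sketch over plain dictionaries that stand for data fetched from the peer that sent the entry. `verify_chain`, `parent_of`, `contents_of` and the depth limit are all assumptions made for illustration, not part of Chant.

```python
def verify_chain(entry_id, parent_of, contents_of, max_depth=32):
    """Walk from an entry up to its top-level channel, checking every link.

    parent_of:   dict mapping an entry or channel ID to its parent channel PK
    contents_of: dict mapping a channel PK to the set of IDs it contains
    Returns the channel PKs from the immediate parent up to the root.
    """
    chain = []
    current = entry_id
    for _ in range(max_depth):
        parent = parent_of.get(current)
        if parent is None:  # reached a top-level channel
            return chain
        if current not in contents_of.get(parent, set()):
            raise ValueError(
                "untrustworthy source: %r does not contain %r" % (parent, current))
        chain.append(parent)
        current = parent
    raise ValueError("parent chain is too deep or cyclic")
```

The depth limit is an extra guard that the text does not mention. It keeps a malicious peer from feeding the walker an endless or circular chain.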

Notes on security:

The recursive process of going all the way up cannot be abused, because Chant always asks the sender of the original MD entry for the whole chain of parent channel torrents, and checks for correctness at each level. Therefore, it cannot be used for DDoS or amplification attacks.
The user won't see the association of a child channel to the parent channel until the parent channel is verified to really contain (approve) the child. This protects higher-level channels from unsolicited associations.

qstokkink commented 6 years ago

For reference, this also overlaps: #3484.

synctext commented 6 years ago

As the design is becoming more clear, I propose investing some time in a dataset with 1 million real items and performance analysis. Build scientific proof for the Tribler stack, credibility, and good quality assurance infrastructure.

build upon the completed foundations 3 commits of Quinten
have a list of 1 million magnet links
operational code for turning those links into a Libtorrent swarm
performance analysis of adding 1 magnet link to this existing swarm and re-seeding it (Jenkins job)
implement the Channel API as proposed above
determine how 1-million-item channel published by one computer is downloaded by another computer (end-to-end performance experiment)

Open Science is a nice candidate to use as a dataset. For arXiv.org : Total number of submissions shown in graph as of June 15th, 2018 (after 26.9 years) = 1,403,151. Outdated stats for Amazon S3 download: The complete set of PDFs is about 270GB, source files about 190GB, and we make about 40GB of additions/updates each month (2012-02).

The overengineering approach: "CORE’s aim is to aggregate Open Access content from across the world. We provide seamless access to over 30 million metadata records and 3 million full-text items accessible through a website, API and downloadable Data Sets."

ichorid commented 6 years ago

Gossip DB schema

When the gossip is deserialized from a file or a network message, it goes through several checks:

Filter by trust based on PK. Chant compares its author's PK to the list of known good/bad PKs.
Check for duplicate gossip stored in the local database, based on gossip's signature.
Check for integrity based on PK and signature.

If the gossip passes these checks, it is processed and added to the local database. The database schema mimics the gossip format, but introduces some additional fields, like the addition moment timestamp and stemmed index.
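The three checks can be chained in the order listed. The sketch below assumes a gossip is a dict with `public_key`, `signature` and `payload` keys and that the caller supplies the signature verification function. None of these names come from the actual code.

```python
def accept_gossip(gossip, bad_keys, known_signatures, verify):
    """Return True if the gossip passes all three checks and should be stored.

    gossip: dict with 'public_key', 'signature' and 'payload' (assumed shape)
    verify: callable(public_key, payload, signature) -> bool, caller-supplied
    """
    if gossip["public_key"] in bad_keys:          # 1. trust filter by PK
        return False
    if gossip["signature"] in known_signatures:   # 2. duplicate check
        return False
    if not verify(gossip["public_key"], gossip["payload"],
                  gossip["signature"]):           # 3. integrity check
        return False
    known_signatures.add(gossip["signature"])
    return True
```

In this ordering the two cheap set lookups run before the comparatively expensive signature verification, so known-bad and duplicate gossip is dropped early.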

qstokkink commented 6 years ago

A bit of a memory dump; this is the old way of checking for duplicates:

Store only correctly signed messages by mid and global_time.
Drop incoming messages with an already-known mid + global_time.

The system worked like this, so that you could publish the same message twice. This is useful for messages which are not stored long term.

That said, I don't think we need to support publishing the same message twice anymore and therefore your gossip design should work.

ichorid commented 6 years ago

Stellar Core docs and some quick hands-on experience got me sold on XDR. It is infinitely simpler and allows the programmer to do almost C-like pointer tricks, which is important for crypto signing, etc. On the downside, it has no automatic schema version control. But our schema is very simple, so it will be OK.

devos50 commented 6 years ago

Note: If we decide to go with the existing code in #3489 and store all infohashes on TrustChain, it probably allows someone to publish a malicious infohash on their chain (maybe with questionable content) which I will start to download automatically. How can we address this?

qstokkink commented 6 years ago

@devos50 my best idea for solving this at the time was to make the swarm private using BEP44.

ichorid commented 6 years ago

@devos50 , in my design we never store infohashes on TrustChain, we merely post there a signature of the introduced metadata element. Torrents don't belong to anyone, as is information... And no one would ever download any torrent automatically without checking the author's trust rating. And the only thing that is downloaded automatically is a small piece of parent channel torrent, and only in the case the child torrent/channel was not signed by the same PK. See https://github.com/Tribler/tribler/issues/3615#issuecomment-397359742

And by the way, even in @qstokkink's design you could store hashes of the hashes on TrustChain, just for proof of authenticity, and you can't download that :wink:.

ichorid commented 6 years ago

In some sense, I implement the basic idea of BEP44 in this community.

ichorid commented 6 years ago

Posted a "Technology Preview" of the ORM-XDR-Libtorrent binding in the form of a pull request https://github.com/Tribler/tribler/pull/3695

devos50 commented 6 years ago

Please explore whether we can exclusively use TrustChain to store/gossip metadata.

ichorid commented 6 years ago

Lightnet and Darknet

Lightnet

Imagine a vast, sunlit field. There are farmers. Some grow carrots; some grow cabbages; some grow tomatoes. Different vegetables require different growing conditions, so people look around and group according to their preference. And some people don't grow anything, but still want to get, say, both tomatoes and cabbages. These people then take their place between tomato and cabbage people, for convenience. That is the model of a Lightnet. Everyone knows what their buddies are up to. This model is very successfully exploited by the "vk.com" social network and Twitter. (Do note, however, that Facebook is not entirely a Lightnet, since the corporation forms your feed according to its algorithms. It's as if the farmers in our metaphor opted to give all of their produce to some central cartel, which would set up shops that mix and match the food on the shelves according to its whim.)

Darknet

Imagine now, that some of our farmers produce or consume some very smelly fruit (like durian). And producing this fruit is frowned upon by some of their fellow farmers. To stop being shunned, durian-producers would have to move their farms to the underground caves under the field. Their main concern would be hiding their true identities, instead of showing them. But durian is a tasty fruit, so some people of the light-world would still like to taste it... And they will be willing to pay a higher price for staying anonymous. The only way to stay anonymous is to mix with the crowd. There is no anonymity if one could be distinguished from the others in any way. So, durian-producers won't even talk about their regular-fruit preferences during their anonymous meetings in the dark, from fear of their true identities being revealed. And then they create a network of underground tunnels that belong to no one, leading to no particular farm, and, what is much more important, hiding the things that move through them from curious eyes. But to be able to sell their fruit to the up-dwellers they still have to have some help from outside. And to maintain their anonymity while still being able to communicate and make collective decisions, they have to employ complex security and consensus protocols. Thus, the Darknet protocol is born. Best examples are TOR, Tribler, Dash.

Conclusion

The point of this little essay is to show that you can't combine Lightnet features (trust, showing individual tastes) with Darknet features (distrust, hiding individual tastes) in a single network. The only way it is possible to combine these is by establishing a trusted 3rd party (e.g., Facebook).

Tribler consequences

Pub/sub primitives form the Lightnet, while hidden tunnels form the Darknet. Our Tribler client should by default provide a "split personality disorder" feature: taste-based pub/sub for Lightnet with one personality, hidden tunnels with another. There should be no interference between these personalities at all. Like, if I download something through my Darknet pubkey, it should never appear seeded, or pub/subbed through Lightnet, and vice-versa. Any violation of this rule instantly exposes the network to both correlation attacks and legal consequences.

One way to get Darknet closer to Lightnet in functionality is to make the social network graph bipartite: the first type of nodes being users, second type - channels. To maintain anonymity, the user client will have to create a new identity to connect to each new channel. However, this still misses the most powerful aspect of a social network - personal relationships.

qstokkink commented 6 years ago

@ichorid If I understand correctly, long story short: anonymous payouts don't/won't work?

ichorid commented 6 years ago

@qstokkink, anonymous payouts with Trustchain work only while these don't leave the Darknet. To get to the Lightnet, they should pass through some kind of "mixer". And this should definitely use some consensus algorithm.

synctext commented 6 years ago

Focus on raw-performance and simplicity
Most recent specs, link above
- focus on Search/AllChannel/Channel
- Converted to XDR religion https://docs.python.org/2/library/xdrlib.html (like BEncode, but it has no field names, or named dictionaries) Semantic-free...
- PR getting in shape
ORM used is Pony ORM
- direct link to main Dev https://github.com/kozlovsky
- Code tested for 1 day on Twisted
Tribler/community/chant/orm.py please use functional names and don't repeat Dispersy mistakes. Names should make sense within 1500ms :-)
Metadata is very generic, please consider using NeighborQuery.
Next tasks to consider: LibtorrentWrapper (note: very invasive global changes) and startup latency

synctext commented 6 years ago

Please start making parts of the code stable. For instance, seeding a Libtorrent-based "GiGaChannel". Perhaps make PonyORM another PR.

As stated above, we need a lot of magnet links for testing. Ideally 1 million magnet links without any legal issues in any jurisdiction. We want the science angle, for instance, donate your bandwidth to science. Credit mining for the "Knowledge of Humanity". Here are 9250 .PDF articles in permissive CC 4.0 license. We can republish them by seeding from any seedbox: https://journals.aps.org/search/results?sort=recent&per_page=20&category=openaccess

ichorid commented 6 years ago

Turns out the LevelDB-backed storage for collected torrents already uses the name metadata_store in config files. In the future, we will rename it to the torrent_cache component to better reflect its nature and pave the way for the new MetadataStore module serving Channels 2.0. For now, as a temporary measure, the config options for the new MetadataStore module will instead use the chant_ prefix.

ichorid commented 6 years ago

Some performance numbers for GigaChannel based on https://github.com/Tribler/tribler/pull/3923 with 1M entries:

Time to create and dump to SQLite: 450 seconds.
DB size on disk: 480 MBytes
Time to commit 1M entries to a channel torrent: 644 seconds.
Channel torrent size on disk: 204 MBytes

It is important to note that Pony only writes to the DB when the db_session lock is released; until then, every object is kept in main memory. 1k Pony objects take about 3 MBytes of memory. Therefore, it is important to process entries in batches of up to 1k to keep the memory load low and maintain responsiveness. Increasing the batch size does not seem to affect SQL database write performance. The default page size for SQLite is 1024 bytes, and the default page cache size is 2000 pages. This means that query results should fit into 2 MBytes to be fast (1 MByte for the Android version). So, to make the code able to efficiently commit more than 1k entries per commit, we'll probably have to modify the commit procedure to account for this limitation.

UPDATE: it takes 300 seconds to read 1M entries from a channel torrent and put them into SQLite, or 200 seconds with the uniqueness check disabled. Max memory usage is 3.5 GB. Ouch.

UPDATE 2: Limiting the db_session scope to a single .mdblob increases channel torrent directory processing time by 10%, but max memory usage drops to 50 MB. Added this optimization in https://github.com/Tribler/tribler/commit/ed8636a49afdb9599cddf3a5ac43d3597b23f87f
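
The batching advice above can be sketched in a few lines. This is only an illustration, not the code from the PR: it uses Python's standard `sqlite3` module instead of Pony ORM, and the `entries` table, the `store_entries` helper and the generated data are all hypothetical. The point is that each batch is committed before the next one is built, so only about one batch of objects is held in memory at a time.

```python
import sqlite3
from itertools import islice

BATCH_SIZE = 1000  # "batches up to 1k", as suggested in the comment above


def batched(iterable, size):
    """Yield lists of at most `size` items taken from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def store_entries(conn, entries, batch_size=BATCH_SIZE):
    """Insert (infohash, title) pairs, committing once per batch.

    Returns the number of commits performed.
    """
    commits = 0
    for batch in batched(entries, batch_size):
        conn.executemany(
            "INSERT INTO entries (infohash, title) VALUES (?, ?)", batch
        )
        conn.commit()  # flush this batch before the next one is built
        commits += 1
    return commits


# Hypothetical schema and generated data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (infohash TEXT PRIMARY KEY, title TEXT)")
fake_entries = ((f"{i:040x}", f"Entry {i}") for i in range(2500))
commit_count = store_entries(conn, fake_entries)
row_count = conn.execute("SELECT COUNT(*) FROM entries").fetchone()[0]
```

Because `fake_entries` is a generator, the full input is never materialized; with Pony the equivalent would be to open and close one `db_session` per batch (or per .mdblob, as in UPDATE 2).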

ichorid commented 5 years ago

Some thoughts on GigaChannel design:

- You meet a stranger. He offers you gold and tells you: "Maybe you cannot trust me, but you can trust this gold". That is the reason why we sign each metadata message individually - eventually, you can create a rated list of trusted PKs ("gold") and safely get updates signed by those PKs even from complete strangers.
- Entry identity means that no more than one thing with a given ID can exist simultaneously. Each channel entry has a numeric ID equal to its initial TrustChain pointer. The combination of PK+ID uniquely identifies channel entries. Over time, a new version of an entry could be provided by its author. For example, for a TV show: "Series Foo, ep1-5" with infohash X can become "Series Foo, ep1-6" with infohash Y. From GigaChannel's point of view, this is still the same thing. Update logic: "if I receive an entry that has the same ID as something I already have, I will replace the old one if the new one's TrustChain pointer is greater than that of the old one."
- Nested channels structure exists as a set of relationships in the database. A special metadata type can be added to a channel to point to its parent channel ("ext2 .. inode style").
- In the database, entries are linked and managed Python-object style: link counts, references to parent and children, garbage collection. This makes it possible to deduplicate metadata entries and keep them up to date at the same time.
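
The update rule from the entry-identity item above can be sketched as follows. This is a minimal in-memory illustration, not GigaChannel code: `ChannelEntry`, `EntryStore` and their field names are hypothetical, and signature verification (which would have to happen before `receive`) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelEntry:
    public_key: bytes  # author's PK
    entry_id: int      # equals the entry's initial TrustChain pointer
    pointer: int       # TrustChain pointer of this particular version
    title: str
    infohash: bytes


class EntryStore:
    """In-memory stand-in for the metadata database (hypothetical)."""

    def __init__(self):
        self._entries = {}  # (public_key, entry_id) -> ChannelEntry

    def receive(self, entry):
        """Apply the update rule; return True if the entry was stored."""
        key = (entry.public_key, entry.entry_id)
        current = self._entries.get(key)
        if current is not None and entry.pointer <= current.pointer:
            return False  # stale or duplicate version: keep what we have
        self._entries[key] = entry
        return True

    def get(self, public_key, entry_id):
        return self._entries.get((public_key, entry_id))


store = EntryStore()
pk = b"author-pk"
v1 = ChannelEntry(pk, 7, 7, "Series Foo, ep1-5", b"X")
v2 = ChannelEntry(pk, 7, 12, "Series Foo, ep1-6", b"Y")
store.receive(v1)  # first version is stored
store.receive(v2)  # newer pointer replaces it
store.receive(v1)  # older version arriving late is ignored
```

Keying on the (PK, ID) pair means two authors can reuse the same numeric ID without colliding, and a late-arriving old version can never overwrite a newer one.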

ichorid commented 5 years ago

The problem of duplicate entries with different signatures (i.e. the same torrent uploaded by different users) is solved by partial deduplication. Whenever a GigaChannel instance receives a sample of a channel containing entries whose infohashes are already known, it does not add those entries to the database; instead, it links the received channel to the infohashes that are already there. However, to preserve the integrity of gossip, this deduplication is never applied to the contents of channels that the user is subscribed to.

Deduplicating subscribed channels remains an open question.
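
A rough sketch of the partial deduplication described above, under several assumptions: the `MetadataDB` class, its dictionaries and the `signature` field are hypothetical, entries are plain dicts, and an unknown infohash from a non-subscribed channel is simply stored and linked.

```python
class MetadataDB:
    """Toy model of partial deduplication (all names are hypothetical)."""

    def __init__(self):
        self.torrents = {}         # infohash -> first entry seen for it
        self.channel_links = {}    # channel PK -> set of linked infohashes
        self.channel_entries = {}  # subscribed channel PK -> {infohash: entry}
        self.subscribed = set()    # PKs of channels the user subscribed to

    def receive_sample(self, channel_pk, entries):
        for entry in entries:
            infohash = entry["infohash"]
            if channel_pk in self.subscribed:
                # Subscribed channel: keep the signed entry verbatim so it
                # can be gossiped onward unchanged (no deduplication).
                self.channel_entries.setdefault(channel_pk, {})[infohash] = entry
                continue
            if infohash not in self.torrents:
                # First time we see this infohash: store the entry itself.
                self.torrents[infohash] = entry
            # Known or new, the channel only gets a link to the shared record.
            self.channel_links.setdefault(channel_pk, set()).add(infohash)


db = MetadataDB()
db.subscribed.add("pk-C")
db.receive_sample("pk-A", [{"infohash": "X", "signature": "sig-A"}])
db.receive_sample("pk-B", [{"infohash": "X", "signature": "sig-B"}])
db.receive_sample("pk-C", [{"infohash": "X", "signature": "sig-C"}])
```

Here channel B ends up linked to the record first received from A, while the subscribed channel C keeps its own signed copy.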

Dmole commented 5 years ago

Just KISS and have 1 html page per channel that the channel owner can implement anything ( including all of the above) with.

ichorid commented 5 years ago

@Dmole, unfortunately, we can't implement that until we get the trust function up and running. Otherwise, the channels will drown in a sea of Sybil-generated trolling content. But yeah, something like a Markdown description was in the plans.

Dmole commented 5 years ago

Tribler already has a 'trust function' to permit only the owner to add content to a channel. Bandwidth tokens have some fundamental flaws, so don't wait on that. Don't use Markdown (it's too limiting); simple and flexible HTML is why the WWW was successful, so go with it.

ichorid commented 5 years ago

@dmole, it's more complicated than that. Say you create a channel, something like "My video channel". You set up a nice-looking HTML page for it. Then someone creates a channel with exactly the same name, but instead of nice-looking HTML it contains some VERY explicit content and/or virus links, etc. Then that bad guy creates millions of copies of that fake channel with different public keys. The result is that your nice-looking channel will literally drown in a sea of this garbage.

Welcome to the world of Sybil attacks! :wink:

Dmole commented 5 years ago

The internet has the same problem and no one cares (as long as pages are accessed via a domain/IP/public-key fingerprint that can't be impersonated). Think .onion.

ichorid commented 5 years ago

That's the problem. We don't have domain names.

Dmole commented 5 years ago

But Tribler does have public-key-fingerprints which are successful replacements for domain names. :)

Obviously it takes some work to implement, but it is better to work in the direction of flexibility instead of having to revisit this yet again.

synctext commented 5 years ago

We are now duplicating all features of Youtube. Plus anonymous streaming as a key target.

Bundling HTML and JavaScript execution would blow up our executable to 200MB+. Perhaps in the future, if we have a lot of channel creators. It also requires that we solve hard security issues. As a university we're not satisfied with recreating the security nightmares of The Web. It's about reinventing The Internet... Instead of HTML+JS we want Python swarms. Maybe we can get that to work. See our IPv8 plugin stuff.

vi commented 5 years ago

... build a distributed replacement for trackers system. It would be very nice if you describe what you think is necessary for that.

Tribler is already a distributed replacement for a tracker system, albeit a limited one that is not feature-rich.

To make it more practical, Tribler needs to address those limitations and add features.

One way I see is supporting customized UIs that mimic the interface of some existing topical torrent tracker, but with Tribler as the backbone. Tribler should offer easy access to a subset of torrents, each carrying some metadata, easily usable by "Tribler apps" (UI plugins).

Some metadata fields ideas:

Language
Video resolution
Part-or-whole status: stand-alone (not series), collection of series, single serie
Description (maybe also comments)
Voting by users

Small fields should be carried like part of a magnet link; bulky metadata, like a description, may be accessible as technical "mini-torrents".

Repeating the crucial point: UI-centric Tribler plugins / apps - like "trackers" inside Tribler. Maintainers of those apps will request more features from backbone. This would prioritise needed features in a natural way.

Dmole commented 5 years ago

Well, you could trim Tribler down to just anonymous torrent transfer, moving tokens, streaming, channels, search, etc. to optional plugins (I would bet most don't use any of that stuff anyway).

I just think curated torrent indexes are the one thing Tribler relies on most, so it would be nice (and not hard) to have a plugin for a distributed one.

Maybe have a poll or add some statistics users can opt out of.

vi commented 5 years ago

trim Tribler down to just anonymous torrent transfer

Maybe. The native Tribler UI (not through some "Tribler app" prism) may be directed at power users and be necessarily more complicated (but enabling you to find any torrent in existence whatsoever). A nice UI may be handled by Tribler Apps, some of which may be pre-installed and available on the startup page, directing non-power users to the proper UI.
