Redesign of the Search/Channels feature

ichorid commented 6 years ago

Use cases.

We strive to provide the same kind of service that is provided by:

Highly organized repositories of torrents (trackers)
Scientific articles repositories.
Music playing services, like Spotify.
Video playing services, like Youtube.
Scanned books archives, like Google Books.
Internet libraries

On fuzzy searches.

We do not try to become a distributed Google. Google is good at finding the knowledge that is relevant to words. It is used when the user does not know what he or she is looking for. Our user always knows what he or she wants. Therefore, we should disable metadata (or torrent contents) search by default, and only search by the torrent name. On the other hand, the user should be able to use metadata to refine the search results. Like, "sort by date of creation" or "sort by tag/channel", etc. This model already works perfectly for all use cases indicated above (except Youtube).

Data structurization.

We must be able to replicate the basic information organization structures that are used by our example use cases:

Tracker forum : tree-like structure, traversable both up and down. When the user finds a torrent/channel, he or she should be able to see its immediate parent/children nodes, and the "root" of the hierarchy. Thus, nested channels.
- Scientific articles database: a single record in the database should be able to hold an arbitrary amount of metadata fields, and each field must be able to be used as a base for a hierarchy. For example, top-level owner/channel "ScientificArticles" could provide "pseudochannels" for "authors" "journals" "years" "tags", etc. These are essentially selecting the metadata field that would be used to build the structure "just in time". Advanced users would be able to select several criteria at once, to find the intersection of metadata sets. Thus nested channels can fully exploit the features of the underlying relational database.
Music search: should be fast and simple. This could be only achieved by implementing a distributed cloud-based search, with some data cached at user's device. We can't get new user do download XX Gigabytes search database, especially for mobile usage (that is basically everything for music playing). One search query should take 5-30 seconds, and it's results should include related info (like, the search for a single song should get that song and reveal the link to the whole album). To solve this problem, at some point we could introduce search mining. Fast hosts producing highly relevant search results get credit.
- Libraries and book archives fall somewhere between scientific articles and tracker forum.

It is important to note that a single instance of the system is not required to serve all of these features at once. Instead, it should provide building blocks, or modules, to compose the use cases as necessary. For an average user a single database could be used to serve every type of content, but, e.g., users interested in building a huge scientific database should be able to set up a separate database instance optimized for scientific articles.

Constraints on design choices

The design should consist of several completely independent parts:

Networking subsystem
- Data transfer protocol
- Trustworthy gossip protocol
- Storage subsystem
- Database/store
- Cache management / mining algorithm
- Search subsystem
- Distributed search protocol
- Distributed search algorithm
- Data distribution algorithm
- Metadata format

It is very important that we do not make any of these parts depend on specific implementation of one another. We must be able to design the system in such a way, that it would be trivial to exchange one implementation of some part for another one.

User experience requirements:

Search results should appear faster than the user can process them.

Implementation requirements:

Scalability (should only sync neighbors, and not the whole network)

vi commented 5 years ago

moving tokens, streaming, channels, search, etc to optional plugins

No. That is part of backbone. All those features should be accessible for plugin authors. Extracting them to plugins introduces tricky dependence system between plugins.

I expect plugins to be:

UI-centric
Requiring designer and community manager skills more than programmer skills
Not dealing woth "complicated things" like keeping distributed system, preventing overuse, anonymity and streaming.

Currently Tribler looks like a thing make by programmers for power users. That change may make Tribler look like a thing made by designers for all users. Some users may just think they are using some weird-ish mirror of existing torrent tracker they know and love without even realizing about distributed networks and tokens.

Dmole commented 5 years ago

Using a web browser as a tribler UI is one option (less user friendly but more flexible and way less code).

Markup in python, like how epub is used (ignoring most tags), is another option (more user friendly, but more code, and likely limiting one day)

Now that I have expressed my view of collecting data, trimming the unused features, then adding the one feature that would make tribler self contained (curated torrent indexes) I'll plant a more radical idea; let peoples library organizing be cumulative by making a user friendly UI for git-annex (already has a torrent backend) run it on the anonymity network, add a trust chain to auto merge contributions, use hard links to support multiple name formats, a branch per uploader... but that's hard to make user friendly (like a 4d cube) and more like a tribler replacement so I don't expect any takers, but I just have to wish for it every few years.

Dmole commented 5 years ago

BTW on the subject of meta search one should be aware of https://github.com/Jackett/Jackett

synctext commented 5 years ago

Some metadata fields ideas: Language Video resolution

Yes, we had lots of students working towards the perfect crowdsourcing system. We really want this. Note I programmed stuff this myself back in the days (like really 20 years ago). Recent student project: implement a generic richt metadata layer with functioning crowsourcing community, abandoned sadly: https://github.com/Tribler/tribler/issues/2455. I can't find students for this at the moment unfortunately.
9 years ago, operational prototype: https://repository.tudelft.nl/islandora/object/uuid%3A52b586ea-6144-4b4e-a5a1-b05255ce493a 8 years ago, operational prototype: https://repository.tudelft.nl/islandora/object/uuid%3A1a54c1e2-5904-463e-92ae-2d8ab14d8378

I'll plant a more radical idea; let peoples library organizing be cumulative by making a user friendly UI for git-annex (already has a torrent backend) run it on the anonymity network, add a trust chain to auto merge contributions, use hard links to support multiple name formats, a branch per uploader...

Indeed, very much more radical :-) In 2011 we had operational prototypes for this, hard to get operational for a few million people. "Referencing within evolving hypertext”, ACM Hypertext 2011, http://www.st.ewi.tudelft.nl/victor/articles/cturl.pdf “Citrea and Swarm: partially ordered op logs in the browser”, http://www.st.ewi.tudelft.nl/victor/polo.pdf

ichorid commented 5 years ago

BEP38 adds "similar torrents" support in the form of infohashes and name-based "collections". This can increase the efficiency of updating channel torrents, especially in the case of nested channels. Libtorrent supports it.

ichorid commented 5 years ago

Also, Libtorrent supports the signature-based DHT entry update system. We can use it to push updates to GigaChannels.

synctext commented 5 years ago

Please don't use the exotic BEP38 or the DHT-as-a-filesystem extensions of Libtorrent. Only use the mainstream features. We're still looking for a volunteer to fix Tribler streaming and Libtorrent piece picking priority mechanism.

ichorid commented 5 years ago

@synctext , why shouldn't we use these features? Because its DHT, or because these are Libtorrent-only features? What's the reason to avoid them?

ichorid commented 4 years ago

Added TODO list for the 7.5 release.

ichorid commented 4 years ago

The real problem of designing an information-sharing system lies in it eventually overwhelming the human user with the sheer amount of content. Human brains evolved to efficiently process spatial information from visual sensors in a highly dynamic environment (i.e., running through the forest), not the static text-based information that is the Channels system. Ultimately, the problem is about managing complexity.

Some approaches to solving the problem:

Google search essentially provides us with the "brain-extension" middleware that plays the same role for processing the text resources of the Web as our visual cortex plays for processing the visual information about the current environment.
Encyclopedias (and catalogs in general) try to make the information search time logarithmic by splitting the info into a structured hierarchy with each level small enough to be processed by the human brain.
"Popular" lists use herding effect among humans to steer the individual towards the most common solution.

Another important point is that human consciousness operates in terms of possible actions. Therefore, the human interface should be interactive (e.g., browsing Pinterest through thematically connected pictures).

If creating autonomous self-replicating botnet for anonymity is about making a human-symbiotic cyber-lifeform that provides anonymity services, the ultimate goal of the GigaChannel subsystem is to create the neocortex extension to aggregate and collectively filter media information.

To reach this goal, the Channels system should become antifragile. Each new piece of data added to the system should make it easier to navigate. In a sense, the system should be constantly "knowing itself" to enhance the user experience.

ichorid commented 4 years ago

The channels system can be viewed as a discrete evolutionary / Bayesian inference process. In this case, user interaction with the channels system plays the role of the fitness test / belief update. The channels hierarchy corresponds to living systems hierarchy, and gossiping of subscribed channels becomes replication.

One important point here is that the user's desires change with time (say, new show episodes appear daily), and serving up-to-date content is crucial for a channel's survival. However, Tribler app is not running 24/7 on a typical PC, so it is hard to do exploration of channels space in search of updated info. Instead, Tribler app at the user PC has to rely on the other nodes' databases to serve fresh content to the user.

ichorid commented 4 years ago

When nested channels will be merged, we can start to account for easiness of finding info in the channels system. Whenever a user downloads a torrent from a channel, the GUI can record the process that led to this decision. Each information element shown to the user (torrent or channel) will add to the information search cost. The toplevel channel that contained the torrent will be bumped according to the search cost (the more the cost, the less the bump). This internal rating ("local fitness") will then be used to prioritize gossiping the subscribed channels around.

ichorid commented 4 years ago

(source)

This means that there are two classes of networks employed in file-sharing: 1st type is Zipf, static content, structure networks (e.g. BitTorrent tracker forums); 2nd type is "top-flattened" Zipf, dynamic content, content class networks (e.g. popular torrents of some category). Both are necessary for efficient file-sharing. 1st type works as a navigational meta-network for the 2nd type.

ichorid commented 4 years ago

Channels structure/network should be independent of the actual physical peers/users network. Channels fight for "attention space" of the user.

The channels system consists of 3 network layers:

Top layer: top-level channels network, organized as a flat set. Ranked with temporal PageRank (e.g. VSIDS) or another random-walk heuristic.
Channel structure layer: mid-level individual channel structure. The "skeleton" of the channel, organized as a tree. Almost never changes, cached locally.
Bottom layer: a stream or a set of individual content entries (e.g. torrents). Frequently updated, cached only if subscribed.

Each layer is served by a specific kind of overlay network. Top level features unbiased random walk - based popularity ranking. Mid level is served by per-channel communities emphasizing caching and low latency updates. The bottom level is served by per-category subscriptions and serves low-latency updates too, but also features collective editing.

ichorid commented 4 years ago

Structure information is atomic. A message describing only a part of an update of a structure is meaningless and even harmful, as it breaks correctness of the structure. Therefore, structure update messages should be applied as a single transaction.

The domain of a subscription is its public key. If I subscribe to a channel, I automatically become subscribed to all sub-channels of this channel that belong to the same public key. If the subscribed channel includes sub-channels by other publishers, these must be subscribed individually.

ichorid commented 4 years ago

Currently, we include the pointer to an object's parent in the object itself. This is very efficient and convenient for atomic updates of channels' contents. However, to support pointers and subchannels, we must switch to using adjacency lists. However, that moves us from the land of trees with a single parent to the land of DAGs, meaning that recursive searches become tricky (especially in regards to building acceleration structures, such as material paths or transitive closures).

ichorid commented 4 years ago

Splitting the metadata exchange in this 3 levels means we don't really need neither libtorrent, nor set-reconciliation as backends for the Channels system:

at the top level, we do simple random walks. Repeatedly receiving the same channel sample works as a vote for the VSIDS-based rating system
at the middle level, the structure is expressed in a serialized stream of messages. It rarely changes and if it does, it is typically some small changes, and these are just a stream of events.
at the bottom level, the stream of torrents is completely serialized. For each leaf category, the number of torrents is small enough that libtorrent could be only employed to download the bulk of it once. After that, updates can come as a serialized stream w/o need for any reconciliation.

Channels in v3 will become stored as a stream of messages with contiguous timestamps. As every message is signed individually by the stream's original author, this allows for very efficient download of updates simultaneously from multiple peers w/o need for complex "set reconciliation" algorithms. Essentially, this protocol describes the transfer of data from its original author (the peer that signed the messages) to any willing peer over any kind of media.

qstokkink commented 4 years ago

Do note that IPv8 has been created (and functions right now) as a free-wire-format low-community-count overlay toolkit/framework.

Your mid-level use-case (as well as TrustChain witnesses and the hidden services' PEX community) would need a fixed-wire-format high-community-count architecture.

I'll work on this as I retool IPv8 (in light of the IPv6 changes https://github.com/Tribler/py-ipv8/issues/615), but performance will not be optimal for this kind of usage until then.

Final note: IPv8 is used for control messages (bursty) and is therefore incompatible with high-throughput applications. If you have a need for high throughput, try to use libtorrent as much as possible.

ichorid commented 4 years ago

To facilitate fast bootstraps, we can explore the network for new entries via an "exploration-exploitation" algorithm that uses a running average of non-duplicated received entries to regulate the speed of the search (the number of requests in-flight) and the domain of the search (request more samples from the same peer again vs go to another peer).

At the responding peer, usage of SQL statements like ORDER BY RANDOM can result in DDOS. We can avoid this problem by only allowing it in certain kinds of requests, caching the whole query w/o random selection aggressively and then selecting random entries in Python code. This is fine as the Channels' bootstrap process is stochastic. We do not allow bad things like COUNT there either. The actual query packet format should be very limited and is not based on SQL at all.

For the future channel-level communities, we do not need COUNT and RANDOM -like queries either, as this kind of info already exists in toplevel channel entries.

ichorid commented 4 years ago

"Secure Scuttlebutt" (SSB) - basically, the implementation of layers 2 and 3 of the above scheme.

ichorid commented 4 years ago

Currently, Tribler Channels system sends channels data via concatenated lists of serialized and compressed metadata entries. While very efficient on the wire (100 Mb torrent containing 500 000+ entries), it is very inefficient when uploading to the local SQLite database. In fact, it is so inefficient that on an average PC processing 100 000 entries can take a few hours. There are two reasons for this inefficiency:

we're uploading entries in small batches to keep the reactor responsive. This results in lots of small transactions, that, while safe, lead to horrible write amplification.
adding an entry locally requires updating integrating data structures such as indexes and helper tables (e.g. torrent checker stats table).

Problem (1) can be solved, at least partially, by shortcutting all the data movement inside SQLite itself and implementing blocking offline updates with GUI notification. Problem (2) reflects the ultimate question of data processing: who integrates the info? And who serves the results? If we rely on the channel creator/owner to maintain the integration, the user will still have to create the helper structures like infohash index for torrent checker and integrate the search results from different channels on their own, at each request. With every added torrent it becomes more costly to integrate more data into the database. Also, storing every possible torrent locally is like buying every item in a shop "just in case": an average user will never need even 0.1% of info served by a big torrent tracker, save for the whole network.

ichorid commented 4 years ago

To solve the problem of data storage and integration the Channels system must work smarter, not harder (thanks, Edmond Lau!). In particular, a peer should rely on other peers in its vicinity for storing and (at least partially) integrating the info. Every peer should store some part of the network's total info, e.g. some medium-sized thematic channels. Then, whenever one connects to the Channels network they will seek the neighbourhood with maximum diversity of interests to cover as much of the channels' search space as possible. This will enable the Channels overlay to serve search queries and channel's contents on-demand from the neighbourhood, rather than rely on the local database.

In this case, the efficiency of bulk transfer mechanisms become irrelevant, as there will be no bulk transfer between users.

Also, ontology-style channels providing structures consisting of other channels become even more efficient, as the leaf channels will be served on-demand.

vi commented 4 years ago

we're uploading entries in small batches to keep the reactor responsive.

Maybe size of such batches can be dynamic and dependent on whether GUI is active or minimized in backgroud? If user is actively interacting with channels and search and downloads, aim for 1-second batches. If Tribler window is hidden (or user just watches a video in it), aim for 20-second batches.

ichorid commented 4 years ago

Maybe size of such batches can be dynamic and dependent on whether GUI is active or minimized in backgroud? If user is actively interacting with channels and search and downloads, aim for 1-second batches. If Tribler window is hidden (or user just watches a video in it), aim for 20-second batches.

Tribler already uses dynamic batch size with an algorithm that adjusts batch sizes to occupy the reactor for approx. 100ms. The problem is, while you can interrupt SQLite transaction from Python , even with 1-minute batches the thing takes hours due to all the SQLite-Python data copying (tried that). Also, taking hold of the database for more than a couple of seconds will require creating a separate async-enabled priority queue of database requests, because of other components of Tribler that use the database implicitly, e.g. TorrentChecker and GigaChannel Community.

Having said all that, this can be another way to alleviate the problem. Thanks for the suggestion, @vi !

ichorid commented 4 years ago

The Web is not just an indexed collection of books. Wikipedia is not just an indexed and extended list of articles from Encyclopedia Britannica. Spotify is not just a music catalogue from your local music store. Forum-based torrent tracker is not just a list of files from users' disks. Facebook is not just a global telephone book with pictures. Instagram is not just a collection of users' photos.

Every successful disruptive innovation never just copies the old thing and puts it onto a better media. It breaks the thing to its most basic parts and then assembles it differently. Like a predator organism chewing down on a prey's flesh to break it down to basic amino acids and then integrate it into its own.

Channels system should not copy torrent trackers. It should provide an alternative way of organizing torrents. A way that is compatible and shaped by its medium: a decentralized system of small and medium machines.

Dmole commented 4 years ago

takes hours due to all the SQLite-Python

Incase you don't know already; "begin transaction;" significantly(3x+) speeds up SQLite.

ichorid commented 4 years ago

@Dmole , we use the wonderful PonyORM for all DB access in channels. It starts transactions automatically. Thanks for suggestion anyways.

Dmole commented 4 years ago

As long as it's not doing more than one SQL transaction per user click, or minute, or %GB. (a transaction per insert is just as slow)

ichorid commented 4 years ago

Decentralized recomendations system and gossip protocol. Seems very close to what we're trying to achive here.

ichorid commented 4 years ago

Different kinds of services require different kinds of architecture to be competitive. For instance, a tracker forum should focus on fast updates and dynamic content like collectively maintained "popular" torrents list and torrent health states. A scientific articles database should focus on index completeness and metadata search. So, tracker forum is best served by direct sending of individual data fragments, while the articles database is better served by sharing an already indexed database through libtorrent.

ichorid commented 4 years ago

Channels UI must impose rules that will force users to build channels that fit best to the Channels architecture. I.e. use nested folders with not-too-many entries, etc.

synctext commented 4 years ago

Current design issues:

Frequent changes of root hash, commit on each change, fresh commit makes a new files. Results in lots of small files
Downloading in bulk is fast using Libtorrent. Parsing and inserting in SQLite is the new bottleneck, may take a full day. Possible fix: https://www.sqlite.org/lang_attach.html Other idea: heatmap and gradual download of each channel you use. Partial seeding, gradual download principle. 50% random 50% content preference. 1MByte mdblocks?
Gigachannel overlay is functional. Would not scale to Youtube-sized traffic in which each of countless channels pushes their updates. Possible fix: only do daily updates of infohash to Gigachannel, create dedicated community for each channel.

2021 requirements:

First be superior to 1 use-case in terms of features-completness and stability; Triblerd; fix GUI later.
sharp single-minded focus: True GigaChannels with 1 billion magnet links #3971.
open-access archive for 1,706,172 scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
Branch out later into a generic system
Assume large deployment for future. The 10 Biggest YouTube Channels 2018:
- PewDiePie. Subscribers: 66.1 million.
- Dude Perfect. Subscribers: 34.54 million.
- HolaSoyGerman. Subscribers: 34.48 million.

ichorid commented 4 years ago

Channels 3.0: NanoChannels

NanoChannels will use the NoodlesNew overlay by @grimadas as the backend. The architecture fits perfectly: low-latency, low bandwidth, CRDTs, no consistency guarantees.

NoodlesNew overlay is used as a transport backend for updating channels data, replacing Libtorrent in this role. MetadataStore SQL schema will be changed accordingly.

The most important feature of NanoChannel is the ability to subscribe to subchannels. The user will get updates to all the subchannels he subscribed to, and updates to the structural changes of the parent channel. NoodlesNew will be subclassed to handle nested subscription in a smart way. If there is a channels tree A->B->[C1,C2], and the user is subscribed to C1, the overlay will connect and expect the updates from other peers who are subscribed to A, B, or C1.

SQL acceleration structures for tree hierarchies is used to facilitate fast searches into subchannels both locally and remotely. Remote search is served by the already existing RemoteQueryCommunity overlay. New users are able to use the system immediately, as RemoteQueryCommunity serves channels structure data "on-demand".

synctext commented 4 years ago

The most important feature of NanoChannel is the ability to subscribe to subchannels.

This is not something users have asked for or will understand. They are used to Youtube, you play something and it starts streaming instantly. We want to get rid of the need to managing a library and local harddisk space. We can't ask our users to manage subchannel subscriptions. We are facilitation existing social media practices, not trying to pioneer new ones with our small userbase.

to explain further the credit mining and other essentials; Everything should be automated in a few years. Especially seeding and contributing to the health of the network without doing anything. Youtube is the future of television, understood best by seeing an 8 year old professionally browse through vulcano or dinosaur videos (some of those 8-year olds have 6 years of social media experience from tablet-based viewing). In coming years the goal is to have that level of usage and automation inside Kodi TriblerD.

Parsing and inserting in SQLite is the new bottleneck, may take a full day.

Please create some performance graphs of this for deeper insight. From first MByte insert time to 1 million magnets. By 1st of July I hope that the small .lz4 file problem is fixed and the focus can shift to strong type checking, etc.

ichorid commented 4 years ago

Youtube is the future of television, understood best by seeing an 8 year old professionally browse through vulcano or dinosaur videos

That's the main problem of YouTube and its brethren. In a relentless pursuit to make people watch they treat all of us as 8-year kids who can't and should not decide for themselves. Ultimately, BigTech decides what you and your kids watch. I thought our mission was to change that.

Though, I completely agree that everything should be automanaged in the backend. The ability to subscribe to a topic is probably the last act of free will that is left to a person in this world of "~~daddy~~ YouTube knows better".

synctext commented 4 years ago

Brainstorm today: we keep Libtorrent for this year, till X-Mas no big changes. Minimal Tribler 7.6 improvement of channels: add filename and 10 demonstration channels of Arxiv.

https://arxiv.org/help/stats/2019_by_area/index#math_yearly. Minimal viable product: every day a single torrent gets created for each of the 10 unique science areas on Arxiv. 10 Channels to replace the whole of Arxiv. Bundle all science papers into a daily or static bundle of 10k .PDF files. Static files of published papers. Also seed these big static swarms. Filename format suggestions _title-of-paper_authornames (truncated to 1K for UDP packet size restrictions)

ichorid commented 4 years ago

Channels design choices

Distribution

Method

	:+1: Upsides	:-1: Drawbacks
Torrents	:rocket: Fast for bulk transfer :mechanical_arm: Reliable :rainbow_flag: Arbitrary content type :car: Congestion control :traffic_light: A CDN in a guise	:clock1130: Very high latency (DHT, etc) :broken_heart: Updates fragment network (infohash change) :dizzy_face: Horrible to integrate (two reactors, etc.)
UDP	:rocket: Perfect latency :rabbit: Efficient for small updates :+1: Simple to integrate	:bomb: No congestion control :boom: No transmission integrity control :shrug: Not a CDN

Format

	:+1: Upsides	:-1: Drawbacks
Bunch of files	:bulb: Simple :rabbit: Compact :mechanical_arm: Robust	* :tired_face: Manual integrity control
SQLite DB	:electric_plug: Interfaces directly with existing DBs :rocket: Indexes :zap: Can be queried without processing

Strategy

	:+1: Upsides	:-1: Drawbacks
Append only	:bulb: Easy and safe management :rabbit: Efficient data transfer	:raccoon: Garbage collection :snail: Requires processing before use
State transfer	:zap: Instantly usable	:boom: Horribly inefficient data transfer :dizzy_face: Tricky management

Storage method

	:+1: Upsides	:-1: Drawbacks
Single DB	:bulb: Easy management	:snail: New data must be integrated before use (i.e. indexed)
Multiple DB	:zap: Instantly usable	:dizzy_face: Horrible management (threads/processes/DBs/locks) :boom: Manual data integration on each query (slow, complex)

Combinations

Our current design is described as: "Torrent" + "Bunch of files" + "Append only" + "Single DB"

Dispersy design was: "UDP" + "Append only" + "Single DB"

7.6 proposed design is: "Torrent" + "SQLite DB" + "Append only" + "Single DB"

E.g. @qstokkink 's original design was: "Torrent" + "Bunch of file" + "State transfer" + "Single DB"

At some point, @synctext proposed the following design to facilitate instant channel availability: "Torrent" + "SQLite DB" + "State transfer" + "Multi DB"

Formally, there are 32 combinations of these design choices. However, some design combinations are forbidden or meaningless, because, for instance, one can't use "SQLite DB transport format" with "UDP" transfer.

ichorid commented 4 years ago

Tree

MindMup FYI

EDIT: @kozlovsky informed us that moving full DB's is no option because of security issues.

synctext commented 4 years ago

That is a great write-up of the design space. very helpful. Please always post the results of exploratory performance test. Even small tests and their results are helpful for a data-driven methodology. User-requirements for the Youtube generation with short attention span:

You only get 10 seconds to prove yourself (users are impatient and un-install quickly)
Display the names of the popular channels within those 10 seconds
Sort by popularity by default and newly discovered channels are inserted at the correct line
Subscribing to a channel instantly gives GUI feedback (both in current screen and download of channel screen)
progress bar of channel download
Download of channel is real-time and content is displayed on the screen. Each MByte downloaded on disk is both parsed, inserted and displayed in real-time. Replace the existing full-download-then-insert approach. Stall the download until the block is processed and show a progress bar (number of torrents inserted).
Users do not understand that you can click on hearts and it fetches more content. Something that needs clicking needs to be a button.
Please prioritise these these issue and put those in the release, before fixing the "SQLite record insert performance issue". That is for database/ORM experts to solve.

We are now in major-pull-request freeze. Small incremental changes only for coming 12 months please, while we close issue 1 (anon,freeriding,content).

synctext commented 4 years ago

@ichorid Please make a roadmap / ToDo list for you for the end of this year.

synctext commented 4 years ago

please make a roadmap. Include "content graveyard mechanism" on there or we rely on Sandip PopularityCommunity.

ichorid commented 3 years ago

A roadmap of performance-enhancing changes for 7.6.

Problem	Current state	Proposed change
Channel commit process	ORM queries	Recursive SQL queries
Channel transfer format	Concatenated compressed mdblobs	SQLite database (most of the data) + compressed mdblobs (small changes)
Method of DB access	Mostly reactor thread with some costly operations offloaded to the threadpool	Priority queue for operations on an external thread, separating blocking and non-blocking operations.
Database blocking	Congestion control on the main thread - breaking costly operations into smaller parts	Blocking database for write access and indicating it to the user through a progress bar in the GUI
Adding big databases	Direct processing	Swap the database in case the added DB is bigger than the the user's DB, then add the user info back (requires discussion)
Database schema	Single table	Multiple tables, signed blobs stored in a separate table (so no need to re-serialize)

Applying all this stuff should result in Channels commit and processing performance 10-20 times. The downside is the DB is going to take 2 times more disk space than it takes now due to all the added indexes, etc. Also, DB management code is going to become much more complex and less reliable.

A roadmap of Channels features for 7.6

Sub-torrent contents support (scientific papers, music tracks)
Recursive subchannel search. Requires maintaining
a material path or other recursive indexing methods.
GUI: paginated channels view instead of infinite scroll
GUI: "New arrivals" feed (with infinite scroll)
GUI: automated channel preview downloading
GUI: slight changes to channels table layout (e.g. add a separate "Download" and "Subscribe" buttons)
Garbage collector for old and unmanaged torrents/channels

ichorid commented 3 years ago

A short estimate of the total data size that we want to ultimately host, based on current world stats:

130M books
40M-400M torrents (e.g. 20 biggest trackers hosting about 2M entries each)
5M English Wikipedia articles
80M scientific articles
50M Spotify songs

In total, that's no more than 1G entries. At the inception of the Channels system, we may expect about 10M entries (most popular torrent trackers content).

Our userbase is currently about 20K users. WIth 10M entries in mind, that's about 500 entries per user. Based on RemoteQueryCommunity stats, I estimate there to be about 200 (1-in-100) Tribler users online at any given moment. That makes it to about 50K entries per online user. Getting to 1M users and 1G entries will keep the proportion roughly the same (about 1K entries per offline user and 100K entries per online user). At any moment we want the entries to be available in 10-100 copies. This further increases the perfectly distributed load to 1M-10M entries per user. Currently, 1M entries in the database take about 300MB. We can safely assume that to increase the efficiency of indexing we can up that number to 1GB DB size (additional indexes, rich tags, etc.).

So, the baseline for a truly distributed Channels system is completely feasible: 1 GB DB per user.

However, in the case of a totally replicated DB design, we face the DBs size of 100s of GBs (e.g. 80M articles DB will take about 80GB disk space). 2-4 top trackers will take about 10 GB. Search and indexing costs for DBs of such size become increasingly prohibitive.

Now, the question is: how to distribute these 10M-1G entries (and indexes) among users' machines? 🤔

Essentially, there are 3 ways to do it:

Single DB per channel, ATTACH or thread-based. See the comment above.
DHT-like distribution.
Single DB, integrate everything into it (this is how we do it now).

qstokkink commented 3 years ago

Historically, our biggest problem with large databases (~1GB) is startup time (e.g., #255, #4898). Practically, this leads to users straight-up removing their entire .Tribler dir, as Tribler is pulling in too much garbage and decimating user experience (e.g. https://github.com/Tribler/tribler/issues/4441#issuecomment-477442384).

You could probably get people to store other people's stuff by using some form of payout (like IPFS/Filecoin, or Trustchain in our case), like we have done for bandwidth. Not doing so will probably lead to the same situation as before, where users just remove their .Tribler directory.

My thoughts on this matter are: don't distribute data to users; have users pull data in out of their own volition and reward them for doing so.

ichorid commented 3 years ago

@grimadas pointed to this oldie-but-goodie(2003) on Gnutella scaling (GIA): "Making gnutella-like P2P systems scalable"

topology adaptation, flow control, one-hop replication, and biased random walks .. the combination of all of the components into a comprehensive system design .. adapts each component to be “capacity-sensitive”. .. not any single component, but in fact, the combination of them all that provides GIA .. large performance advantage".

We can steal the design but use token-passing instead of chain requests. IPv8 should benefit from node degree differentiation.

EDIT: mental note: Ultra-fast rumor spreading in social networks. The graph's degree distribution must follow the power law with base 2<b<3

synctext commented 3 years ago

The insight from this 17 year old Gnutella++ paper is still pretty relevant:

1: P2P clients are extremely transient.
2: Keyword searches are more prevalent, and more important, than exact-match queries.
3: Most queries are for hay,not needles.

However, their "dynamic topology adaptation" would mess up IPv8 pretty badly, putting a bias into a random overlay messes up load balancing. We have a Journal paper on that. Dynamic topology adaptation: "puts most nodes within short reach of high capacity nodes. The adaptation protocol ensures that the well-connected (i.e., high-degree) nodes, which receive a large proportion of the queries, actually have the capacity to handle those queries"

ichorid commented 3 years ago

putting a bias into a random overlay messes up load balancing

I would still prefer to do a Gumby experiment for that. Also, getting some statistics on node churn vs capacity in our current network would be extremely useful.

synctext commented 3 years ago

We have an experiment for exact that from a while ago, but sadly it is unmaintained. With our focus on crowdsourcing and high-quality metadata channels, we should be resilient to churn. IPv8 handles its side of things nicely.

ichorid commented 3 years ago

Ok, no need to overengineer now. This optimization is orthogonal to the current and potential Channels designs, so we can always apply it later if necessary.

ichorid commented 3 years ago

Let's say we want to design a system that, like the Web, serves unknown data on-demand. The basic peer story will work like this:

:deciduous_tree: :deciduous_tree: :deciduous_tree: : Channels Discovery Overlay bootstraps into the network by connecting to some random peers and continues to walk the network randomly. Whenever it meets a new peer, it queries it for channels that this peer serves/subscribes. (Already implemented in 7.5 by Channels and RemoteQuery community). The purpose of Channels Discovery Overlay is to fill the channels database by "walking the forest" randomly at the top level (see top layer). Coincidentally, it builds the channels popularity rating by means of VSIDS (already there since 7.3).
:rainbow_flag: Channels Availability Overlay uses channels availability data collected by Discovery Overlay to build and maintain such a set of peer connections that optimizes coverage of channels not subscribed/served by us. E.g. if we're subscribed to channel A, and there are two peers available, one of them serving {A, B} and another one serving {B, C}, the overlay will prefer to connect to the peer serving {B, C} because it increases the potential availability. (See maximum coverage problem)
:open_file_folder: The user opens a non-subscribed channel entry in the GUI. The Remote Query Community sends a request for the DB data to one or more peers known to serve the channel.
1. :open_file_folder: :question: :loudspeaker: If the channel ID is not known to Channels Availability Overlay, it will send the request to several peers at random (probably based on dissimilarity of their Bloom filters). The request can have a TTL of 2 or 3, meaning that the peers will compare the ID against their Channels Availability Overlay neighbourhoods, etc.

The Channels Availability Overlay could use some kind of set reconciliation algorithm (e.g. sketches, Bloom filters, XOR filters, etc.) to communicate its subscription set to its peers. Note that the channels subscription sets are relatively small (e.g. 10-10 000 entries) and rarely change. Also, it is probably not worth it to salt the channels PK+ID hashes. These properties will allow fitting the sketch into a single UDP packet and send it very rarely.

Tribler / tribler