ichorid commented 6 years ago

Use cases.

We strive to provide the same kind of service that is provided by:

Highly organized repositories of torrents (trackers)
Scientific articles repositories.
Music playing services, like Spotify.
Video playing services, like Youtube.
Scanned books archives, like Google Books.
Internet libraries

On fuzzy searches.

We do not try to become a distributed Google. Google is good at finding the knowledge that is relevant to words. It is used when the user does not know what he or she is looking for. Our user always knows what he or she wants. Therefore, we should disable metadata (or torrent contents) search by default, and only search by the torrent name. On the other hand, the user should be able to use metadata to refine the search results. Like, "sort by date of creation" or "sort by tag/channel", etc. This model already works perfectly for all use cases indicated above (except Youtube).

Data structurization.

We must be able to replicate the basic information organization structures that are used by our example use cases:

Tracker forum : tree-like structure, traversable both up and down. When the user finds a torrent/channel, he or she should be able to see its immediate parent/children nodes, and the "root" of the hierarchy. Thus, nested channels.
- Scientific articles database: a single record in the database should be able to hold an arbitrary amount of metadata fields, and each field must be able to be used as a base for a hierarchy. For example, top-level owner/channel "ScientificArticles" could provide "pseudochannels" for "authors" "journals" "years" "tags", etc. These are essentially selecting the metadata field that would be used to build the structure "just in time". Advanced users would be able to select several criteria at once, to find the intersection of metadata sets. Thus nested channels can fully exploit the features of the underlying relational database.
Music search: should be fast and simple. This could be only achieved by implementing a distributed cloud-based search, with some data cached at user's device. We can't get new user do download XX Gigabytes search database, especially for mobile usage (that is basically everything for music playing). One search query should take 5-30 seconds, and it's results should include related info (like, the search for a single song should get that song and reveal the link to the whole album). To solve this problem, at some point we could introduce search mining. Fast hosts producing highly relevant search results get credit.
- Libraries and book archives fall somewhere between scientific articles and tracker forum.

It is important to note that a single instance of the system is not required to serve all of these features at once. Instead, it should provide building blocks, or modules, to compose the use cases as necessary. For an average user a single database could be used to serve every type of content, but, e.g., users interested in building a huge scientific database should be able to set up a separate database instance optimized for scientific articles.

Constraints on design choices

The design should consist of several completely independent parts:

Networking subsystem
- Data transfer protocol
- Trustworthy gossip protocol
- Storage subsystem
- Database/store
- Cache management / mining algorithm
- Search subsystem
- Distributed search protocol
- Distributed search algorithm
- Data distribution algorithm
- Metadata format

It is very important that we do not make any of these parts depend on specific implementation of one another. We must be able to design the system in such a way, that it would be trivial to exchange one implementation of some part for another one.

User experience requirements:

Search results should appear faster than the user can process them.

Implementation requirements:

Scalability (should only sync neighbors, and not the whole network)

synctext commented 3 years ago

Thank you for posting you overall design online, very helpful! This design is needs to be simplified. First deploy something which works. Only then can we take the next itteration. For instance, quantify performance bottlenecks and consider optimisations with "set reconciliation algorithm" or Bloom filters.

Channels Discovery Overlay bootstraps into the network by connecting to some random peers and continues to walk the network randomly. Whenever it meets a new peer, it queries it for channels that this peer serves/subscribes.

Good idea. However, make it simpler with merely 1 message: ChannelDiscovery. Only purpose is to fill the channels viewing screen. There is no need for queries, no pub/sub. When doing random walk across this community every peer pushes a subset of their subscribed channels blindly to others or random channels (50/50 probability). However, only promote a channel if they have at least 1 content item of this channel stored locally (non-emptiness protection at protocol level). Included fields: public key, channel name, collected channel items, total number of channel items, last update date, and signature. Anything missing?

Is there documentation of your deployed "channel popularity algorithm" with the 5-stars, please include.

Channels Availability Overlay uses channels availability data

Good idea. However, make it simpler with merely 1 message: RandomChannelContent and complete randomness. If people leave Tribler on for 1 week it should have receive 1 message per second in this community. That is sufficient to start with for now. Everybody in 1 IPv8 community and everybody hears random stuff. Or are we missing functionality then? This is a deliberate conservative design. Only purpose is to randomly gossip the items within channels. Before blindly sending out random item content, do a 50/50 probability to either send subscribed channels content or non-subscribed content. Included fields: public key of channel, item (torrent hash, swarm-name, swarm size in Bytes, optional "directory" within channel, and signature of channel owner. For swarm with multiple files: something like the 25 filenames of the largest files inside the swarm are included {or less to fit inside UDP packet}. We used to have code for that.

The infinite sub-directories is not something our users are asking for, so it can be removed. This simplified the 3-layer design into a 2-layer design where each items comes with a single "directory" name optionally.

The Remote Query Community sends a request for the DB data to one or more peers known to serve the channel.

That might actually be tricky. How would that work, simple subset of 20 peers in a community? Or are we expanding the connected peers to 200 peers to have a decent probability for a hit? You seem to be thinking of doing an on-demand random walk and keeping the connection alive with peers of interest, peers with overlap in subscribed channels. We called those taste buddies 14 years ago. Are you planning to keep a list of Internet addresses of peers seen in last 10 minutes with channel they know about?

The request can have a TTL of 2 or 3

Query forwarding is something that invites abuse. How would you stop an amplification attack? 1-hop designs have proven sufficient.

Whenever it meets a new peer, it queries it for channels that this peer serves/subscribes. and (probably based on dissimilarity of their Bloom filters).

This is not a bad design, but it seems like you are proposing existing designs without being aware of them. Dispersy first was "extensive query-based", before the developer discovered that a simple blind random push is superior. There is a professor in Amsterdam who wrote publications for 12 years on how to design this gossip overlay stuff. It's advisable to know his findings, but that take several years of study. Protocol design is not something you learn in a few years. Additionally, this design is slowly drifting towards our failed Open2Edit crowdsourcing system.

The random walk and Bloom-filter design you specify there is almost an exact re-creation of a high-performance version of our prior deployed channel system? Please study that design first and indicate what you want to do different and why. We did Bloom filter usage calculations in the past and published it. Please document your design here in .MD format, we do a feedback round with the development team next week, and then implementation may start.

Expensive lessons learned by Dispersy team: Bloom filters have been proven to cut bandwidth usage in half at least. However gossip uses an insignificant amount of bandwidth: 1 Kbyte per second is already a lot. Bloom filters turned out to be fatal design mistake in our prior channel implementation: it introduced a lot of complex and impossible to debug code for a minor bandwidth saving.

Sadly these lessons of failure are not documented. I've started a scientific paper on "myths and misconceptions", design patterns to follow. This was written after 10 years of measurements and experiments with self-organising P2P systems. Period: :drum: :drum: :drum: 2001 - 2011. I'll try to finish these lessons when we have a million users, a single positive data-point.

ichorid commented 3 years ago

There are two kinds of information exchange in any information network:

Type 1: pushing unknown info
Type 2: requesting known info

Examples of Type 1 (unknown) info exchange: telling fresh gossip to your friends; sending a message to your bank telling that you want to spend your account money on something, etc. Examples of Type 2 (known) info exchange: asking your friend to recite a poem they memorized earlier, asking your bank to send you details of your bank account state, etc.

Type 1 exchange is naturally served by push-based gossip. Type 2 exchange is naturally served by client-server or DHTs. It is infinitely dumb to try to serve Type 1 requests by client-server architecture, because, as mentioned by @synctext , you'll have to query hundreds and thousands of peers at regular intervals to get some new info. This creates a horrible network traffic amplification antipattern. See Kazaa and DC++ search request flood nightmare. The same is true for the reverse situation: trying to serve Type 2 requests by gossiping around everything and trying to store all possibly required info on every node. See Bitcoin log replay problem and our current Channels with 1M+ entries.

ichorid commented 3 years ago

@synctext , to address your concerns:

Type 1 exchanges will still use blind push-based gossip, as it does now. It works perfectly, though to speed up bootstrapping we use aggressive pull-based walks. There are no plans to use Bloom filters here.
Type 2 exchanges will use RemoteQueryCommunity-based requests to get necessary info on-demand. There are no plans to us Bloom filters here either.
Bloom filters (or something like that) will possibly be used to only exchange subscribed channels lists. These will be very short (10-100 entries) and rarely change.
Using knowledge of these subscription lists, every peer will try to stabilise to a neighbourhood that serves it the best. There is a single criterion for this: cover as many popular channels that we do not subscribe as possible.

How would that work, simple subset of 20 peers in a community?

The peer will only query those neighbours that it knows to serve the channel. It knows which node serves what from the exchange of subscription lists that happens on each peer introduction.

Are you planning to keep a list of Internet addresses of peers seen in last 10 minutes with channel they know about?

Not just that, but actively selecting for the most responsive and useful peers, trying to come up with a stable neighbourhood.

Good idea. However, make it simpler with merely 1 message: ChannelDiscovery. Good idea. However, make it simpler with merely 1 message: RandomChannelContent

It already works like this since the beginning. No one is going to touch that, I assure you :wink:

Please study that design first and indicate what you want to do different and why.

Dispersy tried to sync everything. It tried to optimize that task using Bloom filters. No amount of optimization will save a design that tries to copy everything around. It is akin to downloading the whole Google index and Wikipedia to your PC, "just in case". It does not matter if you use BitTorrent or direct download, your disk will blow up :boom: Instead, we're going to serve channel contents on-demand, just like the Web does with web sites. And the only thing that is going to be blindly gossiped around is the list of subscribed channels.

Every pushed list serves 3 purposes:

Type 1 information exchange ("see, this is some stuff you probably don't know about")
Building channels popularity rating (already works thanks to VSIDS)
Allow the peer to optimize their neighbourhoods

synctext commented 3 years ago

Bloom filters (or something like that) will possibly be used to only exchange subscribed channels lists. These will be very short (10-100 entries) and rarely change. Using knowledge of these subscription lists, every peer will try to stabilise to a neighbourhood that serves it the best.

That is interesting, you meaning building a semantic overlay! That would be quite a step forward for scalability and finding long-tail content. Many tried, everybody failed so far (including us). Please first finish your 4th development iteration your existing design, before diving into research on semantic overlay. Keep iterations small please, split repairs of prior work and new feature Pull Requests. Thus: progress bar, instant-download, faster insert time as requested in a comment above. {repeating}

[ ] Subscribing to a channel works instantly + gives instant GUI feedback (both in current screen and download of channel screen)
[ ] progress bar of channel download
[ ] Download of channel is real-time and content is displayed on the screen. Each MByte downloaded on disk is both parsed, inserted and displayed in real-time. Replace the existing full-download-then-insert approach. Stall the download until the block is processed and show a progress bar (number of torrents inserted).
[ ] Users do not understand that you can click on hearts and it fetches more content. Something that needs clicking needs to be a button.
[ ] Please prioritise the above issues and put those in the release, before fixing the "SQLite record insert performance issue".

Semantic overlays are really difficult and unsolved. You can't connect to peers of interest, a design which can cope with 90% blocking by NAT boxes is hard. Rich metadata is something our users seem to want more than long-tail. (btw every peer is already announcing each subscribed channel using the channel seeding+DHT, would be good to try to re-use that in the future semantic overlay iteration)

ichorid commented 3 years ago

instant-download, faster insert time as requested in a comment above Please prioritise the above issues and put those in the release, before fixing the "SQLite record insert performance issue".

These two sentences contradict each other. I do not understand what you mean.

Subscribing to a channel works instantly

I do not understand what does it mean either. Subscription is already instant (a single click!), but the download/processing takes a long time and it is going to ever be that way until we switch to on-line channel contents fetching .

Each MByte downloaded on disk is both parsed, inserted and displayed in real-time.

I can't understand this either. What does it mean "displayed in real-time"? Like, showing thousands of entries flying through the Tribler interface each second? This is going to kill the GUI performance and daze the user. Maybe you meant showing the progress of channel download? But that is already in another point of your checklist. I am confused. :confused:

Stall the download until the block is processed

This is akin to using handbrake on your car if you need to go slower, instead of shifting the gear: dumb and wasteful way to do a simple thing. It will severely interfere with the pattern of channel torrent updates and basically kill the availability of newly-created big channels. I don't know why you keep insisting on this technical detail. If you wanted to show the progress of channel processing, that is solved by a progress bar. The only reason to stall the download is to synchronize the channel processing progress bar with the channel torrent progress bar (which the user is not supposed to see at all, and is for debugging purposes only). I don't see any other value in this solution.

you meaning building a semantic overlay!

What I propose has nothing to do with user's preferences, semantics, etc. It is just a simple and convenient scheme of predicting (or, in some sense, pre-caching) user preview queries to the most popular channels. And, coincidentally, improving the quality of remote search.

devos50 commented 3 years ago

Using knowledge of these subscription lists, every peer will try to stabilise to a neighbourhood that serves it the best.

Since you build a structured overlay where peers with similar interests (taste?) cluster together, it sounds very similar to a semantic overlay. This concept has been extensively explored by one of our former PhD students.

ichorid commented 3 years ago

I've read that work on semantic overlays. Again, I'm not going to connect to peers with similar preferences, I'm going to connect to people who, together, give the best coverage of the most popular channels. To serve previews of these channels instantly.

ichorid commented 3 years ago

Recent comments by @drew2a and some cues from how users name their subchannels (e.g. with the same names as big popular channels) point us that the concept of the Channels GUI should be redesigned to resemble Spotify, etc. The whole idea of "Subscribing" to a channel is counterintuitive and should be abandoned. Instead, the user should add the desired channel to "My channels" list. This action will trigger the actual subscription procedure behind the scene.

This redesign idea is one of the reasons why I'm hesitant to do incremental changes to the current GUI Channels design.

synctext commented 3 years ago

the concept of the Channels GUI should be redesigned to resemble Spotify, etc. The whole idea of "Subscribing" to a channel is counterintuitive and should be abandoned.

OK, abandoning the current channel subscription paradigm is quite something. Please elaborate how "My Channels" new logic would work? What do you propose instead of "subscribe" or "like"? We need to have a clear design and end-goal before starting implementation work.

Subscribing to a channel works instantly

I mean no waiting time within the GUI before it's added. Now its takes a long time before a channel subscription is processing and added as a news swarm download within the downloads panel. Channel progress bar is something else then a download bar, it start instantly and is only finished when all information is inserted in local database.

synctext commented 3 years ago

Keyword searches usually takes 3 minutes to resolve. This issue was reported before (using cool IPv8 health monitor). It might be the another components blocking the main reactor and I've posted this to the wrong issue! Possible corner cases caught: typical loyal Tribler user database of 1.9 GByte metadata.db, numerous Tribler database version upgrade for 2+ years, 9 subscribed channels, and Ubuntu 18.04.3 usage with this Jenkins build. We need to develop infrastructure to catch these bugs automatically. See below what I managed to catch (100% reproducible). First keyword search is fast (sometimes happens), then a 3 minute wait for search results to show (usual case). Suspecting something is blocking the reactor, might even be TorrentChecker. @devos50 has full debug .Tribler info and logs. Tribler_slow_keyword_search__Sep2020

Health monitor (blib that show once in a while during keyword search): Screenshot from 2020-09-17 10-12-46

devos50 commented 3 years ago

This issue seems to be caused by an unresponsive Tribler core. This can be further debugged by closely monitoring the duration of HTTP requests to the core, as reported by the application tester:

request_times_boxplot

qstokkink commented 3 years ago

There's also @egbertbouman's ipv8/asyncio/tasks REST endpoint which shows long-running tasks. Maybe also nice to poll and integrate that in the application tester at some point?

ichorid commented 3 years ago

Tested @synctext 's DB, was not able to reproduce the problem. It must be something regarding TorrentChecker (it is always TorrentChecker). The solution to this is a properly designed asynchronous threaded priority queue for DB requests.

See #4320

ichorid commented 3 years ago

The distribution of the number of subscribed channels per-peer (logarithmic, slightly distorted for visibility, w/o personal channels). X-axis - the number of channels subscribed by a peer. Y-axis - the number of peers subscribing exactly X channels. The data was collected by passively listening for Channels community since GigaChannels inception.

It is easy to see that most people subscribe to 3-15 channels max (1500 peers' data collected).

ichorid commented 3 years ago

Human attention span and free time are limited. Each other subscribed channel requires clicking on it to check. These stats resemble the stats for Twitter, Telegram and other social networks: no one ever subscribes to more than a few dozen information sources.

This gives us an excellent opportunity to build an "inverted DHT" overlay to discover services/channels in close vicinity (1-2 hops) by using progressively compressed Bloom filters.

ichorid commented 3 years ago

The network is the index.
A node's degree (the number of connections) is extremely limited (e.g. 10-50 ). Live with it ...

ichorid commented 3 years ago

The idea for the new Channels design is described in #5721

ichorid commented 3 years ago

There are two kinds of links to information:

Hash-links, i.e. infohashes
Human-links, i.e. human-made ontologies, collections, opinions, etc.

The typical media collection contains human-links as intermediate nodes and hash-links as leaves. A good example is that of a file system: humans give names to directories and files, but the files are typically binary data. Also, the interpretation of human-links is subjective. For instance, in a typical "private data collection" scenario a folder with a boring name like work stuff / more work stuff / qweasd can contain a collection of images not quite related to work...

A hash-link is easy to check and spread in a distributed system. The copies can be easily cross-verified between the peers (e.g. by using a Merkle tree). A human-link can only be verified by another human, and that is subject to human consensus.

Any interaction between a human and information aggregation system (such as Tribler, or Wikipedia, or Instagram, or Facebook, or torrent tracker, or Spotify, etc) boils down to the user finding their way from the root of the index to a satisfying leaf. The process is known in marketing as "the Sales Funnel".

Our "Discovered" tab and Search feature are the roots of two different kinds of indexes:

the Discovered tab shows an index based on the "Popularity" measure at the top and then breaks down into the user-created folders system
the Search feature shows an index based on aggregating the "FTS" score from the peer's nearby nodes

An unlinked hash-link is useless, as it is unreachable. However, hash-links are often reused (e.g. pictures or videos in a CDN). This means that we need to provide public key signatures only for human-links. Binary data, such as pictures, icons, sounds, etc., should be addressed and cached by hash-link only.

In short, hashes don't lie. People do.

synctext commented 3 years ago

In 2 months it will be the 3 year birthday of the AllChannels big rewrite plans. @ichorid :confetti_ball: :three: :cake: Please focus on including the following features in a Tribler release of these months.

Sprint	Milestone	Description
March 2021	1 million item	Markdown support and simple skeleton TFTP binary transport inside channels backend
	seedbox	@drew2a Fill this channels with nearly TeraByte of Creative Commons Music
	Remote search	Determine the keyword search latency performance with lots of music tracks
April 2021	instant usability	Partial seeding of a big channel with many mdblobs. Download 250MByte of random .mdblob files and seed them. This ensures channels can grow in number of items for a while, we nicely export this problem to Libtorrent. Random partial seeders ensure full availability of channel info.
	instant download	Gossip and instant Magnet link. When receiving the latest SHA1 hash of a channel inside the gossip network it is possible to instantly start a Bittorrent download. Currently another step is going on in the background which is unclear to the user.
May 2021	Full markdown	Create a simple markdown viewer inside Tribler. Enable browsing markdown files inside a Tribler channel.
	Wikipedia import	Convert the Mediawiki content of wikipedia into markdown. Create a single static Wikipedia Markdown dump in example Tribler channel. See offline Wikipedia for inspiration.
June 2021	Magnet markdown	Support any magnet link with picture, audio or video content. External magnet links to content outside the channel are now supported. Markdown viewer initiated background loading and refreshes when download is complete. Any magnet link embedding and playback should be supported, just like we have Youtube .JPG preview in this example (however, it opens in an external player from Youtube). This step is like building HTML5 inside Tribler using markdown and VLC or some player. (Detailed design still to be discussed; content internal to channel versus external to channel. For now only support content internal to channel!)
July 2021	daily pinning	Every channel update changes the channel torrent's infohash. That will break down quickly :face_with_head_bandage:. Top-level swarm of channel is fixed daily. Real-time updates are send using gossip. All the gossip of a single day is bundled into a new series of .mdblob and everybody subscribed switches 1 time every day to single new top-level hash. This is to ensure we keep as close as possible to 1MByte .mdblobs
2021-2026	Global Brain	Incrementally grow content collection with static example channels. Let the community use our features, we showcase what is technically possible. Grow with more content and broader ecosystem. From importing all know Creative Commons content to all open-access medical and scientific based publishing

ichorid commented 3 years ago

@synctext , looks like a solid plan! However, I can't understand what you mean by:

Partial seeding of a big channel with many mdblobs. Download 250MByte of random .mdblob files and seed them. This ensures channels can grow in number of items for a while, we nicely export this problem to Libtorrent. Random partial seeders ensure full availability of channel info.

The "partial seeding" thing is especially unclear to me in the context of the "instant usability" milestone. If I understand correctly, you propose seeding some channel torrents speculatively to facilitate channel availability (1) and reduce the "waiting for full channel contents" time (2). Also, to solve the channel scaling problem (3) you propose seeding random subsets of mdblobs from big channels (4).

Now, let's analyse those ideas from the standpoint of the current architecture ("append-only log -> SQLite"):

"Increase channel data availability by seeding biggest channels speculatively" - indeed, this will increase the availability of the seeded channels. However, we'll have to devise a rule on how to select the speculatively-downloaded channels. Perhaps, the current channel popularity heuristic will fit the bill. :+1:
"Reduce the waiting time before the channel is shown to the user" - this will marginally increase the perceived channel serving speed. Currently, for a reasonably fast PC, the ratio of time spent downloading vs processing a channel is about 1:50. For slower PCs, it is even more skewed towards processing. So, pre-downloading the channel will reduce the overall channel "download+process" time by less than 2%.
"The channel scaling problem" has two sides: indexing and updating. Indexing. A user will never use or even look at more than 1% of torrents stored in a million-torrents channel. The reason is, the Channels 2.0 architecture have to store all the stuff in a single database to index it. Updating. The rate of updates of a big metadata collection is proportional to its size. A big bittorrent tracker adds hundreds of torrents daily. Trackers compete for users with both quality and frequency of updates. Many people still prefer Usenet to bittorrent because of how fast new stuff appears there. The problem is, every channel update changes the channel torrent's infohash. Pushing more than a couple of updates a day results in a trail of dead swarms. Also, it results in horrible data amplification: a 200 bytes update requires megabytes of service BitTorrent traffic. As a result, the current Channels architecture can't scale to the size and service quality of a centralised, web-based BitTorrent tracker.
"Partial seeding of channel data" can't solve the scaling problem (see the point above). Indeed, partial seeding can solve the problem of disk storage of mdblobs. But every client will still have to store the full channel in the database, because in current architecture there is no way to split database indexes between peers. (Channels 3.0 architecture was born from this understanding, BTW :wink: ).

To sum it up, while seeding top channels speculatively would enhance the user experience, partial seeding will not achieve any positive impact.

ichorid commented 3 years ago

Also, to clarify, what did you mean by:

Gossip and instant Magnet link. When receiving the latest SHA1 hash of a channel inside the gossip network it is possible to instantly start a Bittorrent download. Currently another step is going on in the background which is unclear to the user.

Did you mean creating a feature that enables the user to "subscribe" to an actual download, for e.g. some TV series and receive the update automatically?

synctext commented 3 years ago

Sorry for being to short and cryptic. With partial seeding I mean to download a part of all .mdblob files of a single channel. Each file is still download 100%, but you don't have all of them. Libtorrent signals with want/have messages what blocks are available at which peer. Thus combined with PEX messages its clear who is around and who was what pieces. Next step is inserting the information directly into the local database after download and processing. Each 1MByte .mdblob also can be inserted atomically into the local database. Such "incremental insert" is also a big refactoring I believe of the current code. Does it now make sense?

The problem is, every channel update changes the channel torrent's infohash. Pushing more than a couple of updates a day results in a trail of dead swarms.

Yes, that is an easy problem to solve. We can post-pone till after markdown integration. daily pinning Top-level swarm of channel is fixed daily. Real-time updates are send using gossip. All the gossip of a single day is bundled into a new series of .mdblob and everybody subscribed switches 1 time every day to single new top-level hash. [EDIT of roadmap above, new July and beyond task]

ichorid commented 3 years ago

Yes, it makes sense now. Regarding the daily pinning, this will effectively make channel torrents a supplemental way of spreading the channels, targeted at new users. Also, it will still be compatible with Channels 3.0

Each 1MByte .mdblob also can be inserted atomically into the local database.

Well, mdblobs are inserted incrementally at the moment. But the hard requirement is they must be inserted sequentially. Otherwise, we'll have to use CRDTs, which seem to be overkill for the task.

Also, even if we manage to be able to insert the blobs non-sequentially, this would still give almost no speedup (or I don't understand what you propose again). If I understand correctly, you propose downloading a random subset of blobs (that is possible even now) and processing them (this will require something like CRDTs). Now, let's say processing a channel takes 1 hour. A user downloads 5% of the channel's mdblobs in the background and processes them. This takes about 3 minutes of Tribler working in the background, which is fine. However, if the user subscribes to the channel, they would still have to download and process the remaining 95% of the blobs, which will take 57 minutes. So, speculatively preprocessing a tiny fraction of mdblobs makes no sense to me.

synctext commented 3 years ago

ahh, nice! Sequential inserts should be no problem. Minimal changes, excellent. The problem is: how can we best package the entire markdown-based Wikipedia (English,Spanish,German, etc.) into a single channel? How much .mdblobs would we have?

ichorid commented 3 years ago

The torrent for multistream (seekable) English Wikipedia articles archive is about 17GB of bz2-compressed data. It is assumed that the full unpacked size is about 80GB. Say, we store the stuff in DB, Brotli-compressed, indexing article titles only. That's about 20GB of mdblobs on disk and about 30GB in the database (the overhead of 10GB is from indexes, titles, signatures, etc). That is 10x our biggest channels experiments and 6x all the channels data out there, combined.

Now come the updates... It is hard to estimate how many pages change daily on English Wikipedia, but the number of daily edits is about 200k, and the page churn is about 1k/day. So, assuming an average of 50 edits per page per day, about 20k pages change daily. That's 600k pages changed per month. Again, I did not find any info on the distribution of page retention time, but from those numbers, we can roughly estimate a monthly change rate of 10% of 6M articles.

So, in the end, the whole English Wikipedia (text only) will take about 60GB on disk. It will take about 40 hours on a fast modern PC to upload it into the database in the current architecture. Wikipedia search results will be shown alongside other channels results. Updating Wikipedia will take about 1-2 hours per month on a fast PC.

ichorid commented 3 years ago

Mental note: a very nice set reconciliation algorithm.

How the structure of Wikipedia articles influences user navigation

Search strategies of Wikipedia readers

English Wikipedia Clickstream dataset

ML simulation for Wikipedia search

synctext commented 3 years ago

ToDo note for future. Seedbox activity. Limit to Max. 2-3 weeks of effort. As a university we want to showcase the strengths of our technology, while minimising engineering time. static dumps only :exclamation: Dynamic updates are too difficult.

Cultural content: free music, creative commons
Science demo, 1 channel with lots of Arxiv papers (.PDF)
Knowledge demo, 50k markdown files. View inside Tribler? Best of Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/5

synctext commented 3 years ago

Arxiv content for big static channel: https://arxiv.org/help/bulk_data https://www.kaggle.com/Cornell-University/arxiv?select=arxiv-metadata-oai-snapshot.json https://github.com/mattbierbaum/arxiv-public-datasets https://phoenix.arxiv.org/help/bulk_data

drew2a commented 3 years ago

Wiki dumps: https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia

drew2a commented 3 years ago

Downloading of Arxiv's archive will costs us for about 100$ (based on information from https://github.com/mattbierbaum/arxiv-public-datasets#pdfs-aws-download-only)

drew2a commented 3 years ago

Seedbox activity is on pause now.

Cultural content: We are waiting for a server with enough free space to deploy the seeder.
Science demo, 1 channel with lots of Arxiv papers (.PDF): The archive is not free.
Best of Wikipedia: Needs a lot of @ichorid time (at least 2 months).

synctext commented 2 years ago

Channels are permissioned, that needs to change. Brainstorm target if tag experiment is successful: scientific journal without an editor-in-chief. Peer-review driven. Use markdown, picture and infinite amount of PDF attachment with crowd sourcing magic. The exact magic is our university goal: permissionless science. Leaderless science.

Tag terms of both journal and individual PDF files become the -exclusive- terms for discoverability. Human created tag terms dramatically enhance scalability versus all keyword from abstract as search space for all scientific papers. Stepping stone for 'decentral Google' ;-)

ichorid commented 2 years ago

Permissionless crowdsourcing is incompatible with Channels 2.0 architecture that is based on torrents. Channels 3.0 architecture fits permissionless perfectly, though.

devos50 commented 1 year ago

Early impression of content aggregation:

Schermafbeelding 2022-09-22 om 10 49 00

This is where my user journey died 😄

Schermafbeelding 2022-09-22 om 10 52 46

synctext commented 1 year ago

great stuff :clap: Suggestions for next iteration:

Focus on both "linux" and "Ubuntu" search.
- Linux is a tag given to some Ubuntu swarms plus Debian-Jessie
- Unique ability of search results beyond filename-keyword string matching
- Closing the semantic gap :1st_place_medal: :godmode: :microscope:
only display 1 snippet with aggregation, we duplicate Big Tech Google snippet and knowledge panel
- not 6 items, 6 items, 3 items, 3 items.
- only 1 snippet with 6 items, others are in traditional non-aggregated view.
- all top-4 items in a snippet are clickable on search result page (really duplicate what TUDelft search gives us: 4)
- beyond top-4 the user is forced to make the effort of another click on "other" or something
- guide user into clicking on 1st item into that 1st aggregated view.
- Therefore 1st item is most prominent on entire search result screen
- Use spacing trick that Google also applies to make "Opleidingen", "bachelors", "bekijk alle masters" very prominent (see picture below)
- Use highlighting of the "search term" inside swarm results using boldness. Duplicate Google:
- possibly remove the aggregation title, replace with clickable swarms and highlighting????
what does a user expect that will happen when clicking on the bold "snippet header"
- Just download already :fast_forward:
- Expand my view redraw entire screen :confused:
Can you think about the ideal outcome from popularity community statistics? 98% correct, 98% reliable, 98% accurate download statistics for each swam for a 1MByte test sample. You know the expected download speed and swarm size. Explain this visually to any TikTok user please :pray:
Note the GUI has bad coding smell. Its spaghetti code, hack build upon hack, build upon a hack. See 'brain-dead refresh function which works for some reason'. Developers escape to the core and only cleanup code there...
Google approach for inspiration. Really lots of information overload when they are certain that you are looking for this thing.

devos50 commented 1 year ago

The next prototype we designed:

Schermafbeelding 2022-09-23 om 15 32 10

The number of snippets shown, and the number of torrents per snippets can be configured through global variables (default number of snippets shown: 1, default number of torrents per snippet: 4). We decided to keep the information in the snippet to a minimum. A torrent that is included in a snippet is not presented in another row as a "regular" torrent.

devos50 commented 1 year ago

Now that the GUI part of the snippet is mostly figured out, we had a brief discussion on the next step(s).

The first iteration(s) will focus on bundeling torrents under similar content items. This requires changes to the underlying tags protocol and database structure.
At one point, we most likely want to perform a remote search to obtain content items that are not out our local database yet. But we're not sure how to integrate this in the search panel since searching is an asynchronous operation. We will tackle this problem later.
@drew2a will start working on a database format for content items. When a user conducts a search, this content database is searched first in order to build the snippet.

synctext commented 1 year ago

The upcoming sprint is a good idea to do a deep dive into the Linux query only. So do something that does not scale, manually tune and tweet just a single query and make dedicated regular expressions for it. Work until it is really perfect. Then everybody learned something from actual swarms, turning dirty data into perfect search. Attempt at doing 3 iterations deployed in Oct, Nov and Dec?

devos50 commented 1 year ago

Sprint Update (Information Architecture for Tribler)

PR #7070 contains the first version of the work described on both this ticket and #6214. This is also a key step towards #7064. Below are some of the insights/design decisions during this sprint.

Last sprint we decided to focus on snippets that bundle torrents that describe the same content. This sprint focussed on building the storage backend. We decided to keep our design as simple as possible while not trying to limit ourselves too much for further iterations. We will in Tribler maintain a knowledge graph with content items, torrents and tags. See the visualisation below.

tribler_kg

We believe this structure is sufficient enough to improve search results by showing snippets, and for further iterations in which we can improve other aspects of Tribler. In the underlying database, this graph is stored as a collection of statements in the format (subject, predicate, object). We took inspiration from existing work in the semantic web, also to establish some language when building new algorithms on top of our infrastructure. For example, a content item - torrent relations is represented as a statement (“Ubuntu 22.04”, “TORRENT”, <infohash>). A torrent-tag relation is represented as a statement (<infohash>, “YEAR”, “2013”). We have updated various names in the Tribler code base to reflect these changes in knowledge representation.

We do not allow arbitrary predicates in the knowledge graph. To represent knowledge, we use the Dublin Core (DC) Metadata format and integrate it with the knowledge graph visualised above. This format gives us a set of attributes that are used to describe torrents, including year, data and format. The predicate field in each statement can take one of the 15 values given by DC. We also have two special predicate fields, namely TORRENT to describe content item - torrent relations, and TAG for backwards compatibility.

In addition to the above, the tag generator has been updated to automatically generate tags for Ubuntu-related content.

There are a few points we can focus on during the next sprint:

Improve the snippet generation algorithm, e.g., a search for linux should also give Ubuntu content.
We should work towards a GUI design for editing torrent/content item metadata.
The networking backend should be updated.
Finishing #7070 (testing, migrating databases, etc).
Scalability/validation experiment on the DAS6.

One of the design decisions is whether to add a FTS index to the strings in tuples. This would allow efficient searches in tuples (which I suspect will be a key aspect of querying the knowledge graph) but adds more complexity. We need an experiment to verify the speed-ups.

drew2a commented 1 year ago

Automatic Content Item generation (Ubuntu, Debian, Linux Mint):

import re
from re import Pattern

from tribler.core.components.tag.rules.tag_rules_base import Rule, RulesList

space = r'[-\._\s]'
two_digit_version = r'(\d{1,2}(?:\.\d{1,2})?)'

def pattern(linux_distribution: str) -> Pattern:
    return re.compile(f'{linux_distribution}{space}*{two_digit_version}', flags=re.IGNORECASE)

content_items_rules: RulesList = [
    Rule(patterns=[pattern('ubuntu')],
         actions=[lambda s: f'Ubuntu {s}']),
    Rule(patterns=[pattern('debian')],
         actions=[lambda s: f'Debian {s}']),
    Rule(patterns=[re.compile(f'linux{space}*mint{space}*{two_digit_version}', flags=re.IGNORECASE)],
         actions=[lambda s: f'Linux Mint {s}']),
]

synctext commented 1 year ago

Just a reminder to deeply understand what it means to be future proof

Date	Tribler version	Description of database of Tribler
20 Sep 2005	initial import	Read the configuration file from disk
29 June 2011	v5.3.x	We ship with a full SQL database engine `CURRENT_MAIN_DB_VERSION = 7`
4 Dec 2012	v6.0	Lots of upgrades in 1.5 years `CURRENT_MAIN_DB_VERSION = 17`
25 june 2015	v6.5.2	complex DB versioning support `TRIBLER_65PRE4_DB_VERSION = 28`
11 June 2016	v7.0.0	Implemented upgrade process of FTS3 to FTS4
18 June 2020	v7.5.0	everything gets changed with gigachannels `BETA_DB_VERSIONS = [0, 1, 2, 3, 4, 5]`
28 Oct 2021	v7.11	tagging added: 1st metadata crowdsourcing `gui_test_mode else "tags.db"`

Lessons we never learned, but endured. 1) use mature libraries 2) REPEAT: avoid the latest cool immature tooling 3) never do a complete re-write 4) have an explicit roadmap with intermediaries goals 5) let the dev team make roadmap to ~~ensure buy-in~~ add realism. 6) what are you optimising exactly? 7) important final point: always remain stable; no SegFaults or technical debt

devos50 commented 1 year ago

As we are getting #7070 ready for merge, we made a few additional design decisions:

Initially we sorted the snippets based on the number of torrents inside. We changed this to sort the snippet based on the top-seeded torrent inside each snippet. This should push popular, well-seeded content items higher up in the search results.
For now we generate snippets from the first 50 search results. Since the search results should already be ranked based on relevance (which will be improved more when #7025 is merged), these first 50 search results should be sufficient to build meaningful snippets. Next sprint iterations could focus on improving the snippet generation algorithm and doing more active searches in the knowledge database.
The TORRENT resource type will be removed and we will use the TITLE relation instead.

devos50 commented 1 year ago

Sprint Update

7070 has been reviewed and merged, and is ready for deployment with our 7.13 release 🎉. 7.13 is scheduled to be released on Nov 1.

Next steps:

Extend the GUI to allow editing of torrent metadata (not necessary for 7.13 release though). Draft PR: #7099
Now that that the fundamental layer for content management has been designed, we can discuss further extension to the core algorithm, e.g., leveraging knowledge when searching (beyond snippets).

Tribler / tribler

Redesign of the Search/Channels feature #3615

Use cases.

On fuzzy searches.

Data structurization.

Constraints on design choices

User experience requirements:

Implementation requirements:

Sprint Update (Information Architecture for Tribler)

Sprint Update

7070 has been reviewed and merged, and is ready for deployment with our 7.13 release 🎉. 7.13 is scheduled to be released on Nov 1.