Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

Swarm size community: content popularity #2783

Closed synctext closed 6 years ago

synctext commented 7 years ago

The popularity of content is essential. Currently we do not have a method to determine whether a magnet link points to a large swarm or a dying swarm.

This work enables novel caching approaches, see "Efficient analysis of caching strategies under dynamic content popularity".

Each peer in the network must contact a swarm to determine its size. Checking 1 swarm every 5 seconds to conserve bandwidth means it takes many hours to check several thousand swarms (5000 swarms at 5 seconds each is roughly 7 hours). It is possible to share this information with your neighbors. By gossiping this swarm size information around, each swarm needs to be checked only once and bandwidth is saved.

The swarm size community is a mechanism to collaboratively determine the size of swarms. Going from a magnet link to a swarm size involves numerous steps. Resolving a magnet link means using the DHT. Checking a swarm and doing a BitTorrent handshake also takes numerous UDP packets and requires dealing with non-responsive IPv4 addresses.

Experiments:

Science:

synctext commented 7 years ago

first startup task:

MaChengxin commented 7 years ago

I found that the official tutorial of libtorrent is a good reference, though the documentation isn't entirely consistent with the code. I am running the sample code in nix-shell and it uses libtorrent-rasterbar-1.1.1. (The one in Ubuntu's package repository is a bit old. I can also build it myself, but the building process looks complicated according to the doc.) The sample code is here. It has only downloading functionality and prints the status every second. A .torrent file is used as the command line input. I am wondering what you mean by "downloads a single piece of this swarm": does that mean downloading a piece of content that the swarm is sharing, or the info of the peers existing in the swarm? I am also confused about "print total swarm piece availability": does it mean printing out the availability of every piece of the content?

MaChengxin commented 7 years ago

Task almost finished. The screenshot (2017-04-05) shows: download progress, number of peers, number of seeders, number of leechers, and total swarm piece availability. Reference: https://github.com/Tribler/tribler/blob/90b85b55e80a2e6713dea85519a5ee29627ce9ea/Tribler/Core/Libtorrent/LibtorrentDownloadImpl.py
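
For reference, a minimal polling sketch of this task looks roughly as follows (an illustration, not the exact script behind the screenshot; it assumes the libtorrent 1.1.x Python bindings and that the binding exposes `torrent_handle.piece_availability()`):

```python
import sys
import time
import libtorrent as lt

ses = lt.session()
info = lt.torrent_info(sys.argv[1])                  # .torrent file from the command line
handle = ses.add_torrent({'ti': info, 'save_path': '.'})

while not handle.status().is_seeding:
    s = handle.status()
    # num_peers/num_seeds count connected peers only; num_complete/num_incomplete
    # are the tracker scrape numbers (-1 until a scrape response arrives).
    availability = handle.piece_availability()        # copies seen per piece
    print('progress %.1f%% | peers %d | seeds %d | scrape seeders %d | '
          'scrape leechers %d | total piece availability %d' %
          (s.progress * 100, s.num_peers, s.num_seeds,
           s.num_complete, s.num_incomplete, sum(availability)))
    time.sleep(1)
```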

synctext commented 7 years ago

Can we obtain accurate statistics from libtorrent? We want the raw number of packets and their length. Sadly, peer_info's total_download only counts payload bytes: "the total number of bytes downloaded from this peer. These numbers do not include the protocol chatter, but only the payload data".

There are session statistics with net.sent_payload_bytes and net.sent_ip_overhead_bytes.

It seems accurate message counters are kept:

| name | type |
|---|---|
| ses.num_incoming_choke | counter |
| ses.num_incoming_unchoke | counter |
| ses.num_incoming_have | counter |
| ses.num_incoming_bitfield | counter |
| ses.num_incoming_request | counter |
| ses.num_incoming_piece | counter |
| ses.num_incoming_pex | counter |
| ses.num_incoming_metadata | counter |
| ses.num_incoming_extended | counter |
| ses.num_outgoing_choke | counter |
| ses.num_outgoing_unchoke | counter |

Next step would be to determine the exact number of packets and bytes for downloading various Ubuntu images: the whole chain from magnet link to first downloaded piece.

The goal is single-byte-accurate statistics. Then we can calculate optimal strategies for connecting to swarms, sharing the swarm size handshake, and crawling.

http://torrent.ubuntu.com:6969/file?info_hash=%2B%90%24%1F%8E%95%D5%3C%AC%F0%F8%C7%2A%92%E4kW%91%16%00
http://torrent.ubuntu.com:6969/file?info_hash=c%2B%8Dw%2A%90I%BD%27j%20%10%B2Q%B1J%5Cd%FC%DC

Final note: swarm size is not equal to the number of people that have seen this online video. Media consumption (content popularity) or downloads over time are very hard to estimate or measure. You would need to know the average in-swarm time, average downloading time and age of the swarm to calculate the media consumption.
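
Rough approximation (added for clarity, not a measurement): by Little's law the rate at which downloaders arrive is λ = L / W, with L the average swarm size and W the average in-swarm time, so total downloads over the swarm's age T are roughly λ · T = L · T / W. All three quantities have to be estimated, which is why media consumption is hard to derive from a single swarm snapshot.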

MaChengxin commented 7 years ago

> The goal is single-byte accurate statistics. Then we can calculate optimal strategies for connecting to swarms and sharing the swarm size handshake and crawling.

Can you please elaborate on how optimal strategies can be calculated based on "single-byte accurate statistics"?

Meanwhile, I was reading some literature about how to get (near) real-time info on the number of seeders and leechers and came across "DHT crawling". One of the references is a presentation found here: https://www.defcon.org/images/defcon-18/dc-18-presentations/Wolchok/DEFCON-18-Wolchok-Crawling-Bittorrent-DHTS.pdf (the attached screenshots show its overview and the popularity of a specific piece of content over time). I am wondering if "DHT crawling" is in line with what you think about determining swarm size. If yes, I will read more about that topic to get more insight.

MaChengxin commented 7 years ago

Found a BitTorrent DHT crawler example (also in Python) on GitHub: https://github.com/nitmir/btdht-crawler Will try this first to see how seeder/leecher info is retrieved from the DHT.

MaChengxin commented 7 years ago

Two more references: https://github.com/dontcontactme/simDHT https://github.com/blueskyz/DHTCrawler

-- Update -- Sun 16 Apr 2017 21:32:22 CEST -- simDHT just prints the infohash and the seeder's IP address and port number. It shows nothing about the number of seeders and leechers. DHTCrawler shows the infohash and the "popularity" (as claimed in its README) of the torrent, so this one could be helpful. (See screenshots.)

MaChengxin commented 7 years ago

@synctext I have some doubts before proceeding with the experiment. Our ultimate goal is to fetch info about the number of seeders and leechers (namely the swarm size) from the DHT to show the real-time popularity of torrents. Naturally, one possible solution path would be:

  1. make a peer detect the number of seeders and leechers of a certain torrent
  2. make a peer detect the number of seeders and leechers of multiple torrents
  3. make multiple peers share their information by means of gossiping or something

The program made in the first startup task can only show the number of seeders and leechers that the peer is currently connected to. It has no knowledge about how large the total swarm size is. So I prefer to solve this problem first. Two questions:

  1. Do you agree with the solution path above?
  2. Why do we desire the raw number of packets and length (single-byte accurate statistics)?
synctext commented 7 years ago

Indeed, crawling or seeing the whole swarm is difficult. We can only partially see swarms. Then we can measure various swarms and see their relative size. With the age plus current swarm size we estimate popularity.

please ignore DHT crawling, too much spam.

MaChengxin commented 7 years ago

Current progress: I tried with a Raspbian ISO torrent instead of an Ubuntu one (will change it later). Now the program stops as soon as some pieces are downloaded, but it may not be exactly one piece (could be two or more). From the downloading process, we can get the following info:

from the torrent status. It is likely that the overhead can be calculated by subtracting total payload download from total download. The screenshot shows what can be acquired.

It still requires some effort to figure out which pieces of info matter, and why total download and total payload download are 0 when the number of downloaded pieces is 2.
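
A quick way to test that overhead hypothesis (a sketch using the documented torrent_status fields, not code from the experiment):

```python
def protocol_overhead(handle):
    """Return (payload_bytes, overhead_bytes) for a libtorrent torrent_handle.

    Assumes total_download counts all received bytes and total_payload_download
    counts only piece payload, as documented for torrent_status.
    """
    s = handle.status()
    return s.total_payload_download, s.total_download - s.total_payload_download
```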

MaChengxin commented 7 years ago

Succeeded in retrieving statistics (session stats, see screenshot), but failed to interpret them.

See this issue for more info: https://github.com/arvidn/libtorrent/issues/1946
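
For context, this is roughly how the counters can be requested from Python (a sketch; it assumes the binding exposes post_session_stats() and that session_stats_alert carries its counters as a name-to-value dict, which differs between libtorrent versions):

```python
import time
import libtorrent as lt

ses = lt.session()
# ... add torrents / start downloads here ...

ses.post_session_stats()               # ask libtorrent to emit a session_stats_alert
time.sleep(1)
for alert in ses.pop_alerts():
    if isinstance(alert, lt.session_stats_alert):
        counters = alert.values        # assumed: dict of metric name -> counter value
        for name in sorted(counters):
            if name.startswith('ses.num_incoming') or name.startswith('ses.num_outgoing'):
                print(name, counters[name])
```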

synctext commented 7 years ago

Next step is to get this operational and count the exact number of packets + their type. Goal is to measure all packets starting from a magnet link until exactly the first block is completed.

When this information from Libtorrent is combined with the packet length we get a sufficiently accurate picture. Experimental results for thesis: do this for 100-ish legal magnet links (e.g. magnet links from all Ubuntu torrents + other sources).

MaChengxin commented 7 years ago

After some effort it is finally possible to interpret the session statistics. The most relevant commit is: https://github.com/MaChengxin/expBT/commit/d0ef388aa99dbea6248292601a0ada6b71840492 In this commit, BitTorrent messages of various types are counted. According to Understanding BitTorrent: An Experimental Perspective, message types (described on pp. 13-14) in BitTorrent version 4.0x include:

| Type | Size (bytes) |
|---|---|
| HANDSHAKE (HS) | 68 |
| KEEP ALIVE (KA) | 4 |
| CHOKE (C) | 5 |
| UNCHOKE (UC) | 5 |
| INTERESTED (I) | 5 |
| NOT INTERESTED (NI) | 5 |
| HAVE (H) | 9 |
| BITFIELD (BF) | ceil((# of pieces)/8) + 5 |
| REQUEST (R) | 17 |
| PIECE (P) | 2^14 + 13 (if the block size is 2^14) |
| CANCEL (CA) | 17 |

Most of them can be measured using libtorrent, as we can see in the commit above.
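
Those per-message counters can be turned into an estimated byte count with the sizes from the table; a hypothetical helper (not from the expBT commit) might look like:

```python
# Map libtorrent's incoming-message counters to the fixed message sizes above.
MESSAGE_SIZES = {
    'ses.num_incoming_choke': 5,
    'ses.num_incoming_unchoke': 5,
    'ses.num_incoming_have': 9,
    'ses.num_incoming_request': 17,
    'ses.num_incoming_piece': 2 ** 14 + 13,   # assumes 16 KiB blocks
}

def estimate_protocol_bytes(counters):
    """counters: dict of metric name -> message count (e.g. from a session_stats_alert)."""
    return sum(count * MESSAGE_SIZES.get(name, 0) for name, count in counters.items())
```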

MaChengxin commented 7 years ago

The next steps would be to:

MaChengxin commented 7 years ago

The experiment is running now using torrent files instead of magnet links (temporarily, otherwise it would be too slow). Maybe we need to add a timing function for each attempt to download the first piece? Time is also an important factor IMO.

synctext commented 7 years ago

true. time for dead torrent filtering also

MaChengxin commented 7 years ago

Time measurement is done: https://github.com/MaChengxin/expBT/commit/5a8f3293d1daafb2cfc1b72552bd08d3662e535f After running the experiment, I have a folder containing JSON files, each of them holding the stats for a single torrent. I am wondering how we can make use of these stats to draw some meaningful conclusions. What info do we expect to retrieve from the data?

MaChengxin commented 7 years ago

First trial of stats analysis: plot the histogram of the first-piece download time of all the Ubuntu images (see figure).

We see that most torrents can download the first piece within 10 seconds.

MaChengxin commented 7 years ago

Update: reduced the download time for each torrent by changing the way of checking pieces from time-driven to event-driven (see updated histogram).
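
A rough sketch of what the event-driven variant looks like, assuming "event-driven" means waiting for libtorrent's piece_finished_alert instead of polling handle.status() every second (illustrative, not the expBT code):

```python
import libtorrent as lt

ses = lt.session()
ses.apply_settings({'alert_mask': lt.alert.category_t.progress_notification})
# handle = ses.add_torrent({'ti': lt.torrent_info('some.torrent'), 'save_path': '.'})

while True:
    ses.wait_for_alert(1000)                      # block for at most 1000 ms
    for alert in ses.pop_alerts():
        if isinstance(alert, lt.piece_finished_alert):
            print('first piece completed:', alert.piece_index)
            raise SystemExit                      # stop as soon as one piece is done
```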

MaChengxin commented 7 years ago

Personal repository for the experiments: https://github.com/MaChengxin/expBT

MaChengxin commented 7 years ago

Just noticed that there is a page describing this issue with many details: https://www.tribler.org/SwarmSize/

synctext commented 7 years ago

Real swarm measurements. Roughly 15 KByte-ish of cost for sampling a swarm (does this include received bytes?). Uses magnet links only. 160 Ubuntu swarms crawled (see figure). Experiments are serial: only 1 swarm at a time, with a 500 second timeout per swarm or abort when the first piece is completed.

synctext commented 7 years ago

It seems to work! Further polish the plots made during the measurements (many more here). Next step is to find more magnet links and run a detailed measurement of the correlation.


keep reading the related work

Thoughts: downloading 1 piece gives a good correlation. How good is downloading 16 pieces?

Does our content popularity estimation get more accurate if you spend more bandwidth and time within a swarm?

Dispersy community design sketch: create a community in which every peer starts a new content popularity check every second. The outcome of this check is shared across the Dispersy overlay with 10 neighbors (fanout). That's it. Each peer in the network now obtains UDP packets with popularity stats on several swarms (see design sketch photo). Future steps: align with the existing TorrentCollecting in Tribler, AllChannel, and channels in general for spam prevention.

MaChengxin commented 7 years ago

The design sketch of the Dispersy community is almost done: https://github.com/MaChengxin/tribler/tree/swarm_size_community/Tribler/community/swarmsize Currently a swarm size detector simulator is used to mimic the behavior of the real one, because the real one still needs some polishing-up work. Once it is done it can easily replace the fake one.

MaChengxin commented 7 years ago

Result of the experiment with the swarm size detector simulator: https://jenkins.tribler.org/job/pers/job/SwarmSizeCommunity_Chengxin/32/artifact/output/localhost/node321/00000.out/*view*/ The nodes in the community are capable of receiving the results measured by other nodes. Note that the results are generated randomly, and will eventually be replaced by actual data.

MaChengxin commented 7 years ago

Questions

My task has three parts:

  A. Distribute the entire work to different nodes (i.e. divide thousands of torrents into smaller sets and assign each set of torrents to nodes)
  B. Each single node solves its assigned task (i.e. estimates the swarm size from statistics or just asks the tracker)
  C. Aggregate the partial results (i.e. nodes share their information)

The questions are:

  1. For B I will invest more time; for C I already know that I can use a gossip protocol to share the info. However, I am not very clear about A. How can I do that in an elegant way?
  2. Between B and C, which subtask is more important?
synctext commented 7 years ago

Solution: no scheduling, division of work or result aggregation.

The proposed architecture in the design sketch above is that each node just randomly checks 1 swarm per second. A total of 10 random content popularity checks are then shared with random neighbors. With this probabilistic approach it is highly likely that duplicate checks are conducted, but the code is simplified.
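
A toy simulation of this probabilistic design (plain Python, no Dispersy; the swarm identifiers and sizes are made up) illustrates why duplicate checks are accepted in exchange for simpler code:

```python
import random

SWARMS = ['swarm_%d' % i for i in range(162)]       # hypothetical swarm identifiers
NUM_NODES, FANOUT, STEPS = 10, 3, 120

known = [dict() for _ in range(NUM_NODES)]           # node -> {swarm: size}
recent = [[] for _ in range(NUM_NODES)]              # node -> most recent checks

for step in range(STEPS):                            # one step ~ one second
    for node in range(NUM_NODES):
        swarm = random.choice(SWARMS)                # random check, duplicates allowed
        size = random.randint(0, 100)                # stand-in for a real swarm check
        known[node][swarm] = size
        recent[node] = (recent[node] + [(swarm, size)])[-10:]
        for neighbour in random.sample(range(NUM_NODES), FANOUT):
            known[neighbour].update(dict(recent[node]))   # gossip the last 10 checks

print('unique swarms known per node:', [len(k) for k in known])
```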

Next steps:

MaChengxin commented 7 years ago

The figure (attached) shows how many checks one node has to make in order to cover all the torrents using a random pick-up policy (with replacement). It takes about 800-900 checks to cover 162 swarms.

More generally, if m denotes the number of torrents, then the expected number of checks required for total coverage is m*Hm, where Hm is the m-th harmonic number (this is the classic coupon collector problem). This gives us an impression of the order of magnitude of the required checks. In this experiment, m is 162, so m*Hm is about 918. The experimental result is close to the mathematical expectation.
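
The quoted expectation is easy to verify numerically (this is just the coupon collector formula, not part of the experiment code):

```python
m = 162
harmonic = sum(1.0 / k for k in range(1, m + 1))
print(round(m * harmonic))   # ~918 expected random checks to cover all 162 swarms
```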

We could improve this simple swarm selection policy with the following strategies:

The new architecture looks like this (see diagram):

Description: The basic idea is to keep two pools of swarms. One contains swarms to check (the running pool), and the other one (the standby pool) contains the checked ones. When the running pool becomes empty, we move the swarms in the standby pool back to it.

So for a certain swarm, it will first be selected from the running pool and its size will be measured. After this, a message ({swarm: size}) will be created. This message will go to a message handler. This handler does three things: store the info locally, gossip it to other nodes, and move the swarm from the running pool to pool 1 in the standby pool.

Pool 2 in the standby pool is responsible for receiving info about swarms measured by others. Upon receiving a new message {swarm: size}, the node will store the info and put the swarm into pool 2 in the standby pool. (So when selecting a swarm to check from the running pool, we will need to first check whether it is already in the standby pool.)

Bootstrapping: Initially, the two pools are both empty. A node will first load swarms known by itself to the running pool. Then the flow described above can start working.

Steady state: Since a node will measure the swarm sizes by itself and also receive such info from others, it will eventually know all the swarms in the ecosystem (e.g. Tribler). One interesting question is: if every node uses the same strategy, what is the expected time needed to get every swarm measured at least once? I've already abstracted this question in a mathematical way: https://math.stackexchange.com/questions/2351742/expected-number-of-steps-to-walk-through-points-by-multiple-walkers

@synctext What do you think of this design? Do you see any design flaws?

devos50 commented 7 years ago

@MaChengxin Please take a look at our existing torrent checker solution: https://github.com/Tribler/tribler/tree/devel/Tribler/Core/TorrentChecker. This check simply queries the tracker for the size of the swarm. It is compatible with both HTTP and UDP trackers (it also supports DHT queries).

MaChengxin commented 7 years ago

This is the new design for checking swarms (see diagram). The dashed rounded square contains the checking module. A swarm is picked randomly (without replacement) from the checking queue and, according to its health, put into one of three lists: healthy swarms, unhealthy swarms, and dead swarms. When the checking queue is empty, swarms in the first two lists (healthy and unhealthy) are moved back to the checking queue to start a new round of checking. (Criteria for defining the health of a swarm are not yet determined, but basically the larger the swarm, the healthier it is.)

Outside the dashed rounded square are modules for communication, i.e. sending local measurement results to other nodes and receiving remote results from them. The remote results are further processed and put into the three swarm lists (healthy, unhealthy, and dead).

Conflicting categorization: what if a swarm is said to be healthy by one node but unhealthy by another? The simplest solution is to ignore the conflict and put it into both categories. Before being moved to the checking queue, duplicated swarms in the two lists shall be removed.

synctext commented 7 years ago

Would suggest to remove the checking of dead swarms. Additionally, split the checking of healthy and unhealthy swarms. Use a three-strikes algorithm for swarms: a swarm is declared dead and will never be checked again if zero seeders and zero leechers are found three times, with at least 24h between measurements.
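
A minimal sketch of such a three-strikes rule (illustrative names, not Tribler's TorrentChecker API; it also assumes that any observed activity resets the strike count):

```python
import time

DAY = 24 * 60 * 60

def update_strikes(record, seeders, leechers, now=None):
    """record: dict holding 'strikes' and 'last_zero_check'; returns True once dead."""
    now = now if now is not None else time.time()
    if seeders == 0 and leechers == 0:
        # Only count a strike if the previous zero result was at least 24h ago.
        if now - record.get('last_zero_check', 0) >= DAY:
            record['strikes'] = record.get('strikes', 0) + 1
            record['last_zero_check'] = now
    else:
        record['strikes'] = 0            # any seeder or leecher resets the strikes
    return record.get('strikes', 0) >= 3
```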

MaChengxin commented 7 years ago

> Would suggest to remove the checking of dead swarms.

As shown in the figure, only healthy and unhealthy swarms are put back in the checking queue. The dead swarms list, colored in red, is like a black hole from which no swarm can escape.

> Additionally, split the checking of healthy and unhealthy swarms.

It also seems to work if I check the healthy and unhealthy swarms together and put different tags on them after checking, indicating their difference.

MaChengxin commented 7 years ago

This figure (attached) shows the experiment results of the initially proposed architecture (with one checking queue and two standby pools).

Experiment setup:
- Number of swarms to check: 162
- Number of nodes: 10

The results are:
- Checks done by each node: around 40
- Local unique swarms: 35
- Total unique swarms: 159
- Time: around 2 minutes
- Turning point of growth of unique swarms: around 65 seconds
- Corresponding number of checks: 65/120 * 400 = 217, i.e. around 22 per node

The conclusion is that using 10 nodes, each of them only needs to check about 22 swarms to cover (almost) all the swarms. Coverage efficiency: 159/217 * 100% = 73.3% (coverage efficiency is defined as the number of unique swarms divided by the total number of checks).

If there were only one node, the theoretical coverage efficiency would be 100% (namely, no duplicate checks). However, checking by one node would be time-consuming: by the time 10 nodes have covered almost all the swarms (159), a single node has checked only around 25 swarms.

Therefore, we can sacrifice some coverage efficiency to achieve high speed of checking.

synctext commented 7 years ago

First task: spend a few days writing the problem description chapter and the intro of the thesis. (Possible storyline: numerous scientific papers require content popularity as critical input. We want a distributed YouTube. However, determining global state in a distributed system is hard. Very few designs have been proposed for this problem.)

Simple model:

Some simple parameters:

Code:

devos50 commented 7 years ago

You can use one of the Leaseweb servers for testing purposes; we should disable the exit nodes running on these servers for one or two days. Technical specifications of one of the servers: https://gyazo.com/5eb8aa07d0a767afc7586acdc787ac7f (I would say you can run up to 24 Dispersy instances on one machine, but you could try to scale this up using system load statistics).

MaChengxin commented 7 years ago

> Filling an entire UDP packet (1480 Bytes on Ethernet) with swarm checks (20 Bytes SHA1, 2 Bytes seeders, 2 Bytes leechers, 2 Bytes how many seconds ago the check was done) {1480 / (20+2+2+2) = 56 entries}

According to Wikipedia, the maximum data length of a UDP packet is 65,507 bytes. Where does the 1480 bytes come from?

The advantage of filling an entire UDP packet is that it reduces the cost of communication. However, this also means some data will not be sent out until a full UDP packet is filled. The consequence of such a delay might be that other nodes do a double check because they don't know someone has already done the task. I will first try the eager-sharing strategy (share a result as soon as it is generated) and see whether the communication cost is acceptable.

> Share results of various last checks with a semi-fixed group of peers every 5 seconds.

Can we just gossip the last checks as soon as we have the results instead of forming a group and sharing the info within the group?

qstokkink commented 7 years ago

1480 bytes: https://stackoverflow.com/a/10239583
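
The 26-byte-per-check packing quoted above is easy to reproduce with struct (an illustration of the arithmetic, not Dispersy's actual wire format):

```python
import struct

CHECK_FORMAT = '!20sHHH'                       # infohash, seeders, leechers, age (s)
CHECK_SIZE = struct.calcsize(CHECK_FORMAT)     # 26 bytes
print(1480 // CHECK_SIZE)                      # -> 56 checks per Ethernet-sized payload

def pack_checks(checks):
    """checks: list of (infohash_bytes, seeders, leechers, age_seconds) tuples."""
    return b''.join(struct.pack(CHECK_FORMAT, *c) for c in checks[:56])
```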

MaChengxin commented 7 years ago

Link to my thesis: https://github.com/MaChengxin/msc-thesis It is currently so badly written that only I myself understand what it is talking about. I will improve its readability.

synctext commented 7 years ago

Comments on first thesis draft

synctext commented 7 years ago

Possible first sentence of chapter 2: Our aim is to create a media system which is not controlled by a single commercial entity. Content discovery, browsing and search are critical for any media system. A key part of that is content popularity. The front page of a newspaper contains the most important news. In the digital age it is difficult to determine what is the most popular website, YouTube clip, tweet, blog post, Wikipedia edit or discussion comment, especially in a distributed system (see the screenshot of the most popular tweet). 3,693,759,511 views: https://www.youtube.com/watch?time_continue=5&v=kJQP7kiw5Fk

Possible Chapter 3: Within this thesis we expanded an Internet-deployed open source distributed media system with content popularity. We focus on video because of the popularity of Youtube, Netflix, and Bittorrent.
- 3.1 The Tribler social media system
- 3.2 Content discovery and popularity in Bittorrent
- 3.3 Measuring popularity and attacks
- 3.5: Create own chapter for this design and measurement work. Pearson correlation coefficient for 162 swarms?

Basic graphs of content popularity gossip community. Keep to the default Gumby plots. Screenshots of Android. Plots of Android. thesis DONE.

MaChengxin commented 7 years ago

ch4.pdf Up-to-date Chapter 4

To be discussed: outline of Chapter 5; experiments to be included in Chapter 5. Current outline:

For Chapters 1 to 3, I think it's better to revise them after finishing Chapters 4 and 5, so that I will know which aspects to emphasize.

synctext commented 7 years ago

Comments:

synctext commented 6 years ago

First prototype deployment, moving work to new issue.