first startup task:
I found that the official libtorrent tutorial is a good reference, though the documentation is not entirely consistent with the code. I am running the sample code in nix-shell with libtorrent-rasterbar 1.1.1. (The one in Ubuntu's package repository is a bit old. I could also build it myself, but the build process looks complicated according to the docs.) The sample code is here. It only has downloading functionality and prints the status every second. A .torrent file is used as the command-line input. I am wondering what you mean by "downloads a single piece of this swarm". Does that mean downloading a piece of the content that the swarm is sharing, or the info about the peers in the swarm? I am also confused about "print total swarm piece availability". Does it mean printing the availability of every piece of the content?
Task almost finished. In the screenshot: download progress, number of peers, number of seeders, number of leechers, and total swarm piece availability. Reference: https://github.com/Tribler/tribler/blob/90b85b55e80a2e6713dea85519a5ee29627ce9ea/Tribler/Core/Libtorrent/LibtorrentDownloadImpl.py
Can we obtain accurate statistics from libtorrent? We want the raw number of packets and their lengths. Sadly, peer_info.total_download only counts payload bytes: "the total number of bytes downloaded from this peer. These numbers do not include the protocol chatter, but only the payload data".
There are session statistics with net.sent_payload_bytes and net.sent_ip_overhead_bytes.
It seems accurate message counters are kept:
name | type
---|---
ses.num_incoming_choke | counter
ses.num_incoming_unchoke | counter
ses.num_incoming_have | counter
ses.num_incoming_bitfield | counter
ses.num_incoming_request | counter
ses.num_incoming_piece | counter
ses.num_incoming_pex | counter
ses.num_incoming_metadata | counter
ses.num_incoming_extended | counter
ses.num_outgoing_choke | counter
ses.num_outgoing_unchoke | counter
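These counters can be read by requesting a session stats snapshot. A minimal sketch, assuming the libtorrent 1.1+ Python bindings where `session_stats_alert` exposes a `values` dict mapping metric names to counter values:

```python
# Sketch: periodically poll libtorrent's session counters for the
# message statistics listed above. Assumes the Python bindings expose
# session_stats_alert.values as a dict (metric name -> cumulative value).
import libtorrent as lt

ses = lt.session()
# ... add one or more torrents to the session here ...

WATCHED = ['ses.num_incoming_choke', 'ses.num_incoming_unchoke',
           'ses.num_incoming_have', 'ses.num_incoming_bitfield',
           'ses.num_incoming_request', 'ses.num_incoming_piece',
           'net.sent_payload_bytes', 'net.sent_ip_overhead_bytes']

for _ in range(60):                      # poll once per second for a minute
    ses.post_session_stats()             # ask libtorrent to emit a session_stats_alert
    ses.wait_for_alert(1000)             # wait up to 1 s for alerts to arrive
    for alert in ses.pop_alerts():
        if isinstance(alert, lt.session_stats_alert):
            stats = alert.values         # dict of all counters and gauges
            for name in WATCHED:
                print(name, stats.get(name, 0))
```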
The next step would be to determine the exact number of packets and bytes for downloading various Ubuntu images: the whole chain from magnet link to the first downloaded piece.
The goal is single-byte-accurate statistics. Then we can calculate optimal strategies for connecting to swarms, sharing the swarm size, handshaking, and crawling.
http://torrent.ubuntu.com:6969/file?info_hash=%2B%90%24%1F%8E%95%D5%3C%AC%F0%F8%C7%2A%92%E4kW%91%16%00
http://torrent.ubuntu.com:6969/file?info_hash=c%2B%8Dw%2A%90I%BD%27j%20%10%B2Q%B1J%5Cd%FC%DC
Final note, swarm size is not equal to the number of people that have seen this online video. Media consumption (content popularity) or downloads over time are very hard to estimate or measure. You would need to know the average in-swarm time, average downloading time and age of swarm to calculate the media consumption.
> The goal is single-byte-accurate statistics. Then we can calculate optimal strategies for connecting to swarms, sharing the swarm size, handshaking, and crawling.
Can you please elaborate on how optimal strategies can be calculated based on "single-byte accurate statistics"?
Meanwhile, I was reading some literature about how to get (near) real-time info on the number of seeders and leechers and came across "DHT crawling". One reference is a presentation found here: https://www.defcon.org/images/defcon-18/dc-18-presentations/Wolchok/DEFCON-18-Wolchok-Crawling-Bittorrent-DHTS.pdf The overview is in the screenshot. It also shows the popularity of a specific piece of content over time. I am wondering if the "DHT crawling" approach is in line with what you have in mind for determining swarm size. If yes, I will read more about the topic to get more insight.
Found a BitTorrent DHT crawler example (also in Python) on GitHub: https://github.com/nitmir/btdht-crawler Will try this first to see how seeder/leecher info is retrieved from the DHT.
Two more references: https://github.com/dontcontactme/simDHT https://github.com/blueskyz/DHTCrawler
-- Update -- Sun 16 Apr 2017 21:32:22 CEST -- simDHT just prints the infohash and the seeder's IP address and port number. It shows nothing about the number of seeders and leechers. DHTCrawler shows the infohash and the "popularity" (as claimed in its README) of the torrent, so this one could be helpful.
@synctext I have some doubts before proceeding with the experiment. Our ultimate goal is to fetch info about the number of seeders and leechers (namely the swarm size) from the DHT to show the real-time popularity of torrents. Naturally, one possible solution path would be:
The program made in the first start-up task can only show the number of seeders and leechers that the peer is currently connected to. It has no knowledge of how large the total swarm is. So I prefer to solve this problem first. Two questions:
Indeed, crawling or seeing the whole swarm is difficult. We can only partially see swarms. Then we can measure various swarms. Then we can see the relative size of various swarms. With the age + current swarm size we estimate popularity.
please ignore DHT crawling, too much spam.
Current progress: I tried with a Raspbian ISO torrent instead of an Ubuntu one (will change it later). Now the program stops as soon as some pieces are downloaded, but it may not be exactly one piece (could be two or more). From the downloading process, we can get the following info from the torrent status. It is likely that the overhead can be calculated by subtracting the total payload download from the total download. The following screenshot shows what can be acquired.
It still requires some effort to figure out which pieces of info matter, and why the total download and total payload download are 0 when the number of downloaded pieces is 2.
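For reference, a minimal sketch (assuming libtorrent's Python bindings) of the overhead calculation from a torrent_status snapshot:

```python
# Sketch: estimate protocol overhead for one torrent as the difference
# between total bytes downloaded and payload bytes downloaded.
import libtorrent as lt

def protocol_overhead(handle):
    """Return (total, payload, overhead) bytes downloaded so far for a torrent_handle."""
    st = handle.status()                 # torrent_status snapshot
    total = st.total_download            # payload + protocol bytes this session
    payload = st.total_payload_download  # piece data only
    return total, payload, total - payload

# usage: total, payload, overhead = protocol_overhead(torrent_handle)
```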
Succeeded in retrieving the statistics (see the session stats in the screenshot), but failed to interpret them.
See this issue for more info: https://github.com/arvidn/libtorrent/issues/1946
Next step is to get this operational and count the exact number of packets + their type. Goal is to measure all packets starting from a magnet link until exactly the first block is completed.
When this information from Libtorrent is combined with the packet length we get a sufficiently accurate picture. Experimental results for thesis: do this for 100-ish legal magnet links (e.g. magnet links from all Ubuntu torrents + other sources).
After some effort it is finally possible to interpret the session statistics. The most relevant commit is: https://github.com/MaChengxin/expBT/commit/d0ef388aa99dbea6248292601a0ada6b71840492 In this commit, BitTorrent messages of various types are counted. According to Understanding BitTorrent: An Experimental Perspective, message types (described in pg 13-14) in BitTorrent version 4.0x include:
Type | Size (bytes) |
---|---|
HANDSHAKE (HS) | 68 |
KEEP ALIVE (KA) | 4 |
CHOKE (C) | 5 |
UNCHOKE (UC) | 5 |
INTERESTED (I) | 5 |
NOT INTERESTED (NI) | 5 |
HAVE (H) | 9 |
BITFIELD (BF) | upper((# of pieces)/8)+5 |
REQUEST (R) | 17 |
PIECE (P) | 2^14+13 (if the block size is 2^14) |
CANCEL (CA) | 17 |
Most of them can be measured using libtorrent, as we can see in the commit above.
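For illustration, a small sketch (not the author's code) that combines such message counters with the sizes in the table above to estimate the protocol bytes exchanged; the counter values and the number of pieces below are made up:

```python
# Sketch: estimate protocol bytes from message counts and per-message sizes.
# Block size is assumed to be 2**14; BITFIELD size depends on the piece count.
import math

MSG_SIZE = {
    'handshake': 68, 'keep_alive': 4, 'choke': 5, 'unchoke': 5,
    'interested': 5, 'not_interested': 5, 'have': 9,
    'request': 17, 'cancel': 17, 'piece': 2**14 + 13,
}

def bitfield_size(num_pieces):
    return math.ceil(num_pieces / 8) + 5

def estimate_bytes(counts, num_pieces):
    """counts: dict mapping message type -> number of messages observed."""
    total = sum(MSG_SIZE[m] * n for m, n in counts.items() if m in MSG_SIZE)
    total += bitfield_size(num_pieces) * counts.get('bitfield', 0)
    return total

# example with made-up counters for a single swarm check
print(estimate_bytes({'handshake': 4, 'bitfield': 4, 'have': 40,
                      'request': 16, 'piece': 16}, num_pieces=3000))
```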
The next steps would be to:
The experiment is running now using torrent files instead of magnet links (temporarily, otherwise it would be too slow). Maybe we need to add a timing function for each attempt to download the first piece? Time is also an important factor IMO.
true. time for dead torrent filtering also
Time measurement is done: https://github.com/MaChengxin/expBT/commit/5a8f3293d1daafb2cfc1b72552bd08d3662e535f After running the experiment, I have a folder containing JSON files, each of them holding the stats for a single torrent. I am wondering how we can make use of these stats to draw meaningful conclusions. What info do we expect to retrieve from the data?
First trial of stats analysis: plot the histogram of the download time of all the Ubuntu images.
We see that most torrents can download the first piece within 10 seconds.
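For reference, a minimal sketch of how such a histogram could be produced from the per-torrent JSON files; the `download_time` field name and the `stats/` directory are assumptions about the data layout:

```python
# Sketch: plot a histogram of the time needed to download the first piece,
# read from one JSON stats file per torrent.
import glob
import json
import matplotlib.pyplot as plt

times = []
for path in glob.glob('stats/*.json'):
    with open(path) as f:
        times.append(json.load(f)['download_time'])  # seconds to first piece

plt.hist(times, bins=20)
plt.xlabel('Time to first piece (s)')
plt.ylabel('Number of torrents')
plt.savefig('first_piece_times.png')
```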
Update: reduced download time for each torrent by changing the way of checking pieces from time-driven to event-driven.
Personal repository for the experiments: https://github.com/MaChengxin/expBT
Just noticed that there is a page describing this issue with many details: https://www.tribler.org/SwarmSize/
Real swarm measurements. Roughly 15 KByte of cost for sampling a swarm (also received bytes?). Uses magnet links only. 160 Ubuntu swarms crawled. Experiments are serial: only 1 swarm at a time, with a 500-second timeout per swarm, or abort when the first piece is completed.
It seems to work! Further polish the plots made during the measurements (many more here). The next step is to find more magnet links and run a detailed measurement of the correlation.
Thoughts: downloading 1 piece is good correlation. How good is downloading 16 pieces?
Does our content popularity estimation get more accurate if you spend more bandwidth and time within a swarm?
Dispersy community design sketch. Create a community in which every second a peer starts a new content popularity check. The outcome of this check is shared across the Dispersy overlay with 10 neighbors (fanout). That's it. Each peer in the network now obtains UDP packets with popularity stats on several swarms. Future steps: align with the existing TorrentCollecting in Tribler, AllChannel, and channels in general for spam prevention.
The design sketch of the Dispersy community is almost done: https://github.com/MaChengxin/tribler/tree/swarm_size_community/Tribler/community/swarmsize Currently a swarm size detector simulator is used to mimic the behavior of the real one, because the real one still needs some polishing-up work. Once it is done it can easily replace the fake one.
Result of the experiment with the swarm size detector simulator: https://jenkins.tribler.org/job/pers/job/SwarmSizeCommunity_Chengxin/32/artifact/output/localhost/node321/00000.out/*view*/ The nodes in the community are capable of receiving the results measured by other nodes. Note that the results are generated randomly, and will eventually be replaced by actual data.
My task has three parts:
1. Distribute the entire work to different nodes (i.e. divide thousands of torrents into smaller sets and assign each set of torrents to nodes)
The questions are:
Solution: no scheduling, division of work or result aggregation.
Proposed architecture in above design sketch is that each node just randomly checks 1 swarm per second. A total of 10 random content popularity checks are then shared with random neighbors. With this probabilistic approach it is highly likely that duplicate checks are conducted, however the code is simplified.
Next steps:
This figure shows how many checks one node has to make in order to cover all the torrents using a random pick-up policy (with replacement). It takes about 800 - 900 checks to cover 162 swarms.
More generally, if m denotes the number of torrents, then the expected number of checks required for total coverage is m · H_m, where H_m is the m-th harmonic number. This gives an impression of the order of magnitude of the required number of checks. In this experiment m is 162, so m · H_m is about 918. The experimental result is close to the mathematical expectation.
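A quick computation of this expectation (the standard coupon-collector result):

```python
# Expected number of random checks (with replacement) to cover all m swarms.
m = 162
H_m = sum(1.0 / i for i in range(1, m + 1))   # m-th harmonic number
print(m * H_m)                                # ~918, matching the ~800-900 checks observed
```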
We could improve this simple swarm selection policy with the following strategies:
The new architecture looks like this:
Description: The basic idea is to keep two pools of swarms. One contains swarms to check (the running pool), and the other one (the standby pool) contains the checked ones. When the running pool becomes empty, we move the swarms in the standby pool back to it.
So a certain swarm will first be selected from the running pool and its size will be measured. After this, a message ({swarm: size}) will be created. This message goes to a message handler. The handler does three things: store the info locally, gossip it to other nodes, and move the swarm from the running pool to pool 1 of the standby pool.
Pool 2 of the standby pool is responsible for receiving info about swarms measured by others. Upon receiving a new message {swarm: size}, the node stores the info and puts the swarm into pool 2 of the standby pool. (So when selecting a swarm to check from the running pool, we first need to check whether it is already in the standby pool.)
Bootstrapping: Initially, the two pools are both empty. A node will first load swarms known by itself to the running pool. Then the flow described above can start working.
Steady state: Since a node will measure swarm sizes by itself and also receive such info from others, it will eventually know all the swarms in the ecosystem (e.g. Tribler). One interesting question is: if every node uses the same strategy, what is the expected time needed to get every swarm measured at least once? I have already abstracted this question mathematically: https://math.stackexchange.com/questions/2351742/expected-number-of-steps-to-walk-through-points-by-multiple-walkers
@synctext What do you think about this design? Do you see any design flaws?
@MaChengxin Please take a look at our existing torrent checker solution: https://github.com/Tribler/tribler/tree/devel/Tribler/Core/TorrentChecker. This check simply queries the tracker for the size of the swarm. It is compatible with both HTTP and UDP trackers (it also supports DHT queries).
This is the new design for checking swarms. The dashed rounded square contains the checking module. A swarm is picked randomly (without replacement) from the checking queue and put into one of three lists according to its healthiness: healthy swarms, unhealthy swarms, and dead swarms. When the checking queue is empty, the swarms in the first two lists (healthy and unhealthy) are moved back to the checking queue to start a new round of checking. (The criteria for defining the healthiness of a swarm are not yet determined, but basically the larger the swarm, the healthier it is.) Outside the dashed rounded square are the modules for communication, i.e. sending local measurement results to other nodes and receiving remote results from them. The remote results are further processed and put into the three swarm lists (healthy, unhealthy, and dead). Conflicting categorization: what if a swarm is said to be healthy by one node but unhealthy by others? The simplest solution is to ignore the conflict and put it into both categories; before being moved to the checking queue, duplicate swarms in the two lists are removed.
I would suggest removing the checking of dead swarms. Additionally, split the checking of healthy and unhealthy swarms. Use a three-strikes algorithm for swarms: a swarm is declared dead and will never be checked again if zero seeders and zero leechers are found three times, with at least 24 hours between measurements.
> I would suggest removing the checking of dead swarms.
As shown in the figure, only healthy and unhealthy swarms will be put back into the checking queue. The dead swarms, colored in red, form a black hole from which no swarm can escape.
> Additionally, split the checking of healthy and unhealthy swarms.
It also seems to work if I check the healthy and unhealthy swarms together and put different tags on them after checking to indicate the difference.
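A minimal sketch of this combined policy (one checking queue, health tags, and the suggested three-strikes rule); `check_swarm` is a placeholder for the real tracker/DHT check, and the health criterion used here is an assumption:

```python
# Sketch: healthy and unhealthy swarms are re-queued, dead swarms never return,
# and a swarm is only declared dead after three zero-seeder/zero-leecher checks
# spaced at least 24 hours apart.
import random
import time

DAY = 24 * 3600

def check_swarm(infohash):
    """Placeholder for the real swarm size check (tracker/DHT/handshake)."""
    raise NotImplementedError

class SwarmChecker:
    def __init__(self, infohashes):
        self.queue = list(infohashes)        # swarms waiting to be checked
        self.healthy, self.unhealthy, self.dead = [], [], []
        self.strikes = {}                    # infohash -> (strike count, last strike time)

    def check_next(self):
        if not self.queue:                   # start a new round; dead swarms are excluded
            self.queue = self.healthy + self.unhealthy
            self.healthy, self.unhealthy = [], []
        infohash = self.queue.pop(random.randrange(len(self.queue)))
        seeders, leechers = check_swarm(infohash)
        if seeders == 0 and leechers == 0:
            count, last = self.strikes.get(infohash, (0, 0.0))
            if time.time() - last >= DAY:    # at most one strike per 24 hours
                count, last = count + 1, time.time()
                self.strikes[infohash] = (count, last)
            if count >= 3:
                self.dead.append(infohash)   # three strikes: never checked again
            else:
                self.unhealthy.append(infohash)
        else:
            self.strikes.pop(infohash, None)                          # activity resets strikes
            target = self.healthy if seeders > 0 else self.unhealthy  # placeholder criterion
            target.append(infohash)
```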
This figure shows the experiment result of the initially proposed architecture (with one checking queue and two standby pools).
Experiment setup:
- Number of swarms to check: 162
- Number of nodes: 10

The results are:
- Checks done by each node: around 40
- Local unique swarms: 35
- Total unique swarms: 159
- Time: around 2 minutes
- Turning point of the growth of unique swarms: around 65 seconds
- Corresponding number of checks: 65/120 * 400 ≈ 217, i.e. around 22 per node
The conclusion is that, using 10 nodes, each of them only needs to check around 22 swarms to cover (almost) all the swarms. Coverage efficiency: 159/217 * 100% = 73.3% (coverage efficiency is defined as the number of unique swarms divided by the total number of checks).
If there were only one node, the theoretical coverage efficiency would be 100% (namely no duplicate checks). However, checking by one node would be time-consuming: by the time 10 nodes have covered almost all the swarms (159), a single node has only checked around 25 swarms.
Therefore, we can sacrifice some coverage efficiency to achieve high speed of checking.
First task: spend a few days writing the problem description chapter of the thesis and the intro. (Possible storyline: numerous scientific papers require content popularity as critical input. We want a distributed YouTube. However, determining distributed state in a distributed system is hard. Very few designs have been proposed for this problem.)
Simple model:
- Time between swarm checks
- Cost in bytes for an average swarm size check
- Number of peers you are sharing swarm size data with
- Size in bytes of a UDP packet when sharing N swarm size checks
- Total amount of swarms in the known universe / the universe we check regularly
- Cycle time: the time after which a swarm will be re-checked for content popularity
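A back-of-the-envelope sketch of how these parameters interact; all the numbers below are illustrative assumptions, not measurements:

```python
# Sketch of the simple model: cycle time and per-peer bandwidth as functions
# of the parameters above. The swarm and peer counts are assumptions.
check_interval = 5          # time between swarm checks per peer (s)
check_cost = 15 * 1024      # bytes for an average swarm size check (~15 KByte)
fanout = 10                 # peers you share swarm size data with
entries_per_packet = 56     # swarm checks that fit in a 1480-byte UDP packet
total_swarms = 50_000       # swarms in the universe we check regularly (assumed)
peers = 1_000               # peers participating in the community (assumed)

# Cycle time if peers check independently and duplicate checks are ignored:
cycle_time = total_swarms * check_interval / peers                    # seconds per full pass

# Per-peer bandwidth: one full packet to `fanout` neighbors once it fills,
# plus the cost of the peer's own checks.
gossip_rate = 1480 * fanout / (entries_per_packet * check_interval)   # bytes/s
check_rate = check_cost / check_interval                              # bytes/s

print(cycle_time, gossip_rate, check_rate)   # e.g. 250 s, ~53 B/s, ~3 KB/s
```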
Some simple parameters:
hash_check
Code:
You can use one of the Leaseweb servers for testing purposes; we should disable the exit nodes running on these servers for one or two days. Technical specifications of one of the servers: https://gyazo.com/5eb8aa07d0a767afc7586acdc787ac7f (I would say you can run up to 24 Dispersy instances on one machine, but you could try to scale this up using system load statistics).
Filling an entire UDP packet (1480 bytes on Ethernet) with swarm checks (20-byte SHA1, 2 bytes seeders, 2 bytes leechers, 2 bytes for how many seconds ago the check was done): 1480 / (20+2+2+2) = 56 entries.
According to Wikipedia, the maximum data length of a UDP packet is 65507 bytes. Where does the 1480 bytes come from?
The advantage of filling an entire UDP packet is that it reduces the cost of communication. However, this also means some data will not be sent out until a full UDP packet is filled. The consequence of such a delay might be that other nodes do a duplicate check because they don't know someone has already done the task. I will first try the eager-sharing strategy (share a result as soon as it is generated) and see if the communication cost is acceptable or not.
Share results of various last checks with a semi-fixed group of peers every 5 seconds.
Can we just gossip the last checks as soon as we have the results instead of forming a group and sharing the info within the group?
1480 bytes: https://stackoverflow.com/a/10239583
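A minimal sketch of how such a packet could be packed and unpacked with fixed 26-byte entries, using Python's struct module; the big-endian field layout is an assumption:

```python
# Sketch: pack swarm checks into a single 1480-byte UDP payload.
# Each entry: 20-byte SHA1 infohash, 2 bytes seeders, 2 bytes leechers,
# 2 bytes for how many seconds ago the check was done.
import struct

ENTRY = struct.Struct('>20sHHH')        # 26 bytes per swarm check
MAX_ENTRIES = 1480 // ENTRY.size        # 56 entries fit in one packet

def pack_checks(checks):
    """checks: list of (infohash_bytes, seeders, leechers, age_seconds) tuples."""
    return b''.join(ENTRY.pack(*c) for c in checks[:MAX_ENTRIES])

def unpack_checks(payload):
    return [ENTRY.unpack(payload[i:i + ENTRY.size])
            for i in range(0, len(payload), ENTRY.size)]
```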
Link to my thesis: https://github.com/MaChengxin/msc-thesis It is currently so badly written that only I myself understand what it is talking about. I will improve its readability.
Comments on first thesis draft
Determining distributed state in a distributed system is hard.
Our scientific goal is to create a content popularity estimation algorithm without any central entity, controlling server, or organisational embedding. Key related work is the DAO investment organisation, which does not contain any central element but was vulnerable to attack due to multiple programming bugs. Attack-resilience is also our key concern: an attacker should not be able to easily influence the content popularity. As academic studies have shown, attacks on the media are real. With current social media technology, fake news is shown to be substantial. Fraud with media systems has been studied extensively and fake accounts have been detected. Our aim is to devise open source algorithms and open innovation processes for distinguishing honest entities from attackers.

Possible first sentence of chapter 2: Our aim is to create a media system which is not controlled by a single commercial entity. Content discovery, browsing and search are critical for any media system. A key part of that is content popularity. The front page of a newspaper contains the most important news. In the digital age it is difficult to determine what is the most popular website, YouTube clip, tweet, blog post, Wikipedia edit, or discussion comment, especially in a distributed system. See the most viewed video, with 3,693,759,511 views: https://www.youtube.com/watch?time_continue=5&v=kJQP7kiw5Fk
Possible Chapter 3: Within this thesis we expanded an Internet-deployed open source distributed media system with content popularity. We focus on video because of the popularity of YouTube, Netflix, and BitTorrent.
3.1 The Tribler social media system
3.2 Content discovery and popularity in BitTorrent
3.3 Measuring popularity and attacks
3.5 Create an own chapter for this design and measurement work. Pearson correlation coefficient for 162 swarms?
Basic graphs of content popularity gossip community. Keep to the default Gumby plots. Screenshots of Android. Plots of Android. thesis DONE.
ch4.pdf Up-to-date Chapter 4
To be discussed: outline of Chapter 5; experiments to be included in Chapter 5 Current outline:
For Chapters 1 to 3, I think it is better to revise them after finishing Chapters 4 and 5, so that I will know which aspects to emphasize.
Comments:
> Determining distributed state in a distributed system is hard.

Then the sub-problem to describe is content popularity.
First prototype deployment, moving work to new issue.
The popularity of content is essential. Currently we do not have a method to determine whether a magnet link points to a large swarm or a dying swarm.
This work enables novel caching approaches, see "Efficient analysis of caching strategies under dynamic content popularity".
Each peer in the network must contact a swarm to determine its size. Checking 1 swarm every 5 seconds to conserve bandwidth means it takes many hours to check several thousand swarms. It is possible to share this information with your neighbors. By gossiping this swarm size information around, each swarm needs to be checked only once and bandwidth is saved.
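For example, at one check every 5 seconds, a single peer working through, say, 5,000 swarms would need 5,000 × 5 s = 25,000 s, roughly 7 hours, for one full pass.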
The swarm size community is a mechanism to collaboratively determine the size of swarms. Going from a magnet link to a swarm size involves numerous steps: resolving a magnet link means using the DHT, and checking a swarm with a BitTorrent handshake also takes numerous UDP packets and requires dealing with non-responsive IPv4 addresses.
Experiments:
Science: