first startup task:
I found that the official libtorrent tutorial is a good reference, though the documentation is not entirely consistent with the code. I am running the sample code in nix-shell with libtorrent-rasterbar 1.1.1. (The one in Ubuntu's package repository is a bit old. I could also build it myself, but the build process looks complicated according to the docs.) The sample code is here. It only has downloading functionality and prints the status every second. A .torrent file is used as the command-line input. I am wondering what you mean by "downloads a single piece of this swarm". Does that mean downloading a piece of the content that the swarm is sharing, or the info about the peers in the swarm? I am also confused about "print total swarm piece availability". Does it mean printing the availability of every piece of the content?
Task almost finished. In the screenshot: download progress, number of peers, number of seeders, number of leechers, and total swarm piece availability. Reference: https://github.com/Tribler/tribler/blob/90b85b55e80a2e6713dea85519a5ee29627ce9ea/Tribler/Core/Libtorrent/LibtorrentDownloadImpl.py
Can we obtain accurate statistics from libtorrent? We want the raw number of packets and their lengths. Sadly, peer_info.total_download only counts payload bytes: "the total number of bytes downloaded from this peer. These numbers do not include the protocol chatter, but only the payload data".
There are session statistics with net.sent_payload_bytes and net.sent_ip_overhead_bytes.
It seems accurate message counters are kept:
name | type
---|---
ses.num_incoming_choke | counter
ses.num_incoming_unchoke | counter
ses.num_incoming_have | counter
ses.num_incoming_bitfield | counter
ses.num_incoming_request | counter
ses.num_incoming_piece | counter
ses.num_incoming_pex | counter
ses.num_incoming_metadata | counter
ses.num_incoming_extended | counter
ses.num_outgoing_choke | counter
ses.num_outgoing_unchoke | counter
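These counters can be read by requesting a session stats snapshot. A minimal sketch, assuming the libtorrent 1.1+ Python bindings where `session_stats_alert` exposes a `values` dict mapping metric names to counter values:

```python
# Sketch: periodically poll libtorrent's session counters for the
# message statistics listed above. Assumes the Python bindings expose
# session_stats_alert.values as a dict (metric name -> cumulative value).
import libtorrent as lt

ses = lt.session()
# ... add one or more torrents to the session here ...

WATCHED = ['ses.num_incoming_choke', 'ses.num_incoming_unchoke',
           'ses.num_incoming_have', 'ses.num_incoming_bitfield',
           'ses.num_incoming_request', 'ses.num_incoming_piece',
           'net.sent_payload_bytes', 'net.sent_ip_overhead_bytes']

for _ in range(60):                      # poll once per second for a minute
    ses.post_session_stats()             # ask libtorrent to emit a session_stats_alert
    ses.wait_for_alert(1000)             # wait up to 1 s for alerts to arrive
    for alert in ses.pop_alerts():
        if isinstance(alert, lt.session_stats_alert):
            stats = alert.values         # dict of all counters and gauges
            for name in WATCHED:
                print(name, stats.get(name, 0))
```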
The next step would be to determine the exact number of packets and bytes for downloading various Ubuntu images: the whole chain from magnet link to the first downloaded piece.
The goal is single-byte-accurate statistics. Then we can calculate optimal strategies for connecting to swarms, sharing the swarm size, handshaking, and crawling.
http://torrent.ubuntu.com:6969/file?info_hash=%2B%90%24%1F%8E%95%D5%3C%AC%F0%F8%C7%2A%92%E4kW%91%16%00
http://torrent.ubuntu.com:6969/file?info_hash=c%2B%8Dw%2A%90I%BD%27j%20%10%B2Q%B1J%5Cd%FC%DC
Final note, swarm size is not equal to the number of people that have seen this online video. Media consumption (content popularity) or downloads over time are very hard to estimate or measure. You would need to know the average in-swarm time, average downloading time and age of swarm to calculate the media consumption.
> The goal is single-byte-accurate statistics. Then we can calculate optimal strategies for connecting to swarms, sharing the swarm size, handshaking, and crawling.
Can you please elaborate on how optimal strategies can be calculated based on "single-byte accurate statistics"?
Meanwhile, I was reading some literature about how to get (near) real-time info on the number of seeders and leechers and came across "DHT crawling". One reference is a presentation found here: https://www.defcon.org/images/defcon-18/dc-18-presentations/Wolchok/DEFCON-18-Wolchok-Crawling-Bittorrent-DHTS.pdf The overview is in the screenshot. It also shows the popularity of a specific piece of content over time. I am wondering if the "DHT crawling" approach is in line with what you have in mind for determining swarm size. If yes, I will read more about the topic to get more insight.
Found a BitTorrent DHT crawler example (also in Python) on GitHub: https://github.com/nitmir/btdht-crawler Will try this first to see how seeder/leecher info is retrieved from the DHT.
Two more references: https://github.com/dontcontactme/simDHT https://github.com/blueskyz/DHTCrawler
-- Update -- Sun 16 Apr 2017 21:32:22 CEST -- simDHT just prints the infohash and the seeder's IP address and port number. It shows nothing about the number of seeders and leechers. DHTCrawler shows the infohash and the "popularity" (as claimed in its README) of the torrent, so this one could be helpful.
@synctext I have some doubts before proceeding with the experiment. Our ultimate goal is to fetch info about the number of seeders and leechers (namely the swarm size) from the DHT to show the real-time popularity of torrents. Naturally, one possible solution path would be:
The program made in the first start-up task can only show the number of seeders and leechers that the peer is currently connected to. It has no knowledge of how large the total swarm is. So I prefer to solve this problem first. Two questions:
Indeed, crawling or seeing the whole swarm is difficult. We can only partially see swarms. Then we can measure various swarms. Then we can see the relative size of various swarms. With the age + current swarm size we estimate popularity.
please ignore DHT crawling, too much spam.
Current progress: I tried with a Raspbian ISO torrent instead of an Ubuntu one (will change it later). Now the program stops as soon as some pieces are downloaded, but it may not be exactly one piece (could be two or more). From the downloading process, we can get the following info from the torrent status. It is likely that the overhead can be calculated by subtracting the total payload download from the total download. The following screenshot shows what can be acquired.
It still requires some effort to figure out which pieces of info matter, and why the total download and total payload download are 0 when the number of downloaded pieces is 2.
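For reference, a minimal sketch (assuming libtorrent's Python bindings) of the overhead calculation from a torrent_status snapshot:

```python
# Sketch: estimate protocol overhead for one torrent as the difference
# between total bytes downloaded and payload bytes downloaded.
import libtorrent as lt

def protocol_overhead(handle):
    """Return (total, payload, overhead) bytes downloaded so far for a torrent_handle."""
    st = handle.status()                 # torrent_status snapshot
    total = st.total_download            # payload + protocol bytes this session
    payload = st.total_payload_download  # piece data only
    return total, payload, total - payload

# usage: total, payload, overhead = protocol_overhead(torrent_handle)
```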
Succeeded in retrieving the statistics (see the session stats in the screenshot), but failed to interpret them.
See this issue for more info: https://github.com/arvidn/libtorrent/issues/1946
Next step is to get this operational and count the exact number of packets + their type. Goal is to measure all packets starting from a magnet link until exactly the first block is completed.
When this information from Libtorrent is combined with the packet length we get a sufficiently accurate picture. Experimental results for thesis: do this for 100-ish legal magnet links (e.g. magnet links from all Ubuntu torrents + other sources).
After some effort it is finally possible to interpret the session statistics. The most relevant commit is: https://github.com/MaChengxin/expBT/commit/d0ef388aa99dbea6248292601a0ada6b71840492 In this commit, BitTorrent messages of various types are counted. According to Understanding BitTorrent: An Experimental Perspective, message types (described in pg 13-14) in BitTorrent version 4.0x include:
Type | Size (bytes) |
---|---|
HANDSHAKE (HS) | 68 |
KEEP ALIVE (KA) | 4 |
CHOKE (C) | 5 |
UNCHOKE (UC) | 5 |
INTERESTED (I) | 5 |
NOT INTERESTED (NI) | 5 |
HAVE (H) | 9 |
BITFIELD (BF) | upper((# of pieces)/8)+5 |
REQUEST (R) | 17 |
PIECE (P) | 2^14+13 (if the block size is 2^14) |
CANCEL (CA) | 17 |
Most of them can be measured using libtorrent, as we can see in the commit above.
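For illustration, a small sketch (not the author's code) that combines such message counters with the sizes in the table above to estimate the protocol bytes exchanged; the counter values and the number of pieces below are made up:

```python
# Sketch: estimate protocol bytes from message counts and per-message sizes.
# Block size is assumed to be 2**14; BITFIELD size depends on the piece count.
import math

MSG_SIZE = {
    'handshake': 68, 'keep_alive': 4, 'choke': 5, 'unchoke': 5,
    'interested': 5, 'not_interested': 5, 'have': 9,
    'request': 17, 'cancel': 17, 'piece': 2**14 + 13,
}

def bitfield_size(num_pieces):
    return math.ceil(num_pieces / 8) + 5

def estimate_bytes(counts, num_pieces):
    """counts: dict mapping message type -> number of messages observed."""
    total = sum(MSG_SIZE[m] * n for m, n in counts.items() if m in MSG_SIZE)
    total += bitfield_size(num_pieces) * counts.get('bitfield', 0)
    return total

# example with made-up counters for a single swarm check
print(estimate_bytes({'handshake': 4, 'bitfield': 4, 'have': 40,
                      'request': 16, 'piece': 16}, num_pieces=3000))
```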
The next steps would be to:
The experiment is running now using torrent files instead of magnet links (temporarily, otherwise it would be too slow). Maybe we need to add a timing function for each attempt to download the first piece? Time is also an important factor IMO.
true. time for dead torrent filtering also
Time measurement is done: https://github.com/MaChengxin/expBT/commit/5a8f3293d1daafb2cfc1b72552bd08d3662e535f After running the experiment, I have a folder containing JSON files, each of them holding the stats for a single torrent. I am wondering how we can make use of these stats to draw meaningful conclusions. What info do we expect to retrieve from the data?
First trial of stats analysis: plot the histogram of the download time of all the Ubuntu images.
We see that most torrents can download the first piece within 10 seconds.
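For reference, a minimal sketch of how such a histogram could be produced from the per-torrent JSON files; the `download_time` field name and the `stats/` directory are assumptions about the data layout:

```python
# Sketch: plot a histogram of the time needed to download the first piece,
# read from one JSON stats file per torrent.
import glob
import json
import matplotlib.pyplot as plt

times = []
for path in glob.glob('stats/*.json'):
    with open(path) as f:
        times.append(json.load(f)['download_time'])  # seconds to first piece

plt.hist(times, bins=20)
plt.xlabel('Time to first piece (s)')
plt.ylabel('Number of torrents')
plt.savefig('first_piece_times.png')
```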
Update: reduced download time for each torrent by changing the way of checking pieces from time-driven to event-driven.
Personal repository for the experiments: https://github.com/MaChengxin/expBT
Just noticed that there is a page describing this issue with many details: https://www.tribler.org/SwarmSize/
Real swarm measurements. Roughly 15 KByte of cost for sampling a swarm (also received bytes?). Uses magnet links only. 160 Ubuntu swarms crawled. Experiments are serial: only 1 swarm at a time, with a 500-second timeout per swarm, or abort when the first piece is completed.
It seems to work! Further polish the plots made during the measurements (many more here). The next step is to find more magnet links and run a detailed measurement of the correlation.
Thoughts: downloading 1 piece is good correlation. How good is downloading 16 pieces?
Does our content popularity estimation get more accurate if you spend more bandwidth and time within a swarm?
Dispersy community design sketch. Create a community in which every second a peer starts a new content popularity check. The outcome of this check is shared across the Dispersy overlay with 10 neighbors (fanout). That's it. Each peer in the network now obtains UDP packets with popularity stats on several swarms. Future steps: align with the existing TorrentCollecting in Tribler, AllChannel, and channels in general for spam prevention.
The design sketch of the Dispersy community is almost done: https://github.com/MaChengxin/tribler/tree/swarm_size_community/Tribler/community/swarmsize Currently a swarm size detector simulator is used to mimic the behavior of the real one, because the real one still needs some polishing-up work. Once it is done it can easily replace the fake one.
Result of the experiment with the swarm size detector simulator: https://jenkins.tribler.org/job/pers/job/SwarmSizeCommunity_Chengxin/32/artifact/output/localhost/node321/00000.out/*view*/ The nodes in the community are capable of receiving the results measured by other nodes. Note that the results are generated randomly, and will eventually be replaced by actual data.
My task has three parts:
1. Distribute the entire work to different nodes (i.e. divide thousands of torrents into smaller sets and assign each set of torrents to nodes)
The questions are:
Solution: no scheduling, division of work or result aggregation.
Proposed architecture in above design sketch is that each node just randomly checks 1 swarm per second. A total of 10 random content popularity checks are then shared with random neighbors. With this probabilistic approach it is highly likely that duplicate checks are conducted, however the code is simplified.
Next steps:
This figure shows how many checks one node has to make in order to cover all the torrents using a random pick-up policy (with replacement). It takes about 800 - 900 checks to cover 162 swarms.
More generally, if m denotes the number of torrents, then the expected number of checks required for total coverage is m · H_m, where H_m is the m-th harmonic number. This gives an impression of the order of magnitude of the required number of checks. In this experiment m is 162, so m · H_m is about 918. The experimental result is close to the mathematical expectation.
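A quick computation of this expectation (the standard coupon-collector result):

```python
# Expected number of random checks (with replacement) to cover all m swarms.
m = 162
H_m = sum(1.0 / i for i in range(1, m + 1))   # m-th harmonic number
print(m * H_m)                                # ~918, matching the ~800-900 checks observed
```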
We could improve this simple swarm selection policy with the following strategies:
The new architecture looks like this:
Description: The basic idea is to keep two pools of swarms. One contains swarms to check (the running pool), and the other one (the standby pool) contains the checked ones. When the running pool becomes empty, we move the swarms in the standby pool back to it.
So a certain swarm will first be selected from the running pool and its size will be measured. After this, a message ({swarm: size}) will be created. This message goes to a message handler. The handler does three things: store the info locally, gossip it to other nodes, and move the swarm from the running pool to pool 1 of the standby pool.
Pool 2 of the standby pool is responsible for receiving info about swarms measured by others. Upon receiving a new message {swarm: size}, the node stores the info and puts the swarm into pool 2 of the standby pool. (So when selecting a swarm to check from the running pool, we first need to check whether it is already in the standby pool.)
Bootstrapping: Initially, the two pools are both empty. A node will first load swarms known by itself to the running pool. Then the flow described above can start working.
Steady state: Since a node will measure swarm sizes by itself and also receive such info from others, it will eventually know all the swarms in the ecosystem (e.g. Tribler). One interesting question is: if every node uses the same strategy, what is the expected time needed to get every swarm measured at least once? I have already abstracted this question mathematically: https://math.stackexchange.com/questions/2351742/expected-number-of-steps-to-walk-through-points-by-multiple-walkers
@synctext What do you think about this design? Do you see any design flaws?
@MaChengxin Please take a look at our existing torrent checker solution: https://github.com/Tribler/tribler/tree/devel/Tribler/Core/TorrentChecker. This check simply queries the tracker for the size of the swarm. It is compatible with both HTTP and UDP trackers (it also supports DHT queries).
This is the new design for checking swarms. The dashed rounded square contains the checking module. A swarm is picked randomly (without replacement) from the checking queue and put into one of three lists according to its healthiness: healthy swarms, unhealthy swarms, and dead swarms. When the checking queue is empty, the swarms in the first two lists (healthy and unhealthy) are moved back to the checking queue to start a new round of checking. (The criteria for defining the healthiness of a swarm are not yet determined, but basically the larger the swarm, the healthier it is.) Outside the dashed rounded square are the modules for communication, i.e. sending local measurement results to other nodes and receiving remote results from them. The remote results are further processed and put into the three swarm lists (healthy, unhealthy, and dead). Conflicting categorization: what if a swarm is said to be healthy by one node but unhealthy by others? The simplest solution is to ignore the conflict and put it into both categories; before being moved to the checking queue, duplicate swarms in the two lists are removed.
I would suggest removing the checking of dead swarms. Additionally, split the checking of healthy and unhealthy swarms. Use a three-strikes algorithm for swarms: a swarm is declared dead and will never be checked again if zero seeders and zero leechers are found three times, with at least 24 hours between measurements.
> I would suggest removing the checking of dead swarms.
As shown in the figure, only healthy and unhealthy swarms will be put back into the checking queue. The dead swarms, colored in red, form a black hole from which no swarm can escape.
> Additionally, split the checking of healthy and unhealthy swarms.
It also seems to work if I check the healthy and unhealthy swarms together and put different tags on them after checking to indicate the difference.
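A minimal sketch of this combined policy (one checking queue, health tags, and the suggested three-strikes rule); `check_swarm` is a placeholder for the real tracker/DHT check, and the health criterion used here is an assumption:

```python
# Sketch: healthy and unhealthy swarms are re-queued, dead swarms never return,
# and a swarm is only declared dead after three zero-seeder/zero-leecher checks
# spaced at least 24 hours apart.
import random
import time

DAY = 24 * 3600

def check_swarm(infohash):
    """Placeholder for the real swarm size check (tracker/DHT/handshake)."""
    raise NotImplementedError

class SwarmChecker:
    def __init__(self, infohashes):
        self.queue = list(infohashes)        # swarms waiting to be checked
        self.healthy, self.unhealthy, self.dead = [], [], []
        self.strikes = {}                    # infohash -> (strike count, last strike time)

    def check_next(self):
        if not self.queue:                   # start a new round; dead swarms are excluded
            self.queue = self.healthy + self.unhealthy
            self.healthy, self.unhealthy = [], []
        infohash = self.queue.pop(random.randrange(len(self.queue)))
        seeders, leechers = check_swarm(infohash)
        if seeders == 0 and leechers == 0:
            count, last = self.strikes.get(infohash, (0, 0.0))
            if time.time() - last >= DAY:    # at most one strike per 24 hours
                count, last = count + 1, time.time()
                self.strikes[infohash] = (count, last)
            if count >= 3:
                self.dead.append(infohash)   # three strikes: never checked again
            else:
                self.unhealthy.append(infohash)
        else:
            self.strikes.pop(infohash, None)                          # activity resets strikes
            target = self.healthy if seeders > 0 else self.unhealthy  # placeholder criterion
            target.append(infohash)
```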
This figure shows the experiment result of the initially proposed architecture (with one checking queue and two standby pools).
Experiment setup:
- Number of swarms to check: 162
- Number of nodes: 10

The results are:
- Checks done by each node: around 40
- Local unique swarms: 35
- Total unique swarms: 159
- Time: around 2 minutes
- Turning point of the growth of unique swarms: around 65 seconds
- Corresponding number of checks: 65/120 * 400 ≈ 217, i.e. around 22 per node
The conclusion is that, using 10 nodes, each of them only needs to check around 22 swarms to cover (almost) all the swarms. Coverage efficiency: 159/217 * 100% = 73.3% (coverage efficiency is defined as the number of unique swarms divided by the total number of checks).
If there were only one node, the theoretical coverage efficiency would be 100% (namely no duplicate checks). However, checking by one node would be time-consuming: by the time 10 nodes have covered almost all the swarms (159), a single node has only checked around 25 swarms.
Therefore, we can sacrifice some coverage efficiency to achieve high speed of checking.
First task: spend a few days writing the problem description chapter of the thesis and the intro. (Possible storyline: numerous scientific papers require content popularity as critical input. We want a distributed YouTube. However, determining distributed state in a distributed system is hard. Very few designs have been proposed for this problem.)
Simple model:
- Time between swarm checks
- Cost in bytes for an average swarm size check
- Number of peers you are sharing swarm size data with
- Size in bytes of a UDP packet when sharing N swarm size checks
- Total amount of swarms in the known universe / the universe we check regularly
- Cycle time: the time after which a swarm will be re-checked for content popularity
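A back-of-the-envelope sketch of how these parameters interact; all the numbers below are illustrative assumptions, not measurements:

```python
# Sketch of the simple model: cycle time and per-peer bandwidth as functions
# of the parameters above. The swarm and peer counts are assumptions.
check_interval = 5          # time between swarm checks per peer (s)
check_cost = 15 * 1024      # bytes for an average swarm size check (~15 KByte)
fanout = 10                 # peers you share swarm size data with
entries_per_packet = 56     # swarm checks that fit in a 1480-byte UDP packet
total_swarms = 50_000       # swarms in the universe we check regularly (assumed)
peers = 1_000               # peers participating in the community (assumed)

# Cycle time if peers check independently and duplicate checks are ignored:
cycle_time = total_swarms * check_interval / peers                    # seconds per full pass

# Per-peer bandwidth: one full packet to `fanout` neighbors once it fills,
# plus the cost of the peer's own checks.
gossip_rate = 1480 * fanout / (entries_per_packet * check_interval)   # bytes/s
check_rate = check_cost / check_interval                              # bytes/s

print(cycle_time, gossip_rate, check_rate)   # e.g. 250 s, ~53 B/s, ~3 KB/s
```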
Some simple parameters:
hash_check
Code:
You can use one of the Leaseweb servers for testing purposes; we should disable the exit nodes running on these servers for one or two days. Technical specifications of one of the servers: https://gyazo.com/5eb8aa07d0a767afc7586acdc787ac7f (I would say you can run up to 24 Dispersy instances on one machine, but you could try to scale this up using system load statistics).
Filling an entire UDP packet (1480 bytes on Ethernet) with swarm checks (20-byte SHA1, 2 bytes seeders, 2 bytes leechers, 2 bytes for how many seconds ago the check was done): 1480 / (20+2+2+2) = 56 entries.
According to Wikipedia, the maximum data length of a UDP packet is 65507 bytes. Where does the 1480 bytes come from?
The advantage of filling an entire UDP packet is that it reduces the cost of communication. However, this also means some data will not be sent out until a full UDP packet is filled. The consequence of such a delay might be that other nodes do a duplicate check because they don't know someone has already done the task. I will first try the eager-sharing strategy (share a result as soon as it is generated) and see if the communication cost is acceptable or not.
Share results of various last checks with a semi-fixed group of peers every 5 seconds.
Can we just gossip the last checks as soon as we have the results instead of forming a group and sharing the info within the group?
1480 bytes: https://stackoverflow.com/a/10239583
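A minimal sketch of how such a packet could be packed and unpacked with fixed 26-byte entries, using Python's struct module; the big-endian field layout is an assumption:

```python
# Sketch: pack swarm checks into a single 1480-byte UDP payload.
# Each entry: 20-byte SHA1 infohash, 2 bytes seeders, 2 bytes leechers,
# 2 bytes for how many seconds ago the check was done.
import struct

ENTRY = struct.Struct('>20sHHH')        # 26 bytes per swarm check
MAX_ENTRIES = 1480 // ENTRY.size        # 56 entries fit in one packet

def pack_checks(checks):
    """checks: list of (infohash_bytes, seeders, leechers, age_seconds) tuples."""
    return b''.join(ENTRY.pack(*c) for c in checks[:MAX_ENTRIES])

def unpack_checks(payload):
    return [ENTRY.unpack(payload[i:i + ENTRY.size])
            for i in range(0, len(payload), ENTRY.size)]
```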
Link to my thesis: https://github.com/MaChengxin/msc-thesis It is currently so badly written that only I myself understand what it is talking about. I will improve its readability.
Comments on first thesis draft
Determining distributed state in a distributed system is hard.
Our scientific goal is to create a content popularity estimation algorithm without any central entity, controlling server, or organisational embedding. Key related work is the DAO investment organisation, which does not contain any central element but was vulnerable to attack due to multiple programming bugs. Attack-resilience is also our key concern: an attacker should not be able to easily influence the content popularity. As academic studies have shown, attacks on the media are real. With current social media technology, fake news is shown to be substantial. Fraud with media systems has been studied extensively and fake accounts have been detected. Our aim is to devise open source algorithms and open innovation processes for distinguishing honest entities from attackers.

Possible first sentence of chapter 2: Our aim is to create a media system which is not controlled by a single commercial entity. Content discovery, browsing and search are critical for any media system. A key part of that is content popularity. The front page of a newspaper contains the most important news. In the digital age it is difficult to determine what is the most popular website, YouTube clip, tweet, blog post, Wikipedia edit, or discussion comment, especially in a distributed system. See the most viewed video, with 3,693,759,511 views: https://www.youtube.com/watch?time_continue=5&v=kJQP7kiw5Fk
Possible Chapter 3: Within this thesis we expanded an Internet-deployed open source distributed media system with content popularity. We focus on video because of the popularity of YouTube, Netflix, and BitTorrent.
3.1 The Tribler social media system
3.2 Content discovery and popularity in BitTorrent
3.3 Measuring popularity and attacks
3.5 Create an own chapter for this design and measurement work. Pearson correlation coefficient for 162 swarms?
Basic graphs of content popularity gossip community. Keep to the default Gumby plots. Screenshots of Android. Plots of Android. thesis DONE.
ch4.pdf Up-to-date Chapter 4
To be discussed: outline of Chapter 5; experiments to be included in Chapter 5 Current outline:
For Chapters 1 to 3, I think it is better to revise them after finishing Chapters 4 and 5, so that I will know which aspects to emphasize.
Comments:
> Determining distributed state in a distributed system is hard.

Then the sub-problem to describe is content popularity.
First prototype deployment, moving work to new issue.
The popularity of content is essential. Currently we do not have a method to determine whether a magnet link points to a large swarm or a dying swarm.
This work enables novel caching approaches, see "Efficient analysis of caching strategies under dynamic content popularity".
Each peer in the network must contact a swarm to determine its size. Checking 1 swarm every 5 seconds to conserve bandwidth means it takes many hours to check several thousand swarms. It is possible to share this information with your neighbors. By gossiping this swarm size information around, each swarm needs to be checked only once and bandwidth is saved.
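For example, at one check every 5 seconds, a single peer working through, say, 5,000 swarms would need 5,000 × 5 s = 25,000 s, roughly 7 hours, for one full pass.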
The swarm size community is a mechanism to collaboratively determine the size of swarms. Going from a magnet link to a swarm size involves numerous steps: resolving a magnet link means using the DHT, and checking a swarm with a BitTorrent handshake also takes numerous UDP packets and requires dealing with non-responsive IPv4 addresses.
Experiments:
Science: