Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

User community insight using an improved crawler #1750

Open synctext opened 8 years ago

synctext commented 8 years ago

Goal: a validated, documented and reliable crawler to understand user behavior. This enables the future step of measuring behavioral change.

We have an existing crawler for Dispersy communities and Tribler. The general Tribler crawler stopped being updated in 2013. See: http://Statistics.tribler.org The site is annotated with our releases and major news events. However, the crawler is completely unmaintained and difficult to maintain. image

This crawler needs to be moved to a Proxmox machine and improved. Better insight will help us understand the network health and inform the roadmap.

Expected results: real-time daily graphs of Tribler network size: image

User upgrade behavior: image

Examples taken from: http://crawler.doxu.org/uptimes.html

image

ToDo: NAT type as reported by Dispersy in our community, and its evolution over time.
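The expected results above (daily network size, upgrade behavior, NAT type over time) all reduce to counting distinct peers per day, optionally split by one attribute. A minimal sketch, assuming the crawler emits one record per sighting; the record keys `peer`, `ts`, `version` and `nat_type` are hypothetical and not taken from the existing crawler:

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_counts(sightings, field=None):
    """Count distinct peers per UTC day, optionally split by one field.

    `sightings` is an iterable of dicts with hypothetical keys:
    "peer" (opaque peer id), "ts" (Unix seconds), plus optional
    attributes such as "version" or "nat_type".
    """
    seen = defaultdict(set)  # (day, field value) -> set of peer ids
    for s in sightings:
        day = datetime.fromtimestamp(s["ts"], tz=timezone.utc).date().isoformat()
        value = s.get(field, "unknown") if field else "all"
        seen[(day, value)].add(s["peer"])
    # Sort by (day, value) so the output is ready for plotting.
    return {key: len(peers) for key, peers in sorted(seen.items())}
```

Calling `daily_counts(records)` would give the network-size series, while `daily_counts(records, "version")` or `daily_counts(records, "nat_type")` would give the per-day breakdowns for upgrade behavior and NAT type.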

According to these GitHub download stats, we have 302,000 downloads of Tribler: http://www.somsubhra.com/github-release-stats/?username=tribler&repository=tribler However, our non-validated, many-years-old crawler only sees a few thousand users.

The thesis of Niels contains an extensive user community evaluation and "data science" portion. http://www.tribler.org/SimilarityFunction/ Thesis.pdf: http://kayapo.tribler.org/trac/raw-attachment/wiki/SimilarityFunction/thesis.pdf

Current setup:
Kayapo web space: /var/www/statistics.tribler.org/htdocs/img/
Soft links to: /home/tribler/generate-periodic-statistics

kayapo:/home/tribler/generate-periodic-statistics# wc -l *.py
 193 first_last.py
 191 parse.py
 169 reduce.py
 553 total

Some crawlers died a few years ago:

kayapo:/home/tribler/generate-periodic-statistics# ls -lah /collected/logs/superpeer
total 3.4M
drwxr-xr-x 26 tribler tribler 4.0K Jan 29  2015 .
drwxr-xr-x  8 tribler tribler 4.0K Feb  7  2014 ..
drwxr-xr-x  2 tribler tribler  12K May 23  2012 dispersy-tracker-1
drwxr-xr-x  2 tribler tribler  16K May 23  2012 dispersy-tracker-2
drwxr-xr-x  2 tribler tribler  12K Feb 10  2012 dispersy-tracker-3
drwxr-xr-x  2 tribler tribler  12K Feb 10  2012 dispersy-tracker-4
drwxr-xr-x  2 tribler tribler  20K May 23  2012 dispersy-tracker-5
drwxr-xr-x  2 tribler tribler  20K May 23  2012 dispersy-tracker-6
drwxr-xr-x  2 tribler tribler 764K Nov 25 05:15 dispersy-tracker-6421-kayapo
drwxr-xr-x  2 tribler tribler 740K Feb  9  2015 dispersy-tracker-6422-kayapo
drwxr-xr-x  2 tribler tribler  80K Sep 24  2012 dispersy-tracker-6423-om.cs.vu.nl
drwxr-xr-x  2 tribler tribler  20K Nov 25 05:16 dispersy-tracker-6424-leaseweb
drwxr-xr-x  2 tribler tribler  84K Sep 24  2012 dispersy-tracker-6424-om.cs.vu.nl
drwxr-xr-x  2 tribler tribler 180K Nov 21 05:16 dispersy-tracker-6425-asmat
drwxr-xr-x  2 tribler tribler 172K Nov 22 05:16 dispersy-tracker-6426-asmat
drwxr-xr-x  2 tribler tribler 340K Aug  3 05:17 dispersy-tracker-6427-pygmee
drwxr-xr-x  2 tribler tribler 340K Aug  3 05:17 dispersy-tracker-6428-pygmee
drwxr-xr-x  2 tribler tribler  20K Nov 25 05:16 dispersy-tracker-6434-leaseweb
drwxr-xr-x  2 tribler tribler  72K Nov 25 01:34 superpeer1
drwxr-xr-x  2 tribler tribler  80K Aug 16  2010 superpeer2
drwxr-xr-x  2 tribler tribler  96K Sep 24  2012 superpeer3
drwxr-xr-x  2 tribler tribler  20K Sep 24  2012 superpeer4
drwxr-xr-x  2 tribler tribler  48K Feb  2  2010 superpeer5
drwxr-xr-x  2 tribler tribler  72K Nov 25 04:04 superpeer6
drwxr-xr-x  2 tribler tribler  92K Nov 25 04:35 superpeer7
drwxr-xr-x  2 tribler tribler  84K Nov 25 05:05 superpeer8
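
The listing above shows dead crawlers only through their old directory timestamps. A minimal sketch of automating that check, assuming each crawler writes into its own subdirectory; the function name and the 30-day threshold are hypothetical:

```python
import time
from pathlib import Path


def stale_log_dirs(root, max_age_days=30, now=None):
    """Return (name, age_in_days) for subdirectories not modified recently."""
    now = time.time() if now is None else now
    stale = []
    for entry in sorted(Path(root).iterdir()):
        if not entry.is_dir():
            continue
        # Directory mtime changes when files are created, removed or renamed
        # inside it; appending to an existing file does not update it.
        age_days = (now - entry.stat().st_mtime) / 86400
        if age_days > max_age_days:
            stale.append((entry.name, int(age_days)))
    return stale
```

For example, `stale_log_dirs("/collected/logs/superpeer")` would list the crawler directories that have not received new files for more than 30 days. This uses the same signal as `ls -lah`, so a crawler that only appends to one long-lived log file would need a per-file check instead.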

synctext commented 8 years ago

Related to multichain crawling. We don't want to spy on our users for profit, but to identify faults, failures, and points for improvement. Respect privacy, no exposure of any individual, and only provide insight into the global system behavior. #2532 #1429
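
One possible way to enforce "no exposure of any individual" in published statistics is to report only groups above a minimum size. This is a sketch of that idea, not something the thread specifies; the function name, the `"other"` label and the threshold of 10 are assumptions:

```python
from collections import Counter


def aggregate_with_threshold(values, min_group=10):
    """Count observations per bucket and hide buckets smaller than min_group.

    Small buckets are merged into "other"; if the merged remainder is still
    below min_group it is dropped, so no published count describes fewer
    than min_group peers. Assumes no real bucket is itself named "other".
    """
    result = {}
    other = 0
    for bucket, n in Counter(values).items():
        if n >= min_group:
            result[bucket] = n
        else:
            other += n
    if other >= min_group:
        result["other"] = other
    return result
```

Applied to per-peer attributes such as client version or NAT type, this keeps the output at the level of global system behavior while rare configurations stay unpublished.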

qstokkink commented 6 years ago

http://statistics.tribler.org/ is back with IPv8, showing user communities. We just need longer-term statistics now.

qstokkink commented 1 month ago

A 2024 update: we now have multiple crawlers but they do not meet the original goal of OP. They are semi-validated, not documented, and not reliable. Frankly, we have too many crawlers: I have a hard time remembering what we even have running.