Ullaakut / astronomer

A tool to detect illegitimate stars from bot accounts on GitHub projects
MIT License
703 stars 24 forks

Documenting the algorithm and providing justification evidence #45

Open changkun opened 5 years ago

changkun commented 5 years ago

Thank you for this very interesting project. Here I share a few of my tests while using the project.

I first tested my personal project, which has about 3.9k stars, and the result did not look good.

$ docker run -t -e GITHUB_TOKEN=$GITHUB_TOKEN -v "/Users/changkun/dev/mct:/data/" ullaakut/astronomer changkun/modern-cpp-tutorial                                                                                          [22:00:10]
Beginning fetching process for repository changkun/modern-cpp-tutorial
Pre-fetching all stargazers...ok
  > Selecting 200 first stargazers out of 3930
  > Selecting 800 random stargazers out of 3930
Fetching contributions for 1000 users up to year 2013
Building trust report...ok

Averages                             Score           Trust
--------                             -----           -----
Weighted contributions:              4132              E
Private contributions:               65                E
Created issues:                      9                 D
Commits authored:                    238               C
Repositories:                        37                A
Pull requests:                       6                 E
Code reviews:                        2                 E
Account age (days):                  1444              B
5th percentile:                      9                 A
10th percentile:                     24                A
15th percentile:                     59                A
20th percentile:                     85                B
25th percentile:                     111               C
30th percentile:                     157               C
35th percentile:                     194               D
40th percentile:                     328               C
45th percentile:                     436               C
50th percentile:                     541               D
55th percentile:                     770               D
60th percentile:                     899               D
65th percentile:                     1255              D
70th percentile:                     1579              D
75th percentile:                     2599              D
80th percentile:                     3652              D
85th percentile:                     5277              E
90th percentile:                     6836              E
95th percentile:                     14190             E
----------------------------------------------------------
Overall trust:                                         D

✔ Analysis successful. 1000 users computed.
GitHub badge available at https://img.shields.io/endpoint.svg?url=https%3A%2F%2Fastronomer.ullaakut.eu%2Fshields%3Fowner%3Dbilibili%26name%3Dkratos

Then I picked another project from the GitHub trending page:

$ docker run -t -e GITHUB_TOKEN=$GITHUB_TOKEN -v "/Users/changkun/dev/mct:/data/" ullaakut/astronomer bilibili/kratos                                                                                                       [22:12:59]
Beginning fetching process for repository bilibili/kratos
Pre-fetching all stargazers...ok
  > Selecting 200 first stargazers out of 5739
  > Selecting 800 random stargazers out of 5739
Fetching contributions for 1000 users up to year 2013
Building trust report...ok

Averages                             Score           Trust
--------                             -----           -----
Weighted contributions:              2536              E
Private contributions:               71                E
Created issues:                      6                 D
Commits authored:                    137               D
Repositories:                        30                A
Pull requests:                       6                 D
Code reviews:                        1                 E
Account age (days):                  1545              B
5th percentile:                      9                 A
10th percentile:                     25                A
15th percentile:                     43                A
20th percentile:                     55                C
25th percentile:                     74                D
30th percentile:                     106               D
35th percentile:                     146               D
40th percentile:                     188               D
45th percentile:                     245               D
50th percentile:                     349               D
55th percentile:                     490               D
60th percentile:                     638               E
65th percentile:                     832               E
70th percentile:                     1092              E
75th percentile:                     1577              E
80th percentile:                     2072              E
85th percentile:                     3117              E
90th percentile:                     5329              E
95th percentile:                     9192              E
----------------------------------------------------------
Overall trust:                                         D

✔ Analysis successful. 1000 users computed.
GitHub badge available at https://img.shields.io/endpoint.svg?url=https%3A%2F%2Fastronomer.ullaakut.eu%2Fshields%3Fowner%3Dbilibili%26name%3Dkratos

OK, then let's test TensorFlow.

$ docker run -t -e GITHUB_TOKEN=$GITHUB_TOKEN -v "/Users/changkun/dev/mct:/data/" ullaakut/astronomer tensorflow/tensorflow                                                                                                 [23:32:47]
Beginning fetching process for repository tensorflow/tensorflow
Pre-fetching all stargazers...ok
  > Selecting 200 first stargazers out of 131149
  > Selecting 800 random stargazers out of 131149
Fetching contributions for 1000 users up to year 2013
Building trust report...ok

Averages                             Score           Trust
--------                             -----           -----
Weighted contributions:              7495              D
Private contributions:               190               C
Created issues:                      18                B
Commits authored:                    198               D
Repositories:                        16                C
Pull requests:                       10                D
Code reviews:                        3                 D
Account age (days):                  1145              C
5th percentile:                      1                 E
10th percentile:                     2                 E
15th percentile:                     5                 E
20th percentile:                     10                E
25th percentile:                     22                E
30th percentile:                     32                E
35th percentile:                     40                E
40th percentile:                     59                E
45th percentile:                     76                E
50th percentile:                     114               E
55th percentile:                     153               E
60th percentile:                     217               E
65th percentile:                     368               E
70th percentile:                     707               E
75th percentile:                     1076              E
80th percentile:                     2109              E
85th percentile:                     3390              E
90th percentile:                     14580             D
95th percentile:                     30685             D
----------------------------------------------------------
Overall trust:                                         D

✔ Analysis successful. 1000 users computed.
GitHub badge available at https://img.shields.io/endpoint.svg?url=https%3A%2F%2Fastronomer.ullaakut.eu%2Fshields%3Fowner%3Dtensorflow%26name%3Dtensorflow

Issues with the Algorithm

This repo proposes an algorithm that passes judgment on projects, without any prior study of the rationale behind it. As a user of your algorithm, I particularly expect the following supporting points on why the algorithm is accurate:

  1. A theoretical analysis of the influence of each of the defined factors, together with a regression analysis and evidence of the algorithm's statistical stability.
  2. Benchmarks on various projects that illustrate how your algorithm matches the theoretical analysis for the top 10 most valuable open source projects, like golang/go, torvalds/linux, etc.

    "Those random stargazers can then sometimes be responsible for slight changes in the results, but they usually represent a difference of 1% to 3%, which is negligeable." -- README.md

    May I ask how you reached this conclusion? How large is your test sample? Which repositories does it contain?

  3. A user study. An important way of evaluating usability issues is to hold a user study. Typically, a single score cannot express many different aspects, and it is not easy to say whether the stars of a repo are seriously fake or unworthy. A quantitative analysis could cover, for example, how other users feel about the score provided by the algorithm: does the score match your mental expectation? Why? How could we help? Those are questions that should be seriously considered.

changkun commented 5 years ago

Additional results for torvalds/linux

$ docker run -t -e GITHUB_TOKEN=$GITHUB_TOKEN -v "/Users/changkun/dev/mct:/data/" ullaakut/astronomer torvalds/linux                                                                                                        [23:54:40]
Beginning fetching process for repository torvalds/linux
Pre-fetching all stargazers...ok
  > Selecting 200 first stargazers out of 77590
  > Selecting 800 random stargazers out of 77590
Fetching contributions for 1000 users up to year 2013
Building trust report...ok

Averages                             Score           Trust
--------                             -----           -----
Weighted contributions:              11055             C
Private contributions:               284               B
Created issues:                      13                C
Commits authored:                    236               C
Repositories:                        17                C
Pull requests:                       15                C
Code reviews:                        2                 E
Account age (days):                  1416              C
5th percentile:                      3                 D
10th percentile:                     4                 E
15th percentile:                     10                E
20th percentile:                     18                E
25th percentile:                     33                E
30th percentile:                     43                E
35th percentile:                     70                E
40th percentile:                     101               E
45th percentile:                     133               E
50th percentile:                     197               E
55th percentile:                     304               E
60th percentile:                     490               E
65th percentile:                     902               E
70th percentile:                     1235              E
75th percentile:                     2476              D
80th percentile:                     4888              D
85th percentile:                     8021              D
90th percentile:                     17211             C
95th percentile:                     42782             C
----------------------------------------------------------
Overall trust:                                         D

✔ Analysis successful. 1000 users computed.
GitHub badge available at https://img.shields.io/endpoint.svg?url=https%3A%2F%2Fastronomer.ullaakut.eu%2Fshields%3Fowner%3Dtorvalds%26name%3Dlinux

Ullaakut commented 5 years ago

Hi @changkun ! Thanks for your tests and your very interesting suggestions :)

As stated in the README, my algorithm only attempts to estimate authenticity based on the user contributions. The repositories that you scanned are mostly starred by casual GitHub users, which makes the results quite low :/ For example, since your repository is a tutorial, it will tend to be starred by CS students who are learning C++, so their average contributions will tend to be lower than those of the stargazers of technical libraries.

I determined the ratios for what is "trustworthy" based on a sample of some GitHub open source repositories:

```
LiveSplit/LiveSplit
bettercap/bettercap
bxcodec/faker
cenkalti/backoff
containous/traefik
d5/tengo
derailed/k9s
dgageot/demoit
ehazlett/interlock
envoyproxy/envoy
fatih/color
francoispqt/gojay
gcla/termshark
golang/proposal
grafana/loki
guptarohit/asciigraph
hashicorp/raft
iafan/goplayspace
ikruglov/slapper
imdario/mergo
jlevesy/sind
julienschmidt/httprouter
kataras/iris
knqyf263/trivy
kubernetes/kops
labstack/echo
ldez/prm
lukechampine/uint128
michenriksen/gitrob
montanaflynn/stats
moul/assh
mvdan/gofumpt
nektos/act
notnil/chess
olivere/elastic
operator996/yaocl
rancher/k3s
rs/zerolog
sirupsen/logrus
spf13/cobra
spf13/viper
thoas/stats
totoval/totoval
tsenart/vegeta
ullaakut/astronomer
ullaakut/cameradar
ullaakut/gorsair
ullaakut/nmap
ullaakut/rtspallthethings
valyala/fasthttp
vbauerster/mpb
vektra/mockery
zhangpeihao/gortmp
```

Unfortunately, this list is biased towards open source Go projects and libraries, which are most often used in technical projects, which means that the average stargazer of those projects is probably not representative of the average GitHub stargazer. That would explain the results you found.

  1. I wish I were able to do that, but I'm not a mathematician in any way, so it would probably take me a lot of time. I welcome contributions from people with a better mathematical background than me though! As stated in the README.md, that's the greatest contribution one can make to this project.

  2. As for the impact of the random factor, of course it depends on the project, but throughout my testing on all those repositories, it never made the result vary by more than +/- 3%. Of course the sample is quite small, which is why I wrote "usually", since I can't guarantee it will never be the case 🤷‍♂

  3. That could be interesting indeed, but I'm afraid that because of how the community currently tends to point fingers and speculate on their competition not having legitimate stargazers, this system would be biased/abused as well. What do you think?
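
One way to check the +/- 3% figure from point 2 on any repository would be to repeat the random draw many times on the same stargazer list and look at how far the resulting averages spread. The following is only a minimal sketch under assumptions: it supposes that the 800 random stargazers are drawn from those after the first 200, that each stargazer is reduced to a single contribution score, and it uses synthetic data. The function names are hypothetical and not taken from astronomer's code.

```python
import random
import statistics


def sample_stargazers(scores, first=200, rand=800, rng=None):
    """Return the first `first` scores plus `rand` scores drawn at random from the rest.

    Assumption: the random draw excludes the first `first` stargazers.
    """
    rng = rng or random.Random()
    head = scores[:first]
    tail = scores[first:]
    picked = rng.sample(tail, min(rand, len(tail)))
    return head + picked


def relative_spread(scores, runs=50, seed=0):
    """Repeat the sampling `runs` times; return (max - min) / mean of the sample averages."""
    rng = random.Random(seed)
    means = [statistics.fmean(sample_stargazers(scores, rng=rng)) for _ in range(runs)]
    return (max(means) - min(means)) / statistics.fmean(means)


if __name__ == "__main__":
    # Synthetic, heavy-tailed contribution scores for 5000 made-up stargazers.
    data_rng = random.Random(42)
    scores = [int(data_rng.lognormvariate(5, 2)) for _ in range(5000)]
    print(f"relative spread over 50 runs: {relative_spread(scores):.1%}")
```

Running the same measurement on real stargazer data for each repository in the reference list would show whether the spread stays within a few percent or grows on repositories with a heavy-tailed contribution distribution.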

I would love to discuss this more in depth with you since you seem to have great ideas and suggestions, and a better knowledge than me in this matter :)

changkun commented 5 years ago

Hi @Ullaakut. Sorry for my late response, and thank you for opening the discussion regarding the algorithm.

In conclusion, malicious stargazer detection is a challenging problem. You must sample enough data on GitHub and analyze: 1) what the essential factors are, 2) what potential factors you haven't considered so far, and so on...

Ullaakut commented 5 years ago

Hey again @changkun !

Thanks for the details 🤔 I'm thinking of ways to have a better idea of the differences between legit and malicious stargazers, but the more I look into it the more challenging the problem becomes.

It turns out a lot of bought stars might come from hacked accounts, or from accounts of legit GitHub users who are simply getting paid to star projects when asked (see https://gimhub.com/). And those are pretty much impossible to tell apart from other users.

What my algorithm is good at detecting is basically any repository that used one of the old Python scripts that were around about 3 years ago to automatically create accounts and star a GitHub project, resulting in the first X stargazers of the repository having absolutely 0 contributions and having done nothing but star a project.
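
The pattern described above can be illustrated with a simple check on the earliest stargazers. This is only a sketch under assumptions: the `Stargazer` record, its `contributions` field, the window of 200 accounts, and the 0.5 threshold are all hypothetical and do not reflect astronomer's actual data model or scoring.

```python
from dataclasses import dataclass


@dataclass
class Stargazer:
    login: str
    contributions: int  # total contributions of any kind (hypothetical field)


def zero_contribution_ratio(stargazers, first=200):
    """Fraction of the first `first` stargazers (in starring order) with no contributions."""
    head = stargazers[:first]
    if not head:
        return 0.0
    return sum(1 for s in head if s.contributions == 0) / len(head)


def looks_scripted(stargazers, first=200, threshold=0.5):
    """Flag a repository when most of its earliest stargazers have zero contributions."""
    return zero_contribution_ratio(stargazers, first) >= threshold
```

A check like this only catches freshly created, empty accounts. It says nothing about paid or hijacked accounts that have real activity.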

Those scripts no longer work though, since GitHub tightened account creation security by requiring email validation and a security challenge (using either vision or hearing).