changkun opened this issue 5 years ago
Additional results for torvalds/linux
$ docker run -t -e GITHUB_TOKEN=$GITHUB_TOKEN -v "/Users/changkun/dev/mct:/data/" ullaakut/astronomer torvalds/linux [23:54:40]
Beginning fetching process for repository torvalds/linux
Pre-fetching all stargazers...ok
> Selecting 200 first stargazers out of 77590
> Selecting 800 random stargazers out of 77590
Fetching contributions for 1000 users up to year 2013
Building trust report...ok
Averages Score Trust
-------- ----- -----
Weighted contributions: 11055 C
Private contributions: 284 B
Created issues: 13 C
Commits authored: 236 C
Repositories: 17 C
Pull requests: 15 C
Code reviews: 2 E
Account age (days): 1416 C
5th percentile: 3 D
10th percentile: 4 E
15th percentile: 10 E
20th percentile: 18 E
25th percentile: 33 E
30th percentile: 43 E
35th percentile: 70 E
40th percentile: 101 E
45th percentile: 133 E
50th percentile: 197 E
55th percentile: 304 E
60th percentile: 490 E
65th percentile: 902 E
70th percentile: 1235 E
75th percentile: 2476 D
80th percentile: 4888 D
85th percentile: 8021 D
90th percentile: 17211 C
95th percentile: 42782 C
----------------------------------------------------------
Overall trust: D
✔ Analysis successful. 1000 users computed.
GitHub badge available at https://img.shields.io/endpoint.svg?url=https%3A%2F%2Fastronomer.ullaakut.eu%2Fshields%3Fowner%3Dtorvalds%26name%3Dlinux
Hi @changkun! Thanks for your tests and your very interesting suggestions :)
As stated in the README, my algorithm only attempts to estimate authenticity based on user contributions. The repositories you scanned are mostly starred by casual GitHub users, which makes the results quite low :/ For example, since your repository is a tutorial, it will tend to be starred by CS students who are learning C++, so their average contributions will tend to be lower than those of a technical library's stargazers.
I determined the ratios for what is "trustworthy" based on a sample of some GitHub open source repositories:
Unfortunately, this list is biased towards open source Go projects and libraries, which are most often used in technical projects, so the average stargazer of those projects is probably not representative of the average GitHub stargazer. That would explain the results you found.
I wish I was able to do that, but I'm not a mathematician in any way, so it would probably take me a lot of time. I welcome contributions from people with a better mathematical background than mine, though! As stated in the README.md, that's the greatest contribution one can make to this project.
As for the impact of the random factor, of course it depends on the project, but throughout my testing on all those repositories it never made the result vary by more than +/- 3%. Of course the sample is quite small, which is why I wrote "usually", since I can't guarantee it will never be the case 🤷♂
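To make that "usually" a bit more concrete, the spread could be measured empirically with a small bootstrap-style experiment like the sketch below. Note that the `score` function and the synthetic contribution counts are placeholders I made up for illustration, not the actual astronomer formula or data:

```go
package main

import (
	"fmt"
	"math/rand"
)

// score stands in for the real trust computation: here it is just the mean
// contribution count of the sampled users. The actual astronomer formula is
// more involved; this only demonstrates how the sampling spread can be measured.
func score(contributions []int) float64 {
	total := 0
	for _, c := range contributions {
		total += c
	}
	return float64(total) / float64(len(contributions))
}

func main() {
	// Hypothetical population of per-stargazer contribution counts.
	population := make([]int, 77590)
	for i := range population {
		population[i] = rand.Intn(5000)
	}

	const runs, sampleSize = 20, 800
	lo, hi := -1.0, -1.0
	for r := 0; r < runs; r++ {
		// Draw a fresh random sample, as the tool does for the 800 random stargazers.
		sample := make([]int, sampleSize)
		for i := range sample {
			sample[i] = population[rand.Intn(len(population))]
		}
		s := score(sample)
		if lo < 0 || s < lo {
			lo = s
		}
		if s > hi {
			hi = s
		}
	}
	fmt.Printf("score spread over %d random samples: %.1f%%\n", runs, 100*(hi-lo)/lo)
}
```

Running something like this against the real per-user data would give an empirical bound on the randomness instead of my anecdotal +/- 3%.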
That could be interesting indeed, but I'm afraid that because of how the community currently tends to point fingers and speculate that their competition doesn't have legitimate stargazers, this system would be biased/abused as well. What do you think?
I would love to discuss this more in depth with you since you seem to have great ideas and suggestions, and a better knowledge than me in this matter :)
Hi @Ullaakut. Sorry for my late response, and thank you for opening the discussion about the algorithm.
Regarding randomization of the algorithm: this actually relies on the statistical stability of the algorithm, which is exactly why I raised the point in the first place. Typically, if you formalize your evaluation function mathematically, you can easily describe the rate of change for each factor by simply computing the gradient of your formula (see the sketch below).
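To illustrate what I mean, suppose (purely as an assumed form for this discussion, not necessarily what astronomer actually computes) that the trust score were a weighted sum of the measured factors:

```latex
% Assumed form of the evaluation function, for illustration only.
S(x_1, \dots, x_n) = \sum_{i=1}^{n} w_i x_i,
\qquad
\frac{\partial S}{\partial x_i} = w_i
```

The gradient then tells you directly how much the overall score moves when a single factor (say, the median commit count of the 800 randomly sampled stargazers) shifts, which lets you bound the effect of the random sample analytically instead of estimating it from a handful of runs.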
Regarding a user study: your algorithm actually involves more social complexity than other repository evaluations, such as Go Report Card's assessment of code quality. In Go Report Card's case, code quality is evaluated against developers' common best practices, so its results are fairly reliable. In your algorithm, by contrast, we have to ask ourselves: what does "likely" mean in the question "how likely is it that those users are real humans"? How can we properly define "likely"? Those are open questions for everyone. More fundamentally, do the stargazers who star useful repositories also have many open source contributions on GitHub? I am afraid the answer is "no". To most users, the "star" function (and even "fork") is simply a bookmark that lets them quickly revisit the repository in the future. They may contribute actively on GitHub, or they may never open a public repo for personal reasons.
Moreover, do people who contribute to GitHub often also star many useful projects? The answer is perhaps "no" again; they may not star projects at all. Many similar questions could be asked.
Regarding bias/abuse: this is the very reason why you need to explain the algorithm well, to improve people's trust :)
In conclusion, malicious stargazer detection is a challenging problem. You must sample enough data on GitHub and analyze: 1) which factors are essential, 2) which potential factors you haven't considered yet, and so on.
Hey again @changkun !
Thanks for the details 🤔 I'm thinking of ways to have a better idea of the differences between legit and malicious stargazers, but the more I look into it the more challenging the problem becomes.
It turns out a lot of bought stars might be hacked accounts, or accounts from legit GitHub users who are simply getting paid to star projects when asked. (See https://gimhub.com/). And those are pretty much impossible to detect, compared to other users.
What my algorithm is good at detecting is basically any repository that used one of the old Python scripts from about 3 years ago, which automatically created accounts and starred a GitHub project, resulting in the first X accounts of the repository having absolutely 0 contributions and nothing but that one star.
Those scripts are no longer working though, since GH increased their account creation security by requiring email validation and a security challenge (using either vision or hearing).
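As a rough illustration of that detection pattern (not the actual astronomer implementation, and the `Stargazer` type is just a made-up stand-in for the GitHub API data), flagging such a repository could look like this:

```go
package main

import "fmt"

// Stargazer is a minimal, hypothetical view of the data the API provides:
// the user's login and their total recorded contributions.
type Stargazer struct {
	Login         string
	Contributions int
}

// suspiciousShare returns the fraction of the first n stargazers with zero
// contributions, which is the signature left behind by the old star-farming
// scripts described above.
func suspiciousShare(stars []Stargazer, n int) float64 {
	if n > len(stars) {
		n = len(stars)
	}
	if n == 0 {
		return 0
	}
	empty := 0
	for _, s := range stars[:n] {
		if s.Contributions == 0 {
			empty++
		}
	}
	return float64(empty) / float64(n)
}

func main() {
	stars := []Stargazer{
		{"bot-001", 0}, {"bot-002", 0}, {"bot-003", 0}, {"realuser", 412},
	}
	if share := suspiciousShare(stars, 200); share > 0.5 {
		fmt.Printf("warning: %.0f%% of the earliest stargazers have no contributions\n", 100*share)
	}
}
```

Accounts bought from paid-star services or hijacked from real users won't trip a check like this, which is why they are so much harder to detect.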
Thank you for this very interesting project. Here I share a few of my tests while using the project.
I initially tested my personal project, which has about 3.9k stars; the result didn't seem so good.
Then I picked another project from the GitHub trending page:
OK, then let's test TensorFlow.
Issues with the Algorithm
This repo proposes a trust-judgment algorithm without a prior study of its scoring ratios. As a user of the algorithm, I particularly expect the following supporting evidence for why the algorithm is accurate:
Benchmarks on various projects, illustrating how your algorithm matches theoretical analysis for the top 10 most valuable open source projects, like golang/go, torvalds/linux, etc.
May I ask how you reached this conclusion? How large are your test samples? What are they? Etc.