Better krihelimeter calculation

Nagasaki45 commented 8 years ago

Take the DB, apply PCA to reduce the stats to 1 dimension, and use the resulting coefficients as weights.

Nagasaki45 commented 8 years ago

Although the above might work, there is no need to be unopinionated here. I lean towards an opinionated version instead, which, in the simplest case, means deciding on weights for the stats.

For example, having the following coefficients:

1 point per commit
20 points per contributor
8 points for anything else
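A minimal sketch of such a fixed-weight calculation, using the example coefficients above. The function name and the `other_events` parameter (a single count for everything that is neither a commit nor a contributor, e.g. issues and PRs) are assumptions for illustration, not the project's actual code:

```python
# Example weights from the list above (illustrative, not the project's code).
COMMIT_POINTS = 1
CONTRIBUTOR_POINTS = 20
OTHER_POINTS = 8  # applied to every other counted event, e.g. issues and PRs


def krihelimeter(commits, contributors, other_events):
    """Return the weighted sum of the activity statistics."""
    return (
        COMMIT_POINTS * commits
        + CONTRIBUTOR_POINTS * contributors
        + OTHER_POINTS * other_events
    )
```

For instance, 20 commits, 3 contributors and 4 other events give 20 + 60 + 32 = 112 points.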

Alternatively, @cool-RR suggested the following dynamic calculation: for each statistic (commits, contributors, merged PRs, etc.), calculate the percentile of the repo. Then add the percentiles of all statistics together to get the krihelimeter. This ensures that the krihelimeter is bounded between 0 and 100 * num_of_statistics.
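The percentile idea could be sketched roughly as follows. This assumes "percentile" means the percentage of repos whose value is less than or equal to the repo's own value; the function names and the dict-based data layout are hypothetical:

```python
from bisect import bisect_right


def percentile_rank(sorted_values, value):
    """Percentage (0..100) of entries in sorted_values that are <= value."""
    return 100.0 * bisect_right(sorted_values, value) / len(sorted_values)


def percentile_krihelimeters(repos):
    """Map each repo name to the sum of its per-statistic percentile ranks.

    repos: dict of repo name -> dict of statistic name -> count.
    Missing statistics are treated as 0.
    """
    stat_names = sorted({name for stats in repos.values() for name in stats})
    # Sort each statistic's values once so every lookup is a binary search.
    sorted_columns = {
        name: sorted(stats.get(name, 0) for stats in repos.values())
        for name in stat_names
    }
    return {
        repo: sum(
            percentile_rank(sorted_columns[name], stats.get(name, 0))
            for name in stat_names
        )
        for repo, stats in repos.items()
    }
```

With this definition each statistic contributes at most 100, so the total never exceeds 100 times the number of statistics.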

If someone has a better idea for calculating the krihelimeter, please do tell.

Nagasaki45 commented 8 years ago

See f3d2ec22adcfd1060944a69c2a62b28659b5c131. The above basic calculation was implemented.

cool-RR commented 8 years ago

Looks good :)

Nagasaki45 commented 7 years ago

I'm satisfied with the current calculation. See no reason to change it soon. Therefore, I'm closing the ticket.

sglienke commented 7 years ago

FWIW I think that adding a flat number of points per contributor creates a bit of an imbalance for smaller projects.

Let's say I have:

a project with 3 authors and 20 commits (18 by one author and 1 each by the other two, both contributed via pull requests that fixed reported issues): 112 pts (3 * 20 + 20 * 1 + 4 * 8)
another project with one author who made 60 commits: 80 pts (1 * 20 + 60 * 1)

I suggest adding up the points for commits, PRs and issues, and then multiplying the total by a coefficient for the authors. That way, projects with more contributors get a slightly higher rating than ones with fewer contributors, without being completely imbalanced.

Nagasaki45 commented 7 years ago

First, thanks for the feedback!

> I suggest adding points for commits, PRs, issues and then multiply them with a coefficient for the authors.

I'm not quite sure what you mean. With this suggestion, the difference between the two scenarios you provided will be even bigger, won't it? Assuming all the weights for commits / issues / PRs remain the same:

the first scenario will get: (20 * 1 + 4 * 8) * 3 = 156.
and the 2nd scenario will get: (60 * 1) * 1 = 60.

Maybe I'm completely wrong in understanding your suggestion. Can you please elaborate?

sglienke commented 7 years ago

> I'm not quite sure what do you mean. With this suggestion the difference between the two scenarios you provided will be even bigger, isn't it?

If you multiply by the number of authors, yes. But that is not what I meant. Let's say, just for example's sake, that you multiply by 1 + 0.1 * (authors - 1). Then you get:

(20 * 1 + 4 * 8) * 1.2 = 62.4
(60 * 1) * 1 = 60
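As a rough sketch, the multiplier variant could look like this. The function name, parameter names and default weights are assumptions chosen to match the example numbers in this thread, not an agreed formula:

```python
def scaled_krihelimeter(commits, other_events, authors,
                        commit_points=1, other_points=8, author_step=0.1):
    """Sum the activity points, then scale them by an author coefficient.

    The coefficient is 1 for a single author and grows by author_step
    for every additional author: 1 + author_step * (authors - 1).
    """
    base = commit_points * commits + other_points * other_events
    return base * (1 + author_step * (authors - 1))
```

With the defaults, 20 commits, 4 other events and 3 authors come out at roughly 62.4 (subject to floating-point rounding), while 60 commits by a single author stay at 60.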

I also think that PRs and issues should not weigh so much more than commits; they should be rebalanced too. It might take a bit more effort to come up with a formula that represents activity properly across multiple projects. And I am not even getting into the time component, which might also be interesting (which project is more active: one that gets a couple of commits every day, or one that gets a bunch on one day of the month and nothing for the rest of it?). As you see, it can become quite complicated. The question is whether you are aiming for that or for a simple but (imo) inaccurate number.

Nagasaki45 commented 7 years ago

I think that the best way to decide whether the suggested metric is better is to generate a new "most active" list based on it and investigate the results. After all, it is all very subjective. I will do this for the entire DB, and maybe for the Python language, as it is both very active and one I'm relatively familiar with. Would you like to see the results for other languages?
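A comparison like this could be produced along the following lines. This is only a sketch: the statistic keys (`commits`, `authors`, `other`), the two metric functions and their weights are assumptions based on the numbers discussed in this thread, not the code that generated the lists below:

```python
def current_metric(stats):
    """Flat weights: 1 per commit, 20 per author, 8 per other event."""
    return stats["commits"] + 20 * stats["authors"] + 8 * stats["other"]


def suggested_metric(stats):
    """Activity points scaled by 1 + 0.1 * (authors - 1)."""
    base = stats["commits"] + 8 * stats["other"]
    return base * (1 + 0.1 * (stats["authors"] - 1))


def rank_repos(repos, metric):
    """Return repo names ordered from most to least active under metric."""
    return sorted(repos, key=lambda name: metric(repos[name]), reverse=True)


def side_by_side(repos, first_metric, second_metric, top=50):
    """Pair up the top entries of both rankings, row by row."""
    first = rank_repos(repos, first_metric)[:top]
    second = rank_repos(repos, second_metric)[:top]
    return list(zip(first, second))
```

Each row of the result holds the repo at that rank under the first metric and the repo at the same rank under the second, which is the layout of a two-column comparison table.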

Nagasaki45 commented 7 years ago

Top 50 repos

Current metric	Suggested metric
CocoaPods/Specs	CocoaPods/Specs
Microsoft/vscode	Microsoft/azure-docs
kubernetes/kubernetes	NixOS/nixpkgs
Microsoft/azure-docs	kubernetes/kubernetes
NixOS/nixpkgs	githubschool/open-enrollment-classes-introduction-to-github
aburasali/cs362w17online	Microsoft/vscode
BlissRoms/platform_frameworks_base	ansible/ansible
ansible/ansible	rust-lang/rust
githubschool/open-enrollment-classes-introduction-to-github	dotnet/corefx
rust-lang/rust	gentoo/gentoo
dotnet/corefx	caskroom/homebrew-cask
caskroom/homebrew-cask	Automattic/wp-calypso
freebsd/freebsd-ports	tensorflow/tensorflow
gentoo/gentoo	tgstation/tgstation
tgstation/tgstation	aburasali/cs362w17online
ampproject/amphtml	jlord/patchwork
Automattic/wp-calypso	Homebrew/homebrew-core
tensorflow/tensorflow	DefinitelyTyped/DefinitelyTyped
jlord/patchwork	hashicorp/terraform
flutter/flutter	facebook/react-native
hashicorp/terraform	DroidKaigi/conference-app-2017
everypolitician/everypolitician-data	saltstack/salt
DefinitelyTyped/DefinitelyTyped	dart-lang/sdk
Homebrew/homebrew-core	freebsd/freebsd-ports
DroidKaigi/conference-app-2017	golang/go
angular/angular-cli	ampproject/amphtml
saltstack/salt	docker/docker
docker/docker	JuliaLang/julia
facebook/react-native	flutter/flutter
dotnet/roslyn	dotnet/coreclr
dart-lang/sdk	angular/angular-cli
JuliaLang/julia	liferay/liferay-portal
golang/go	apple/swift
dotnet/coreclr	dotnet/roslyn
krexus/frameworks_base	nodejs/node
earl/llvm-mirror	elastic/elasticsearch
llvm-mirror/llvm	d3athrow/vgstation13
liferay/liferay-portal	home-assistant/home-assistant
apple/swift	mantidproject/mantid
NixOS/nixpkgs-channels	servo/servo
convox/rack	everypolitician/everypolitician-data
elastic/elasticsearch	openstack/openstack
openstack/openstack	docker/docker.github.io
nodejs/node	cockroachdb/cockroach
freebsd/freebsd	joomla/joomla-cms
dimagi/commcare-hq	dimagi/commcare-hq
cockroachdb/cockroach	librenms/librenms
d3athrow/vgstation13	ManageIQ/manageiq
beagleboard/linux	code-dot-org/code-dot-org
joomla/joomla-cms	llvm-mirror/llvm

Top 50 python repos

Current metric	Suggested metric
ansible/ansible	ansible/ansible
saltstack/salt	saltstack/salt
dimagi/commcare-hq	home-assistant/home-assistant
home-assistant/home-assistant	dimagi/commcare-hq
odoo/odoo	odoo/odoo
LLNL/spack	LLNL/spack
mozilla/addons-server	mozilla/addons-server
wikimedia/mediawiki-extensions	wikimedia/mediawiki-extensions
edx/edx-platform	edx/edx-platform
rg3/youtube-dl	rg3/youtube-dl
fchollet/keras	cloudmesh/classes
zulip/zulip	zulip/zulip
cloudmesh/classes	ros/rosdistro
ros/rosdistro	duckduckgo/zeroclickinfo-fathead
duckduckgo/zeroclickinfo-fathead	fchollet/keras
Azure/azure-cli	coala/coala
AdguardTeam/AdguardFilters	Azure/azure-cli
openshift/openshift-ansible	inasafe/inasafe
coala/coala	statsmodels/statsmodels
statsmodels/statsmodels	openshift/openshift-ansible
google/ggrc-core	ipython/ipython
Theano/Theano	uclouvain/osis
inasafe/inasafe	buildbot/buildbot
pandas-dev/pandas	frappe/erpnext
matplotlib/matplotlib	pandas-dev/pandas
conda/conda	matplotlib/matplotlib
frappe/erpnext	Theano/Theano
scikit-learn/scikit-learn	google/ggrc-core
ipython/ipython	mirumee/saleor
rcbops/rpc-openstack	scikit-learn/scikit-learn
uclouvain/osis	rcbops/rpc-openstack
mirumee/saleor	python/mypy
buildbot/buildbot	bigchaindb/bigchaindb
pisilinux/main	django/django
openshift/openshift-tools	pymedusa/Medusa
python/mypy	airbnb/superset
openembedded/openembedded-core	ManageIQ/integration_tests
bigchaindb/bigchaindb	terasolunaorg/guideline
kubernetes-incubator/kargo	django-oscar/django-oscar
ManageIQ/integration_tests	Cloud-CV/EvalAI
django/django	kubernetes-incubator/kargo
pfnet/chainer	openbmc/openbmc
airbnb/superset	AdguardTeam/AdguardFilters
Cloud-CV/EvalAI	pfnet/chainer
blueboxgroup/ursula	openstates/openstates
pymedusa/Medusa	astropy/astropy
django-oscar/django-oscar	galaxyproject/galaxy
xonsh/xonsh	edx/configuration
getsentry/sentry	SatelliteQE/robottelo
terasolunaorg/guideline	conda/conda

sglienke commented 7 years ago

I don't actually want to crunch statistics myself, but without the numbers that led to this outcome, those lists don't give me enough information to judge whether it got better (imo) or not. :)

Nagasaki45 commented 7 years ago

You are absolutely right! Here is a .csv file with all of the repo data currently in the DB. Waiting to see what you get ;-)

Nagasaki45 / krihelinator

Better krihelimeter calculation #9