LeslieLeung / opensource-lighthouse

A summary of statistics on the open-source teams and projects of major tech companies.
https://opensource-lighthouse.streamlit.app
653 stars 41 forks

Team/repo Inclusion & Exclusion Rules #6

Open LeslieLeung opened 3 months ago

LeslieLeung commented 3 months ago

As noted by @chesha1 in #5, determining which team/repo should be counted as a company's contribution can be contentious.

Here are my considerations:

(Account level)

(Repo level)

chesha1 commented 3 months ago

Is it appropriate to include personal accounts that are clearly affiliated?

One large tech company may have thousands of developers with GitHub accounts, making it impractical to include them all. Even if we did include them all, teams.csv would become very large. Processing a CSV with thousands of lines in GitHub Actions might be considered abuse. We might need to switch to a FaaS such as AWS Lambda, which would mean more backend work.

Additionally, employees often have personal open-source projects, and companies only pay for the working hours of their staff; personal repositories don't belong to the companies.

Determining the affiliation of personal accounts is also challenging. Some accounts simply list their company name in the profile or bio without an "@", which makes the affiliation easy for humans to identify but hard for crawlers.

I may be wrong, but I believe it's better to only count company accounts.
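To illustrate the crawler difficulty described above, here is a minimal sketch of matching a free-text profile field against company names. The alias table and the function name `match_company` are hypothetical and only serve the example; a real table would have to be curated by maintainers.

```python
import re

# Hypothetical alias table, for illustration only.
COMPANY_ALIASES = {
    "microsoft": "Microsoft",
    "msft": "Microsoft",
    "alibaba": "Alibaba",
    "alibaba group": "Alibaba",
}


def match_company(profile_text):
    """Return the canonical company for a free-text profile field, or None."""
    if not profile_text:
        return None
    # Replace "@" markers and other punctuation with spaces, then collapse whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", profile_text.lower())
    cleaned = " ".join(cleaned.split())
    # Try longer aliases first so "alibaba group" is checked before "alibaba".
    for alias in sorted(COMPANY_ALIASES, key=len, reverse=True):
        if re.search(rf"\b{re.escape(alias)}\b", cleaned):
            return COMPANY_ALIASES[alias]
    return None
```

Both "@microsoft" and "Software engineer at Microsoft" resolve to the same company here, but a bio such as "ex-Microsoft" would match as well, which is exactly the kind of false positive that makes personal accounts hard to classify automatically.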

LeslieLeung commented 3 months ago

Apologies for the confusion. When I refer to "users," I am talking about accounts owned by companies rather than actual human individuals, such as https://github.com/azure-sdk.

chesha1 commented 3 months ago

For the first and third types of accounts mentioned in #5, it seems that you agree to include them, so I agree as well. Not excluding certain types of accounts will spare contributors the trouble of specifying account types.

How about docs accounts? Should we include them too?

LeslieLeung commented 3 months ago

According to https://www.igor.pro.br/publica/papers/saner2016.pdf, 28% of casual contributions to open source are documentation, so docs accounts should be included.

chesha1 commented 3 months ago

Some projects are donated to Apache or CNCF, but they are still maintained by the original company.

For example, Beam.

There are many projects under Apache or CNCF, some with over 10k stars, so ignoring them is unacceptable.

Can we currently fetch data at the repository level rather than the account level?

LeslieLeung commented 3 months ago

If a repository is donated to a foundation, I believe it should not be linked to the original company. For instance, Kubernetes, which was donated to the CNCF, still has Google's involvement in its development, but the larger ecosystem is supported by the community (as well as other companies). I personally think these repos should just go to CNCF.

The biggest flaw of opensource-lighthouse is that we can't track every individual contribution a company makes to the community. The best we can do is maintain an unbiased standard and offer a broad overview.

chesha1 commented 3 months ago

OK, considering that, #7 needs to be modified. Please do not merge it now.

Since we consider that donated projects do not belong to the original company, what about related projects? For example, Kubernetes was donated to CNCF, but there are also kubernetes-client, kubernetes-sigs, and Knative. Should these related projects be attributed to Google or CNCF?

Does GitHub have an API that lists a repository's contributors beyond the top 100? The web page only shows the first 100.

We can attribute donated projects to foundations for now and develop a more advanced feature in the future. For projects under foundations, we can set a threshold, say 80%. If at least 80% of the commits are made by the original company, the project will be attributed back to the original company.
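The threshold rule above could be sketched roughly as follows. The function name `attribute_repo` and the input shape (a mapping from company name to commit count) are assumptions made for illustration; how the commit counts are obtained is left out.

```python
def attribute_repo(commits_by_company, original_company, foundation, threshold=0.8):
    """Attribute a donated repo back to its original company if that company's
    share of commits reaches the threshold; otherwise keep it with the foundation."""
    total = sum(commits_by_company.values())
    if total == 0:
        # No commit data: fall back to the foundation.
        return foundation
    share = commits_by_company.get(original_company, 0) / total
    return original_company if share >= threshold else foundation
```

With made-up counts, `attribute_repo({"Google": 850, "Other": 150}, "Google", "CNCF")` yields the original company because its share is 85%, while a 40% share would leave the repo with the foundation.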

LeslieLeung commented 3 months ago

I came up with a little SOP:

chesha1 commented 3 months ago

Great idea.

But after checking the GitHub API documentation, I found that identifying the affiliation of projects under foundations is not as difficult as I initially thought, so forget what I mentioned earlier. We can automatically attribute projects to the companies that actually maintain them. It's better to handle this in one step than to improve it later.

We can get the top 5 repositories under an account, sorted by stars, then fetch the contributors of those repositories and check their profiles. If a certain percentage of commits comes from a specific company, we attribute the account to that company. The threshold for this percentage is an input parameter. The original vision of this project, opensource-lighthouse, is to show the open-source status of major tech companies. If too many projects sit under foundations such as Apache or CNCF, the foundations will dominate the list and make it meaningless.
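A minimal sketch of the steps above, assuming the repository list, per-repository contributor counts, and contributor profiles have already been fetched (for instance through the GitHub REST API). The function name `attribute_account`, the input shapes, and the default threshold are all hypothetical.

```python
from collections import Counter


def attribute_account(repos, contributors, company_of, threshold=0.5, top_n=5):
    """Guess which company an account belongs to, or return None.

    repos:        list of dicts with "name" and "stars"
    contributors: dict mapping repo name -> list of (login, commit_count)
    company_of:   dict mapping login -> company name, or None if unknown
    """
    # Keep only the top_n repositories by star count.
    top = sorted(repos, key=lambda r: r["stars"], reverse=True)[:top_n]

    # Sum commits per company; contributors with no known company go under None.
    commits = Counter()
    for repo in top:
        for login, count in contributors.get(repo["name"], []):
            commits[company_of.get(login)] += count

    total = sum(commits.values())
    known = {c: n for c, n in commits.items() if c is not None}
    if total == 0 or not known:
        return None

    # Unknown contributors still count toward the total, so they dilute the share.
    company = max(known, key=known.get)
    return company if known[company] / total >= threshold else None
```

Counting unaffiliated contributors in the denominator is a deliberate choice in this sketch: an account is attributed only when one company clearly dominates all commits, not just the commits whose authors could be identified.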

I can deploy a FaaS to handle this, since I have unused Cloudflare Workers quota every month. The workflow that updates the data currently takes about 20 minutes, and it will take longer as teams.csv grows. So it might be better to move some compute tasks to external resources to avoid being accused of abuse by GitHub.

I can provide an interface that returns the real affiliation of an account. For example, if teams.csv has a line kubernetes,CNCF, you can double-check that account using the external microservice.
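The double-check described above might look like the sketch below. It assumes teams.csv rows have the two-column form account,affiliation with no header, as in the kubernetes,CNCF example; the set of foundation labels and the `check_affiliation` callback (standing in for a call to the external service) are hypothetical.

```python
import csv
import io

# Assumed labels that mark a row as belonging to a foundation.
FOUNDATIONS = {"CNCF", "Apache"}


def recheck_rows(csv_text, check_affiliation):
    """Return (account, affiliation) pairs, re-checking foundation rows.

    check_affiliation(account) stands in for the external service; it should
    return a company name, or None when no company clearly dominates.
    """
    result = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        account, label = row[0].strip(), row[1].strip()
        if label in FOUNDATIONS:
            # Keep the foundation label when the service has no answer.
            label = check_affiliation(account) or label
        result.append((account, label))
    return result
```

Rows that already name a company are passed through untouched, so only foundation rows trigger a call to the service.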

The code will also be in this repository. How about placing it in services/affiliation-checker/index.ts or utils/affiliation-checker/index.ts?

chesha1 commented 3 months ago

In this case, we don't need to manually check these foundation or related projects. Simply categorize them under foundations, and an external service will verify the actual affiliations by checking the contributors' commits.