X-lab2017 / open-digger

Open source analysis tools
https://open-digger.cn
Apache License 2.0
286 stars 85 forks

[Idea] 2023 China Open Source Annual Report #1421

Closed. wj23027 closed this issue 7 months ago

wj23027 commented 10 months ago

Description

2023 China Open Source Annual Report follows a structure similar to the previous year, consisting of four sections:

We welcome your suggestions for each section to enhance the report further. Your input is highly appreciated as it will contribute to the report's overall improvement.

mahongweichina commented 10 months ago

Writing the report generates some intermediate data. For example, the 2022 report presented the status of foundation projects at the company level, while the 2021 report presented it at the project level. I think presenting it at the company level is acceptable, which makes the project-level status intermediate data. I hope that this and similar intermediate data can also be made publicly available in some other form.

Additionally, not all of the calculation methods have been publicly disclosed. For example, the specific method for determining which repositories belong to which companies is not fully disclosed. I hope this information can be made public as far as possible so that others can help improve the methods.

mahongweichina commented 10 months ago

Community and technology trends are constantly changing, but we don't know what these changes entail. Contributor migration can reflect such trends. I suggest building on the current algorithm to provide a way of visualizing contributor migration trends.
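As a rough illustration of what the input to such a visualization could look like, here is a minimal sketch that counts contributor moves between two periods. It assumes each contributor has already been assigned one "primary" repository per period; the function name, the input shape, and the repository names are hypothetical and are not part of OpenDigger.

```python
from collections import Counter


def migration_flows(primary_repo_before, primary_repo_after):
    """Count contributors who moved from one primary repo to another.

    Both arguments are hypothetical dicts mapping a contributor login to
    the repository where that contributor was most active in a period.
    Contributors missing from the later period, or who stayed in the
    same repository, are not counted as migrations.
    """
    flows = Counter()
    for login, source in primary_repo_before.items():
        target = primary_repo_after.get(login)
        if target is not None and target != source:
            flows[(source, target)] += 1
    return flows


# Made-up example data, for illustration only.
before = {"alice": "org/a", "bob": "org/a", "carol": "org/b"}
after = {"alice": "org/c", "bob": "org/c", "carol": "org/b", "dave": "org/c"}
print(migration_flows(before, after))
```

The resulting (source, target) counts are the kind of edge list that a Sankey or chord diagram could be drawn from.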

frank-zsy commented 10 months ago

> Writing the report generates some intermediate data. For example, the 2022 report presented the status of foundation projects at the company level, while the 2021 report presented it at the project level. I think presenting it at the company level is acceptable, which makes the project-level status intermediate data. I hope that this and similar intermediate data can also be made publicly available in some other form.

We can design the structure of the report to show the information that developers and companies really care about. We also produce the data from OpenDigger in Jupyter Notebook first, as we did for the 2021 report and the 2022 report. Even if some data is not included in the final report, we can still produce it and put it into the notebook.

> Additionally, not all of the calculation methods have been publicly disclosed. For example, the specific method for determining which repositories belong to which companies is not fully disclosed. I hope this information can be made public as far as possible so that others can help improve the methods.

As shown in the notebooks above, all the data comes from OpenDigger, and the company information has been public since the first day of OpenDigger. You can find all the repository-to-company relationships in the companies label data. Some companies, such as Alibaba, Ant Group, and FitCloud, keep their repository lists updated for OpenDigger. You are welcome to update the data by opening an issue, and we will update the data for any company.

mahongweichina commented 10 months ago

> > Additionally, not all of the calculation methods have been publicly disclosed. For example, the specific method for determining which repositories belong to which companies is not fully disclosed. I hope this information can be made public as far as possible so that others can help improve the methods.

> As shown in the notebooks above, all the data comes from OpenDigger, and the company information has been public since the first day of OpenDigger. You can find all the repository-to-company relationships in the companies label data. Some companies, such as Alibaba, Ant Group, and FitCloud, keep their repository lists updated for OpenDigger. You are welcome to update the data by opening an issue, and we will update the data for any company.

Perhaps a more effective approach is to determine the company affiliation of a repository or organization from the company with the highest number of contributors to it. Typically, most contributors to a repository come from the company that owns it. I acknowledge that this method may not be entirely precise, but combined with manual effort it can yield satisfactory results. What do you think? Thank you @frank-zsy.
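To make the proposal concrete, here is a minimal sketch of the "company with the most contributors" heuristic. It assumes each contributor's company is already known (or None when unknown); the function name, the `min_share` threshold, and the sample data are hypothetical and do not describe how OpenDigger assigns repositories today.

```python
from collections import Counter


def infer_repo_company(contributor_companies, min_share=0.0):
    """Guess a repository's company from its contributors' companies.

    contributor_companies: dict mapping contributor login -> company
    name, or None when the company is unknown.
    Returns (company, share), where share is the winner's fraction of
    contributors with a known company. Returns (None, share) when no
    company is known or the winner's share is below min_share, which
    signals that the repository needs manual review.
    """
    counts = Counter(c for c in contributor_companies.values() if c)
    if not counts:
        return None, 0.0
    company, top = counts.most_common(1)[0]
    share = top / sum(counts.values())
    if share < min_share:
        return None, share
    return company, share


# Made-up example data, for illustration only.
contributors = {
    "alice": "CompanyA",
    "bob": "CompanyA",
    "carol": "CompanyB",
    "dave": None,
    "erin": "CompanyC",
}
print(infer_repo_company(contributors))
```

The returned share gives a natural hook for the manual step: repositories with a low share could be queued for a human to check rather than labeled automatically.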

frank-zsy commented 10 months ago

@mahongweichina This could be a method to determine the company affiliation of a repository or organization, but determining developers' affiliation is actually even more challenging than determining repository affiliation. How can we determine contributors' company information? I know that for some Alibaba projects, the majority of contributions come from external developers rather than internal ones, so the result would be hard to use as solid proof of company affiliation.

mahongweichina commented 10 months ago

> @mahongweichina This could be a method to determine the company affiliation of a repository or organization, but determining developers' affiliation is actually even more challenging than determining repository affiliation. How can we determine contributors' company information? I know that for some Alibaba projects, the majority of contributions come from external developers rather than internal ones, so the result would be hard to use as solid proof of company affiliation.

I agree that determining contributors' company information is a challenge. We can attempt to determine it from the company and email information in their profiles, although this method is not perfect. It is true that on some projects the majority of contributions come from external developers rather than internal ones. However, these external developers come from many different companies, and the number of developers from any single external company is generally no greater than the number of internal developers.
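A minimal sketch of the profile-based guess might look like the following. It assumes we already have each contributor's profile company string and email address; the domain-to-company table and the list of public mail providers are hypothetical placeholders that would need to be curated by hand.

```python
# Hypothetical lookup table; real entries would need manual curation.
EMAIL_DOMAIN_TO_COMPANY = {"example-corp.com": "example corp"}

# Public mail providers say nothing about an employer (illustrative subset).
PUBLIC_EMAIL_DOMAINS = {"gmail.com", "qq.com", "163.com", "outlook.com"}


def normalize_company(raw):
    """Trim whitespace and a leading '@', then lowercase the company field."""
    if not raw:
        return None
    cleaned = raw.strip().lstrip("@").strip().lower()
    return cleaned or None


def guess_company(profile_company, email):
    """Prefer the profile company field, then fall back to the email domain."""
    company = normalize_company(profile_company)
    if company:
        return company
    if email and "@" in email:
        domain = email.rsplit("@", 1)[1].lower()
        if domain in PUBLIC_EMAIL_DOMAINS:
            return None
        return EMAIL_DOMAIN_TO_COMPANY.get(domain)
    return None


print(guess_company("@Example Corp ", None))
print(guess_company(None, "dev@example-corp.com"))
print(guess_company(None, "dev@gmail.com"))
```

Contributors for whom this returns None would simply be left out of the per-company counts, which is one reason the approach is imperfect.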

frank-zsy commented 10 months ago

@mahongweichina Agreed, contributor diversity is quite important for a community. OpenDigger also publishes this as the contributor email suffixes metric. With this metric you can easily find the email suffixes of contributors in a community and use them as an index of diversity.
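For illustration, here is a small sketch of how email suffix counts could be turned into a single diversity number. It works on a plain list of email addresses rather than on OpenDigger's actual metric output, and the use of Shannon entropy as the index is an assumption made for this example, not OpenDigger's definition.

```python
import math
from collections import Counter


def email_suffix_counts(emails):
    """Count contributors per email suffix (the part after the last '@')."""
    suffixes = (e.rsplit("@", 1)[1].lower() for e in emails if "@" in e)
    return Counter(suffixes)


def shannon_diversity(counts):
    """Shannon entropy in bits; 0.0 means every contributor shares one suffix."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Made-up example data, for illustration only.
emails = ["a@x.com", "b@x.com", "c@y.org", "d@Z.io"]
counts = email_suffix_counts(emails)
print(counts)
print(shannon_diversity(counts))
```

A higher value means contributors are spread more evenly across more suffixes, so two communities of similar size can be compared directly.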