frank-zsy closed this 2 years ago
Hi, @frank-zsy. I would like to know whether the data we provide to other developers on OSS contains only the computed metrics and excludes the raw log data.
@zhicheng-ning Yes, only metrics or indices. Raw data will be provided only through the ClickHouse sample data image.
To make sure the data is sufficient for Hypercrx and other needs, we still need to implement the following metrics:
All the metrics above are implemented in #1036 now.
But if we upload about 2 million repos and 2 million developers, the file count right now will be 2 × 13 + 2 × 2 = 30 million, since there are 13 metrics for repos and 2 metrics for developers.
That many files is very hard to upload to OSS, and the count will grow further as we implement more metrics.
I think we can reduce the number of repos and users based on the mathematical properties and distribution of activity or OpenRank. A desirable amount could be about 0.5 million each for repos and users.
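To make the arithmetic concrete, here is a minimal sketch of the file-count estimate, assuming one file per (repo or developer, metric) pair as described above. The function and constant names are only illustrative and do not come from the project.

```python
def total_files(entities: int, metrics: int) -> int:
    """Number of files when each entity gets one file per metric."""
    return entities * metrics


REPO_METRICS = 13       # metrics per repo, as stated above
DEVELOPER_METRICS = 2   # metrics per developer, as stated above

# Current plan: about 2 million repos and 2 million developers.
full = total_files(2_000_000, REPO_METRICS) + total_files(2_000_000, DEVELOPER_METRICS)

# Reduced plan: about 0.5 million repos and 0.5 million users.
reduced = total_files(500_000, REPO_METRICS) + total_files(500_000, DEVELOPER_METRICS)

print(f"full export:    {full:,} files")
print(f"reduced export: {reduced:,} files")
```

Under these assumptions the reduced set comes to 7.5 million files, a quarter of the full export.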
A meaningful method is to use the `ln` distribution of OpenRank for all repos and users. With `ln(openrank) > 1`, which means `openrank > e`, in any month, there are currently 0.52 million repos and 0.35 million users, which is a desirable amount.
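The filter could be sketched roughly as follows. This is only an illustration of the `ln(openrank) > 1` rule applied per month. The input shape (a month-to-OpenRank mapping per repo or user), the function name, and the sample values are assumptions, not the actual export code.

```python
import math
from typing import Mapping


def keep_entity(monthly_openrank: Mapping[str, float], threshold: float = 1.0) -> bool:
    """Return True if ln(openrank) > threshold in at least one month."""
    return any(
        # ln is undefined for non-positive values, so those months never qualify.
        value > 0 and math.log(value) > threshold
        for value in monthly_openrank.values()
    )


# Hypothetical sample data: month -> OpenRank value.
sample = {
    "example/busy-repo": {"2022-01": 1.8, "2022-02": 3.1},
    "example/quiet-repo": {"2022-01": 0.4, "2022-02": 2.5},
}

selected = [name for name, series in sample.items() if keep_entity(series)]
print(selected)
```

Only the first sample entry is kept here, because 3.1 is above e (about 2.718) while 2.5 is not.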
As mentioned in #1030, I would like to add a new data export method that exports GitHub repo and user data to OSS, so developers can use the data to build their own applications.
The plan is:
Each metric is exported as its own file, such as `activity.json`, `openrank.json`, `issue_resolution_duration.json`, `bus_factor.json`, etc., so developers can easily choose which data to use.
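As a rough sketch of the per-metric layout, the snippet below writes one JSON file per metric into a local directory that stands in for the OSS bucket. The directory structure (`<root>/<owner>/<repo>/<metric>.json`), the function name, and the sample values are assumptions for illustration only. The real export paths are not specified in this thread.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def export_metrics(root: Path, name: str, metrics: Dict[str, dict]) -> List[Path]:
    """Write one JSON file per metric under root/name/ and return the paths."""
    target = root / name
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for metric, series in metrics.items():
        path = target / f"{metric}.json"
        path.write_text(json.dumps(series, sort_keys=True), encoding="utf-8")
        written.append(path)
    return written


# Hypothetical repo name and values; a temp directory replaces the OSS bucket.
with tempfile.TemporaryDirectory() as tmp:
    files = export_metrics(
        Path(tmp),
        "example-owner/example-repo",
        {
            "activity": {"2022-01": 12.5},
            "openrank": {"2022-01": 3.2},
        },
    )
    names = sorted(p.name for p in files)
    loaded = json.loads(files[0].read_text(encoding="utf-8"))

print(names)
```

With a layout like this, a consumer that only needs OpenRank can fetch `openrank.json` for a repo and skip the other files.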