X-lab2017 / open-digger

Open source analysis tools
https://open-digger.cn
Apache License 2.0
298 stars, 88 forks

[Feature] New data export method #1033

Closed frank-zsy closed 2 years ago

frank-zsy commented 2 years ago

As mentioned in #1030, I would like to add a new data export method that exports GitHub repo and user data to OSS, so that developers can use the data to build their own applications.

The plan is:

zhicheng-ning commented 2 years ago

Hi, @frank-zsy. Will the data in OSS that we provide to other developers contain only the computed metrics, and not the raw log data?

frank-zsy commented 2 years ago

@zhicheng-ning Yes, only metrics and indices. Raw data will be provided only through the ClickHouse sample data image.

frank-zsy commented 2 years ago

To make sure the data is sufficient for Hypercrx and other use cases, we still need to implement the following metrics:

frank-zsy commented 2 years ago

All the metrics above are implemented in #1036 now.

But if we upload about 2 million repos and 2 million developers, the number of files will currently be 2 × 13 + 2 × 2 = 30 million, since there are 13 metrics for repos and 2 metrics for developers.

That is far too many files to upload to OSS in practice, and the count will grow further as we implement more metrics.
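To make the arithmetic easy to re-run as metrics are added, here is a minimal sketch of the file-count estimate. It assumes one exported file per (entity, metric) pair, which is what the 30 million figure implies; the function name `export_file_count` is hypothetical and not part of open-digger.

```python
def export_file_count(n_repos: int, n_users: int,
                      repo_metrics: int, user_metrics: int) -> int:
    """Estimate the number of exported files.

    Assumes one file per (entity, metric) pair, so the total is
    repos * repo metrics + users * user metrics.
    """
    return n_repos * repo_metrics + n_users * user_metrics


# Figures from this thread: 2 million repos with 13 metrics each,
# 2 million developers with 2 metrics each.
current = export_file_count(2_000_000, 2_000_000, 13, 2)
print(f"{current:,}")  # 30,000,000

# The same estimate for a reduced set of 0.5 million repos and users each.
reduced = export_file_count(500_000, 500_000, 13, 2)
print(f"{reduced:,}")  # 7,500,000
```

Because the total scales linearly with both the entity count and the metric count, cutting the entity set is the only lever that keeps the export manageable as more metrics are added.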

I think we can reduce the number of repos and users based on the statistical properties and distribution of activity or OpenRank. A desirable amount would be about 0.5 million each for repos and users.

frank-zsy commented 2 years ago

A meaningful method is to use the distribution of ln(OpenRank) across all repos and users.

If we keep those with ln(openrank) > 1, i.e. openrank > e (about 2.718), for any month, there are currently 0.52 million repos and 0.35 million users, which is a desirable amount.
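The filter can be sketched roughly as follows. This is only an illustration: the sample data, the dictionary layout, and the name `passes_ln_filter` are all hypothetical, and it assumes "for any month" means the threshold is exceeded in at least one month.

```python
import math

# Hypothetical sample data: entity name -> {month: OpenRank value}.
monthly_openrank = {
    "org/repo-a": {"2022-01": 1.2, "2022-02": 3.4},
    "org/repo-b": {"2022-01": 0.8, "2022-02": 2.1},
    "org/repo-c": {"2022-01": 15.0},
}


def passes_ln_filter(series: dict, threshold: float = 1.0) -> bool:
    """Return True if ln(openrank) > threshold in at least one month.

    Assumption: "for any month" is read as "at least one month".
    """
    # Non-positive values have no logarithm, so they never pass.
    return any(v > 0 and math.log(v) > threshold for v in series.values())


selected = [name for name, series in monthly_openrank.items()
            if passes_ln_filter(series)]
print(selected)  # ['org/repo-a', 'org/repo-c']
```

With the default threshold of 1, the condition is equivalent to openrank > e, so `org/repo-b` (peak value 2.1) is dropped while the other two are kept.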