X-lab2017 / open-digger

Open source analysis tools
https://open-digger.cn
Apache License 2.0
286 stars, 85 forks

[Sample Data] Request a list of all repositories in the open-digger dataset, in the format "GithubId/repository name". #1408

Closed: ZhangChunXian closed this issue 10 months ago

ZhangChunXian commented 11 months ago

Usage

For personal research

Extract SQL

I would like a list of all repositories in the open-digger dataset, in the format "GithubId/repository name", such as "X-lab2017/open-digger". I'd appreciate it if you could provide it.

Does this dataset need to be updated regularly?

No response

Zzzzzhuzhiwei commented 11 months ago

Hi, you can use the data in the file for your research, or you can use the labeled data we released. Both contain the list of repositories.

frank-zsy commented 10 months ago

@Zzzzzhuzhiwei I think @ZhangChunXian is requesting the full list of repos and users that OpenDigger exports, which is not currently included in the OpenDigger sample data or exported data.

I think we can do this in the monthly export task, writing to a CSV or JSONL file; however, the files may be really large.

Zzzzzhuzhiwei commented 10 months ago

I see! If we export this file, it might be too large. There are currently 328,032,951 distinct repositories in ClickHouse.

frank-zsy commented 10 months ago

Although there are lots of repos on GitHub, we only export about 500 thousand of them. I tried to find out how large the file would be:

[image]

If we use a CSV file with one id and name per line, the file will be about 17MB.
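As a rough sanity check on that figure, 17MB over about 500 thousand rows works out to roughly 34 bytes per line, which is plausible for a numeric id plus an "owner/repo" name. A minimal sketch of how such an estimate could be reproduced is below; the sample rows and ids are made up for illustration, and `csv_size_bytes` is a hypothetical helper, not part of OpenDigger.

```python
import csv
import io


def csv_size_bytes(rows):
    """Return the UTF-8 size of rows written as CSV, one (id, name) pair per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return len(buf.getvalue().encode("utf-8"))


# Hypothetical sample rows; the ids here are invented, not real GitHub ids.
sample = [(1, "X-lab2017/open-digger"), (2, "example-org/example-repo")]

# Extrapolate from the average row size to the approximate export size.
avg_bytes_per_row = csv_size_bytes(sample) / len(sample)
estimated_total = avg_bytes_per_row * 500_000
print(f"estimated size: {estimated_total / 1e6:.1f} MB")
```

Real ids are longer than the single digits used here, so an estimate taken from an actual sample of rows would come out higher than this toy one.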

frank-zsy commented 10 months ago

And the user file will be about 5MB. I think this is feasible for the monthly export task.

/self-assign

frank-zsy commented 10 months ago

@ZhangChunXian Thanks for the issue. The lists have been exported to repo_list.csv and user_list.csv; please feel free to use them in your research. Any further suggestions and questions are welcome.
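For anyone consuming the export, a minimal sketch of reading the repository names might look like the following. It assumes each row of repo_list.csv has the form `id,name` as described earlier in this thread, with the name already in "owner/repo" form; whether the file has a header row is not stated here, so the sketch tolerates one. `read_repo_names` is a hypothetical helper.

```python
import csv
import io


def read_repo_names(lines):
    """Yield 'owner/repo' names from CSV rows of the assumed form id,name."""
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        name = row[1].strip()
        if "/" in name:  # also skips a header row such as "id,name", if present
            yield name


# In-memory stand-in for the downloaded file; the id is invented.
sample = io.StringIO("id,name\n1,X-lab2017/open-digger\n")
print(list(read_repo_names(sample)))
```

With a downloaded copy, the same function could be called as `read_repo_names(open("repo_list.csv", newline="", encoding="utf-8"))`.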