X-lab2017 / open-digger

Open source analysis tools
https://open-digger.cn
Apache License 2.0
286 stars 86 forks

[Thesis need] What's the best way to get data? #872

Closed gymgym1212 closed 1 year ago

gymgym1212 commented 2 years ago

Hi community, I have recently been thinking about how to get the data I need for my thesis experiment. I have a list of CII repositories, and I want to get all the repositories and developers within 6 hops of those repositories in the GitHub-wide collaboration network, then use that data to build my heterogeneous developer collaboration network.

This is not a simple task, as it involves a lot of data and can be very time-consuming. What are the best practices for getting this done? Would a notebook work? @frank-zsy @wengzhenjie @bifenglin

cii_repo_list.csv

frank-zsy commented 2 years ago

First you need to define how the collaboration network is built: what the nodes and edges are. Also, 6 hops may cover a lot of data, possibly almost the entire network. You may need to constrain the time range as well, since the data grows as the time range expands. Maybe I can do this for you on the graph database.
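As a rough illustration of those choices (node types, edge definition, time range), here is a minimal Python sketch. The `(user, repo, timestamp)` event tuples, the `build_network` name, and the sample data are all hypothetical; this is not OpenDigger's actual schema or pipeline.

```python
from collections import defaultdict
from datetime import datetime


def build_network(events, start, end):
    """Build an undirected user/repo adjacency map.

    events: iterable of (user, repo, timestamp) tuples (hypothetical format).
    Only events with start <= timestamp < end become edges.
    """
    adj = defaultdict(set)
    for user, repo, ts in events:
        if start <= ts < end:
            u = ("user", user)  # tag nodes by type so names cannot collide
            r = ("repo", repo)
            adj[u].add(r)
            adj[r].add(u)
    return adj


# Made-up sample events, only to show the shape of the input.
sample_events = [
    ("alice", "org/project-a", datetime(2020, 3, 1)),
    ("bob", "org/project-a", datetime(2018, 6, 1)),
    ("alice", "org/project-b", datetime(2020, 9, 15)),
]
network = build_network(sample_events, datetime(2020, 1, 1), datetime(2021, 1, 1))
print(len(network), "nodes in the 2020 window")
```

Narrowing `start` and `end` is the simplest lever for keeping the graph small, since every extra month of activity adds edges.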

gymgym1212 commented 2 years ago

Many thanks! @frank-zsy I'm still considering the exact structure of the network, and I'll keep commenting here as I settle on it. I will also learn your data processing flow, because I may adjust the graph structure and can't keep bothering you every time.

frank-zsy commented 2 years ago

@gymgym1212 Here is some data from a trial with the heterogeneous collaboration network, with activity as edges and users/repos as nodes. Starting only from the CII badge repos, with no time range limit, the nodes returned are 67,323 for 2 hops, 3,180,757 for 3 hops, 11,435,225 for 4 hops, and 17,774,809 for 5 hops (half of the whole network).
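For readers who want to see what "nodes returned for k hops" means, here is a minimal breadth-first search sketch in Python over an in-memory adjacency map. It assumes an undirected user/repo graph; the function name and the toy graph are hypothetical, and the numbers above came from a graph database query, not from this code.

```python
from collections import deque


def nodes_within_hops(adj, seeds, max_hops):
    """Return {node: hop distance} for every node within max_hops of any seed."""
    dist = {s: 0 for s in seeds}
    queue = deque(dist)  # iterating the dict drops duplicate seeds
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adj.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


# Toy graph: users and repos alternate along each path.
toy = {
    "repoA": {"alice", "bob"},
    "alice": {"repoA", "repoB"},
    "bob": {"repoA"},
    "repoB": {"alice", "carol"},
    "carol": {"repoB", "repoC"},
    "repoC": {"carol"},
}
for k in range(1, 5):
    print(k, "hops:", len(nodes_within_hops(toy, ["repoA"], k)), "nodes")
```

Because each hop expands every node found so far, the reachable set can grow very quickly on a dense graph, which matches the jump from 2 to 3 hops reported above.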

When I tried to get data for 6 hops, the query did not return within 10 hours. My guess is that if you start with more than 4,000 nodes, the nodes returned in 3 hops could be enormous, maybe 10 million nodes, which is 1/3 of the whole network, and the whole network may be returned in 5 hops.

Update: starting from kubernetes, the 5-hop query returns in 2 hours with 22,488,086 nodes (the largest weakly connected component (WCC) of the network has 26,119,218 nodes), which covers almost all the active nodes on GitHub. @gymgym1212
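To put the reported counts in proportion, the short Python sketch below computes the hop-to-hop growth factors for the CII run and the share of the largest WCC reached from kubernetes (roughly 86%). The variable names are illustrative; the figures are copied from the comments above.

```python
# Node counts reported in this thread for the CII badge starting set.
cii_nodes_by_hops = {2: 67_323, 3: 3_180_757, 4: 11_435_225, 5: 17_774_809}

# Growth factor from each hop count to the next.
hops = sorted(cii_nodes_by_hops)
growth = {
    (a, b): cii_nodes_by_hops[b] / cii_nodes_by_hops[a]
    for a, b in zip(hops, hops[1:])
}
for (a, b), factor in growth.items():
    print(f"{a} -> {b} hops: x{factor:.2f}")

# Share of the largest WCC reached in 5 hops from kubernetes.
kubernetes_5_hops = 22_488_086
largest_wcc = 26_119_218
coverage = kubernetes_5_hops / largest_wcc
print(f"coverage of largest WCC: {coverage:.1%}")
```

The growth factor shrinks with each additional hop, which is what one would expect as the search runs out of unvisited nodes.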