missaugustina closed this issue 7 years ago
1) Get a list of repositories and assign each one a row number (BigQuery legacy SQL): SELECT *, ROW_NUMBER() OVER() AS number FROM ( SELECT repo.name, org.login FROM [githubarchive:day.20170108] GROUP BY repo.name, org.login )
(The githubarchive table can also be month.YYYYMM or year.YYYY.)
2) Save the result to a table in your dataset of choice
3) Generate one or more random numbers between 1 and the number of rows in the table (https://www.random.org/integers works if the max is under 10^6)
4) Get your sample: SELECT * FROM your_set.your_table WHERE number IN (x, y, z)
5) Use Shuffleboard to download the data for your sample repos from GitHub and import it into the database
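
Steps 3 and 4 could also be scripted locally instead of using random.org. The following is only a rough sketch: the function names, the row count of 250000, and the sample size of 5 are hypothetical placeholders, and it assumes you want distinct row numbers (sampling without replacement), which the steps above do not specify.

```python
import random


def sample_row_numbers(row_count, sample_size, seed=None):
    """Pick sample_size distinct row numbers between 1 and row_count inclusive."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return sorted(rng.sample(range(1, row_count + 1), sample_size))


def build_sample_query(table, numbers):
    """Format the step 4 query for the chosen row numbers."""
    in_list = ", ".join(str(n) for n in numbers)
    return f"SELECT * FROM {table} WHERE number IN ({in_list})"


# Hypothetical values: replace row_count with the actual size of your saved table.
numbers = sample_row_numbers(row_count=250000, sample_size=5, seed=42)
query = build_sample_query("your_set.your_table", numbers)
print(query)
```

Passing a seed is optional; recording it alongside the sample lets someone else regenerate the same set of repositories later.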
Updated wiki with the info in the last comment.
This will be tested in another issue by getting demographics and comparing the variance between samples.