countering-bean-counting / bonnyci_ci-plunder

CI usage data plundering

Develop a method for random sampling #6

Closed by missaugustina 7 years ago

missaugustina commented 7 years ago

The sampling method will be tested in a separate issue by collecting demographics for several samples and comparing the variance between them.

missaugustina commented 7 years ago

1) Get a list of repositories and assign each one a number (BigQuery legacy SQL):

    SELECT *, ROW_NUMBER() OVER() AS number
    FROM (
      SELECT repo.name, org.login
      FROM [githubarchive:day.20170108]
      GROUP BY repo.name, org.login
    )

(The githubarchive table can also be month.YYYYMM or year.YYYY.)

2) Save the result to a table in your dataset of choice

3) Generate one or more random numbers between 1 and the number of rows in the table (https://www.random.org/integers works if the maximum is under 10^6)

4) Get your sample: SELECT * FROM your_set.your_table WHERE number IN (x, y, z)

5) Use Shuffleboard to download the data for your sample repos from GitHub and import it into the database
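
Steps 3 and 4 could also be scripted instead of done by hand. The following is a minimal sketch, not part of the original workflow. It assumes Python's pseudo-random generator is an acceptable substitute for random.org and that row numbers should be drawn without replacement. The function names `draw_row_numbers` and `build_sample_query` are hypothetical, and `your_set.your_table` is the placeholder from step 4.

```python
import random


def draw_row_numbers(total_rows, sample_size, seed=None):
    """Pick sample_size distinct row numbers from 1..total_rows (step 3).

    Assumption: sampling without replacement, so no repository is picked twice.
    Passing a seed makes the draw reproducible.
    """
    if not 1 <= sample_size <= total_rows:
        raise ValueError("sample_size must be between 1 and total_rows")
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, total_rows + 1), sample_size))


def build_sample_query(table, row_numbers):
    """Build the step 4 query that selects the sampled rows by number."""
    number_list = ", ".join(str(n) for n in row_numbers)
    return f"SELECT * FROM {table} WHERE number IN ({number_list})"


if __name__ == "__main__":
    # total_rows is illustrative; use the actual row count of the saved table.
    numbers = draw_row_numbers(total_rows=250000, sample_size=5, seed=42)
    print(build_sample_query("your_set.your_table", numbers))
```

Recording the seed alongside the sample lets the same draw be repeated later, which may help when comparing variance between samples.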

missaugustina commented 7 years ago

Updated wiki with the info in the last comment.