How to scrape a large amount of data?

cs373gc-fall-2016 / spartify

Apache License 2.0


How to scrape a large amount of data? #45

Closed: lock14 closed this issue 8 years ago

lock14 commented 8 years ago

We need a bunch of data. Use the GitHub API (https://developer.github.com/v3/).

Only language names are available through the GitHub API (as far as I can tell, anyway), so other language data, such as creator, paradigms, etc., will probably need to be scraped by hand. Wikipedia is a good source for this. There won't be many languages (probably < 20), and we already have data on most of the popular ones (C, C++, C#, Java, Ruby, Python, JavaScript), so we don't even need data for those.
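
To see which languages still need hand-collected data, something like the following might help. It is only a sketch: it assumes the scraped repository objects carry the `language` field that GitHub API v3 returns for a repository's primary language, and the function name `languages_needing_metadata` is made up for illustration.

```python
# Languages that already have data, per the list in the comment above.
KNOWN_LANGUAGES = {"C", "C++", "C#", "Java", "Ruby", "Python", "JavaScript"}


def languages_needing_metadata(repos, known=KNOWN_LANGUAGES):
    """Return the sorted language names found in `repos` that are not in `known`.

    `repos` is an iterable of repository objects (dicts) as returned by the
    GitHub API; "language" holds the primary language name, or None.
    This helper is hypothetical and not part of the project.
    """
    seen = {repo["language"] for repo in repos if repo.get("language")}
    return sorted(seen - set(known))


if __name__ == "__main__":
    # Tiny hand-written sample, not real scraped data.
    sample = [{"language": "Go"}, {"language": "Python"}, {"language": None}]
    print(languages_needing_metadata(sample))
```

The result is the short list of languages whose creator, paradigms, and so on would still have to be looked up on Wikipedia.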

cindywu2018 commented 8 years ago

Need to figure out the best method to scrape data. For the Languages model, it's really easy to just populate it with data copied and pasted from Wikipedia, but for models like Projects, there needs to be an automatic way to get a lot of data.

One way I'm thinking of is to make a GET API call to GitHub that returns a list of all organizations in the order that they were created on GitHub, and then, for each organization, follow its repos_url or members_url.
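
A rough sketch of that idea, assuming the `GET /organizations` endpoint with its `since` and `per_page` parameters, and unauthenticated requests (so the rate limit is low). The helper names `fetch_json`, `iter_organizations`, and `iter_org_repos` are invented for this example, and only the first page of each organization's repos is read.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def fetch_json(url):
    """GET `url` and decode the JSON body (unauthenticated request)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github.v3+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_organizations(fetch=fetch_json, since=0, max_pages=1):
    """Yield organization objects in creation order, reading at most `max_pages` pages."""
    for _ in range(max_pages):
        page = fetch(f"{API_ROOT}/organizations?since={since}&per_page=100")
        if not page:
            return
        yield from page
        # The next page starts after the highest organization id seen so far.
        since = page[-1]["id"]


def iter_org_repos(org, fetch=fetch_json):
    """Yield repositories of one organization by following its repos_url (first page only)."""
    yield from fetch(org["repos_url"])


if __name__ == "__main__":
    # Performs real network requests when run directly.
    for org in iter_organizations(max_pages=1):
        for repo in iter_org_repos(org):
            print(org["login"], repo["name"], repo.get("language"))
```

Passing `fetch` as a parameter keeps the paging logic testable without hitting the network; an authenticated fetcher could be swapped in the same way to raise the rate limit.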

lock14 commented 8 years ago

I collected a bunch of data, and it's been committed already. We just need to get the language data (we only have the names right now) and then feed it into the DB; then we can create a SQL dump file that we can use like before.
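
For the load-then-dump step, here is a minimal sketch using Python's built-in `sqlite3` module. This is an assumption for illustration only: the thread does not say which database the project uses, and the `language` and `project` tables and the `build_dump` helper are hypothetical.

```python
import sqlite3


def build_dump(languages, projects):
    """Load rows into an in-memory SQLite database and return its SQL dump as text.

    `languages` is an iterable of names; `projects` is an iterable of
    (project_name, language_name) pairs. The schema is made up for this example.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE language (name TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE project (name TEXT, language TEXT)")
    conn.executemany(
        "INSERT INTO language (name) VALUES (?)", [(name,) for name in languages]
    )
    conn.executemany(
        "INSERT INTO project (name, language) VALUES (?, ?)", list(projects)
    )
    conn.commit()
    # iterdump() yields the SQL statements needed to recreate the database.
    dump = "\n".join(conn.iterdump())
    conn.close()
    return dump


if __name__ == "__main__":
    print(build_dump(["Python"], [("spartify", "Python")]))
```

With a different database, the equivalent would be that database's own dump tool; the point is only that the dump can be regenerated from the scraped data whenever it changes.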

lock14 commented 8 years ago

The scraper I built is here: https://github.com/lock14/github_scaper