Challenge & Notes

Challenge

classify/label repos automatically
analyze relevant features
document design thoughts and training approach

Documentation Structure

Data Exploration and Prediction Model
- analyze and document relevant features
- document how to avoid overfitting
- explain why we've decided to use the features
- explain how we've developed the prediction model
Automated Classification
- implement the app that takes the input format and creates the output format
- either 1) prompt for the training data to use or 2) directly include the learned model
Validation
- validate with Appendix B
- create a boolean matrix comparing our manually estimated label with the predicted one
- compute recall per category
- compute precision per category
- discuss the quality of the results and whether higher yield (recall) or higher precision is more important
Extension
- use the model for a nice app
Furthermore
- document 3 repos where we think our model will yield better results
- installation and user manual
- document decisions we made for features, algorithms, data structures, software development tools and practices
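
For the Validation step, recall and precision per category could be computed roughly as follows. This is only a sketch: the function name `per_category_scores`, the category names and the sample labels are made up for illustration, and the real labels would come from Appendix B and from our model.

```python
from collections import Counter


def per_category_scores(true_labels, predicted_labels):
    """Return {category: (precision, recall)} for two equally long label lists."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    true_count = Counter(true_labels)            # how often each category really occurs
    predicted_count = Counter(predicted_labels)  # how often each category was predicted
    hits = Counter()                             # correct predictions per category
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            hits[truth] += 1

    scores = {}
    for category in sorted(set(true_labels) | set(predicted_labels)):
        # precision: share of predictions for this category that were correct
        precision = hits[category] / predicted_count[category] if predicted_count[category] else 0.0
        # recall: share of repos in this category that the model found
        recall = hits[category] / true_count[category] if true_count[category] else 0.0
        scores[category] = (precision, recall)
    return scores


# Hypothetical example data, not real results
manual = ["Data", "Data", "Software", "Homework", "Software"]
predicted = ["Data", "Software", "Software", "Homework", "Data"]
scores = per_category_scores(manual, predicted)
for category, (precision, recall) in scores.items():
    print(f"{category}: precision={precision:.2f} recall={recall:.2f}")
```

A category that is never predicted gets precision 0.0 here instead of raising a division error; whether that is the right convention is something to decide when discussing the results.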
Notes

Examples of DATA repositories
- openaddresses / openaddresses
- unitedstates / congress-legislators
- OpenExoplanetCatalogue / open_exoplanet_catalogue
- Chicago / food-inspections-evaluation
- GSA / data
- cernopendata / opendata.cern.ch
- benbalter / congressional-districts

Extension

"Improve yourself"

Login with GitHub
-> Stats of your own repos, e.g. 30% Data, 70% Software
-> Stats of repos your friends recently starred |-Data-| Software | Homework | ...|
-> Stats of recently trending repos |-Data-| Software | Homework | ...|
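
The per-user stats could be derived from the predicted labels with a simple count. A minimal sketch, assuming the model already returned one label per repo; `label_shares` and the sample list are hypothetical:

```python
from collections import Counter


def label_shares(labels):
    """Return the share of each label in percent, rounded to whole numbers."""
    if not labels:
        return {}
    counts = Counter(labels)
    total = len(labels)
    return {label: round(100 * count / total) for label, count in counts.items()}


# Hypothetical predictions for one user's ten repos
own_repos = ["Data"] * 3 + ["Software"] * 7
shares = label_shares(own_repos)
print(shares)
```

Because each share is rounded separately, the percentages may not always add up to exactly 100.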

Sources:

https://github.com/caesar0301/awesome-public-datasets -> what's hosted at github?
https://github.com/datasets
https://github.com/showcases/open-data
github.com/explore
github.com/trending

WGierke / git_better

Challenge & Notes #2
