humphd opened this issue 2 years ago
So, to clarify what we're talking about (correct me if I'm wrong), and to add some of my own ideas: a microservice within Telescope that uses GitHub user information. Specifically:
Then, connect this data to other data of the same kind to find:
For such a microservice, we would need:
Now, about the tools. First, do we still want this to be a microservice within Telescope, or is it a completely new and different thing? From what I roughly know of AI/machine learning, Python is usually the better choice. We could, however, stick with JS; that seems workable too.
We need to see who's down to try this, and ask them what they want to use.
What this project will involve:
We can simplify this into smaller steps. Before we even dive into machine learning, we can start by simply connecting to GitHub and displaying info on users. Specifically, we can create a contribution log, which would just be a list of the latest contributions of those who signed into our system, showing all the information we want to gather. We can start with just username and repo name; later, add language, tags, and time; finally, more detailed stuff like "how long it took for the PR to get merged", "who the reviewers were", etc.
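To make that first step concrete, here's a minimal sketch of fetching a user's latest PRs with the GitHub search API. The token handling and username are placeholders, not a settled design:

```python
# Minimal sketch of a "contribution log" fetcher using the GitHub search API.
# Assumes a GITHUB_TOKEN environment variable; the username is a placeholder.
import os
import requests

def latest_prs(username, count=10):
    """Return the user's most recent pull requests (username + repo to start)."""
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": f"type:pr author:{username}", "sort": "created",
                "order": "desc", "per_page": count},
        headers={"Accept": "application/vnd.github+json",
                 "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {
            "user": username,
            # repository_url looks like https://api.github.com/repos/owner/repo
            "repo": pr["repository_url"].split("/repos/")[1],
            "title": pr["title"],
            "url": pr["html_url"],
        }
        for pr in resp.json()["items"]
    ]

for pr in latest_prs("some-telescope-user"):
    print(f'{pr["user"]}: {pr["repo"]} ({pr["url"]})')
```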
Then, a UI for users to rate their experience. How difficult was the task? How easy was the project to figure out (setup, instructions, documentation)? How familiar were you with the tools used?
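The rating data itself could be a simple fixed record per PR. A sketch of one possible shape (the field names and the 1-5 scale are my assumptions, not a settled schema):

```python
# One possible shape for an experience rating; field names and the 1-5
# scale are assumptions, not a settled schema.
from dataclasses import dataclass, asdict
import json

@dataclass
class ExperienceRating:
    pr_url: str
    task_difficulty: int    # 1 (easy) .. 5 (hard)
    project_clarity: int    # setup, instructions, documentation
    tool_familiarity: int   # how familiar the contributor was with the tools

rating = ExperienceRating(
    pr_url="https://github.com/owner/repo/pull/123",  # placeholder URL
    task_difficulty=3,
    project_clarity=4,
    tool_familiarity=2,
)
print(json.dumps(asdict(rating)))
```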
Even before we get into the unknown and cool territory of machine learning, there's still a lot of stuff we can do.
I'm not sure if we need a microservice for this or not. We could actually start with a static HTML page, since all the data is historic.
Extracting and generating the data to create this HTML will require some code, though.
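For example, if the collected contributions were sitting in a JSON file, generating the page could be as simple as this (the file name and record shape are assumptions):

```python
# Sketch: render collected contribution data into a static HTML page.
# Assumes contributions.json holds a list of {user, repo, title, url} records.
import json
from html import escape

def render(contributions):
    rows = "\n".join(
        f'<li>{escape(c["user"])}: <a href="{escape(c["url"])}">'
        f'{escape(c["repo"])} - {escape(c["title"])}</a></li>'
        for c in contributions
    )
    return f"<!DOCTYPE html>\n<html><body><ul>\n{rows}\n</ul></body></html>"

with open("contributions.json") as f:
    data = json.load(f)
with open("index.html", "w") as f:
    f.write(render(data))
```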
I think doing this within Telescope is a good idea, since that's where this data comes from, and that's who will likely use it. However, I'm not sure where to "put" it yet. We can solve that later when we have something to put somewhere!
I want to see us add a new "feature" to Telescope. This issue will serve as a starting point, but we'll need to file specific issues for the different parts. I'll get the ball rolling.
Telescope users have the following in common:
One of the things I read over and over again from students is that they say they struggle to find issues to work on during Hacktoberfest. The reality isn't that they struggle to find issues (there are millions), but rather that they struggle to reconcile their current skillset with the expectations of the course and the time available. They also struggle with imposter syndrome, and imagine that they can't work on many projects that they actually could do just fine.
We have a number of tools to help with these problems, based on what I wrote in the list above. First, every student blogs about their PRs, and has done so for years. Second, we have wiki pages with lots of info about what people worked on (i.e., links to issues, PRs). These wiki pages, therefore, contain all kinds of info about projects that previous students found useful for their purposes, and which might still be valuable. Furthermore, their blog posts provide guidance and insights on how to work on them.
We need to mine this data and make it available to the next set of students.
Also, I'd like to see us collect info about how to search for issues, how to evaluate projects, etc. based on the current work people have done in October 2022. We should create a guide document of some kind that lays out a template for how to get started and how to be effective.
Here are the previous five years of Hacktoberfest wiki pages with every student's GitHub info we can mine for data:
In the past, I've manually done some of this work:
I have some scripts I use to get stats:
https://github.com/humphd/github-contrib-stats
Things that would be good to extract from those wiki pages, and the GitHub URLs they include:
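Whatever we settle on extracting, step one is pulling the GitHub issue/PR links out of each wiki page. A rough sketch (the wiki URL is a placeholder; the regex covers the common /pull/ and /issues/ URL shapes):

```python
# Sketch: pull GitHub issue/PR links out of a wiki page.
import re
import requests

PR_OR_ISSUE = re.compile(
    r"https://github\.com/([\w.-]+)/([\w.-]+)/(?:pull|issues)/(\d+)"
)

def extract_refs(wiki_url):
    html = requests.get(wiki_url, timeout=10).text
    # Deduplicate (owner, repo, number) tuples.
    return sorted(set(PR_OR_ISSUE.findall(html)))

# Placeholder wiki URL; substitute the real Hacktoberfest wiki pages.
for owner, repo, number in extract_refs("https://github.com/ORG/REPO/wiki/Hacktoberfest-2022"):
    print(f"{owner}/{repo}#{number}")
```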
Since this old data is static, I wonder if we could do any AI/machine learning on it to extract any lessons? Any data/ML folks want to try? For example, we can use the GitHub API to pull all kinds of data (JSON) about each PR and could use that to extract features we might train a model on. Could we build something we can use in the future to evaluate whether a given PR is a good fit?
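As a starting point, here's an exploratory sketch of what that could look like: fetch each PR's JSON, derive a few numeric features, and fit a toy classifier. The feature set, the example PR references, and the "good fit" labels are all assumptions for illustration; real labels would have to come from the student ratings and blog data above.

```python
# Exploratory sketch: derive features from PR JSON and fit a toy classifier.
# Features, PR references, and labels here are assumptions for illustration.
import os
import requests
from sklearn.linear_model import LogisticRegression

HEADERS = {
    "Accept": "application/vnd.github+json",
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
}

def pr_features(owner, repo, number):
    pr = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}",
        headers=HEADERS, timeout=10,
    ).json()
    return [
        pr["additions"],                         # size of the change
        pr["deletions"],
        pr["changed_files"],
        pr["comments"] + pr["review_comments"],  # how much discussion it took
    ]

# Hypothetical training set: (owner, repo, number) plus a 0/1 "good fit"
# label that would really come from student ratings or blog posts.
labeled = [
    (("octocat", "Hello-World", 1), 1),  # placeholder: rated a good fit
    (("octocat", "Hello-World", 2), 0),  # placeholder: rated too hard
]
X = [pr_features(*ref) for ref, _ in labeled]
y = [label for _, label in labeled]
model = LogisticRegression().fit(X, y)
print(model.predict([pr_features("octocat", "Hello-World", 3)]))  # placeholder PR
```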