humphd opened this issue 2 years ago
So, to clarify what we're talking about (correct me if I'm wrong), and to add some of my own ideas: a microservice within Telescope that uses GitHub user information. Specifically:
Then, connect this data to other data of the same kind to find:
For such a microservice, we would need:
Now, about the tools. First, do we still want this to be a microservice within Telescope, or is it a completely new and different thing? From what I roughly know of AI/machine learning, Python is usually the better choice. We could, however, stick with JS; that seems workable too.
We need to see who's down to try this, and ask them what they want to use.
What this project will involve:
We can simplify this into smaller steps. Before we even dive into machine learning, we can start by simply connecting to GitHub and displaying info on users. Specifically, we can create a contribution log, which would just be a list of the latest contributions of those who signed into our system, showing all the information we want to gather. We can start with just username and repo name; later, add language, tags, and time; finally, more detailed stuff like "how long it took for the PR to get merged", "who the reviewers were", etc.
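To make that first step concrete, here's a minimal sketch of fetching a user's latest PRs with the GitHub search API. The token handling and username are placeholders, not a settled design:

```python
# Minimal sketch of a "contribution log" fetcher using the GitHub search API.
# Assumes a GITHUB_TOKEN environment variable; the username is a placeholder.
import os
import requests

def latest_prs(username, count=10):
    """Return the user's most recent pull requests (username + repo to start)."""
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": f"type:pr author:{username}", "sort": "created",
                "order": "desc", "per_page": count},
        headers={"Accept": "application/vnd.github+json",
                 "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {
            "user": username,
            # repository_url looks like https://api.github.com/repos/owner/repo
            "repo": pr["repository_url"].split("/repos/")[1],
            "title": pr["title"],
            "url": pr["html_url"],
        }
        for pr in resp.json()["items"]
    ]

for pr in latest_prs("some-telescope-user"):
    print(f'{pr["user"]}: {pr["repo"]} ({pr["url"]})')
```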
Then, a UI for users to rate their experience. How difficult was the task? How easy was the project to figure out (setup, instructions, documentation)? How familiar were you with the tools used?
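The rating data itself could be a simple fixed record per PR. A sketch of one possible shape (the field names and the 1-5 scale are my assumptions, not a settled schema):

```python
# One possible shape for an experience rating; field names and the 1-5
# scale are assumptions, not a settled schema.
from dataclasses import dataclass, asdict
import json

@dataclass
class ExperienceRating:
    pr_url: str
    task_difficulty: int    # 1 (easy) .. 5 (hard)
    project_clarity: int    # setup, instructions, documentation
    tool_familiarity: int   # how familiar the contributor was with the tools

rating = ExperienceRating(
    pr_url="https://github.com/owner/repo/pull/123",  # placeholder URL
    task_difficulty=3,
    project_clarity=4,
    tool_familiarity=2,
)
print(json.dumps(asdict(rating)))
```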
Even before we get into the unknown and cool territory of machine learning, there's still a lot of stuff we can do.
I'm not sure if we need a microservice for this or not. We could actually start with a static HTML page, since all the data is historic.
Extracting and generating the data to create this HTML will require some code, though.
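For example, if the collected contributions were sitting in a JSON file, generating the page could be as simple as this (the file name and record shape are assumptions):

```python
# Sketch: render collected contribution data into a static HTML page.
# Assumes contributions.json holds a list of {user, repo, title, url} records.
import json
from html import escape

def render(contributions):
    rows = "\n".join(
        f'<li>{escape(c["user"])}: <a href="{escape(c["url"])}">'
        f'{escape(c["repo"])} - {escape(c["title"])}</a></li>'
        for c in contributions
    )
    return f"<!DOCTYPE html>\n<html><body><ul>\n{rows}\n</ul></body></html>"

with open("contributions.json") as f:
    data = json.load(f)
with open("index.html", "w") as f:
    f.write(render(data))
```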
I think doing this within Telescope is a good idea, since that's where this data comes from, and that's who will likely use it. However, I'm not sure where to "put" it yet. We can solve that later when we have something to put somewhere!
I want to see us add a new "feature" to Telescope. This issue will serve as a starting point, but we'll need to file specific issues for the different parts. I'll get the ball rolling.
Telescope users have the following in common:
One of the things I read over and over again from students is that they say they struggle to find issues to work on during Hacktoberfest. The reality isn't that they struggle to find issues (there are millions), but rather that they struggle to reconcile their current skillset with the expectations of the course and the time available. They also struggle with imposter syndrome, and imagine that they can't work on many projects that they actually could do just fine.
We have a number of tools to help with these problems, based on what I wrote in the list above. First, every student blogs about their PRs, and has done so for years. Second, we have wiki pages with lots of info about what people worked on (i.e., links to issues, PRs). These wiki pages, therefore, contain all kinds of info about projects that previous students found useful for their purposes, and which might still be valuable. Furthermore, their blog posts provide guidance and insights on how to work on them.
We need to mine this data and make it available to the next set of students.
Also, I'd like to see us collect info about how to search for issues, how to evaluate projects, etc. based on the current work people have done in October 2022. We should create a guide document of some kind that lays out a template for how to get started and how to be effective.
Here are the previous five years of Hacktoberfest wiki pages with every student's GitHub info we can mine for data:
In the past, I've manually done some of this work:
I have some scripts I use to get stats:
https://github.com/humphd/github-contrib-stats
Things that would be good to extract from those wiki pages, and the GitHub URLs they include:
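Whatever we settle on extracting, step one is pulling the GitHub issue/PR links out of each wiki page. A rough sketch (the wiki URL is a placeholder; the regex covers the common /pull/ and /issues/ URL shapes):

```python
# Sketch: pull GitHub issue/PR links out of a wiki page.
import re
import requests

PR_OR_ISSUE = re.compile(
    r"https://github\.com/([\w.-]+)/([\w.-]+)/(?:pull|issues)/(\d+)"
)

def extract_refs(wiki_url):
    html = requests.get(wiki_url, timeout=10).text
    # Deduplicate (owner, repo, number) tuples.
    return sorted(set(PR_OR_ISSUE.findall(html)))

# Placeholder wiki URL; substitute the real Hacktoberfest wiki pages.
for owner, repo, number in extract_refs("https://github.com/ORG/REPO/wiki/Hacktoberfest-2022"):
    print(f"{owner}/{repo}#{number}")
```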
Since this old data is static, I wonder if we could do any AI/machine learning on it to extract any lessons? Any data/ML folks want to try? For example, we can use the GitHub API to pull all kinds of data (JSON) about each PR and could use that to extract features we might train a model on. Could we build something we can use in the future to evaluate whether a given PR is a good fit?
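As a starting point, here's an exploratory sketch of what that could look like: fetch each PR's JSON, derive a few numeric features, and fit a toy classifier. The feature set, the example PR references, and the "good fit" labels are all assumptions for illustration; real labels would have to come from the student ratings and blog data above.

```python
# Exploratory sketch: derive features from PR JSON and fit a toy classifier.
# Features, PR references, and labels here are assumptions for illustration.
import os
import requests
from sklearn.linear_model import LogisticRegression

HEADERS = {
    "Accept": "application/vnd.github+json",
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
}

def pr_features(owner, repo, number):
    pr = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}",
        headers=HEADERS, timeout=10,
    ).json()
    return [
        pr["additions"],                         # size of the change
        pr["deletions"],
        pr["changed_files"],
        pr["comments"] + pr["review_comments"],  # how much discussion it took
    ]

# Hypothetical training set: (owner, repo, number) plus a 0/1 "good fit"
# label that would really come from student ratings or blog posts.
labeled = [
    (("octocat", "Hello-World", 1), 1),  # placeholder: rated a good fit
    (("octocat", "Hello-World", 2), 0),  # placeholder: rated too hard
]
X = [pr_features(*ref) for ref, _ in labeled]
y = [label for _, label in labeled]
model = LogisticRegression().fit(X, y)
print(model.predict([pr_features("octocat", "Hello-World", 3)]))  # placeholder PR
```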