Seneca-CDOT / telescope

A tool for tracking blogs in orbit around Seneca's open source involvement
https://telescope.cdot.systems
BSD 2-Clause "Simplified" License

Extract and store GitHub URL data in parser service #2831

Open humphd opened 2 years ago

humphd commented 2 years ago

In #2827 we're adding support for dependency information from npm and GitHub. We'll have the ability to query for any data we need about a package we use.

At the same time, we have thousands of old blog posts that include URLs to GitHub projects, Issues, and Pull Requests. Let's extract and index all this GitHub project data in the parser service.

Currently, when we parse a post, we store the data in Redis and also index it in Elasticsearch. Let's also pull all URLs out of a post's text, and then figure out which ones are GitHub related. We can then include this data in our "database." Since we're adding Supabase to the mix, that gives us 3 options for where to put this data:

- Redis
- Elasticsearch
- Supabase (Postgres)

Depending on what we want to do with this data in the future, we can pick the right backend(s).
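
To make the extraction step concrete, here's a rough sketch of what pulling GitHub URLs out of a post's text could look like. This is purely illustrative: the function names, types, and regex approach are assumptions, not something the parser service does today.

```typescript
// Hypothetical sketch: pull all URLs out of a post's text, keep only GitHub ones,
// and classify them as user/org, repo, issue, or pull request.
type GitHubUrlType = 'user' | 'repo' | 'issue' | 'pull';

interface GitHubUrl {
  type: GitHubUrlType;
  owner: string;
  repo?: string;
  number?: number;
  href: string;
}

// Naive URL matcher; a real implementation would more likely walk the post's HTML.
const URL_REGEX = /https?:\/\/[^\s"'<>)]+/g;

function classifyGitHubUrl(href: string): GitHubUrl | null {
  let url: URL;
  try {
    url = new URL(href);
  } catch {
    return null;
  }
  if (url.hostname !== 'github.com') return null;

  const [owner, repo, kind, num] = url.pathname.split('/').filter(Boolean);
  if (!owner) return null;
  if (!repo) return { type: 'user', owner, href };
  if (kind === 'issues' && num) return { type: 'issue', owner, repo, number: Number(num), href };
  if (kind === 'pull' && num) return { type: 'pull', owner, repo, number: Number(num), href };
  return { type: 'repo', owner, repo, href };
}

function extractGitHubUrls(postText: string): GitHubUrl[] {
  const matches = postText.match(URL_REGEX) ?? [];
  return matches
    .map(classifyGitHubUrl)
    .filter((u): u is GitHubUrl => u !== null);
}
```

We could run something like this over each post at parse time and hand the results to whichever backend(s) we pick.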

Ideally, it would be good to gather metrics about which GitHub users/orgs, repos, issues, and PRs our community is discussing, and connect this back to what we know about our own dependencies (it will sometimes overlap, and sometimes not).

I think that searching this data is one way I can imagine using it, so maybe Elasticsearch is the best option? For example, I might want to find blog posts where people worked on, or wrote about, a particular repo by giving its name. Which users/orgs are most popular? Which repos get the most attention?
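
As a rough sketch of that kind of query, assuming we indexed the extracted GitHub data alongside each post (the `posts` index name and `github.repos` keyword field are assumptions, nothing like this exists yet):

```typescript
// Hypothetical sketch: a terms aggregation could answer "which repos get the most attention?"
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: process.env.ELASTICSEARCH_URL || 'http://localhost:9200' });

async function topGitHubRepos(size = 10) {
  const response = await client.search({
    index: 'posts',
    body: {
      size: 0,
      aggs: {
        top_repos: {
          terms: { field: 'github.repos', size },
        },
      },
    },
  });

  // v7 clients wrap the result in `.body`; v8 returns it directly.
  return response.body.aggregations.top_repos.buckets;
}
```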

NOTE: I'm suggesting we only do this in the new parser service, and not bother back-porting to the legacy parser.

humphd commented 2 years ago

Another thing that might be interesting is to pull down data about the project/issue/PR from GitHub and store that too. For example, with a PR: was it merged? Who wrote the code? How much code was added/removed? What language(s) does the project use (JS vs. C++)?
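
A quick sketch of how we might get that from the public GitHub REST API (the `pulls` and `languages` endpoints expose these fields; the function name is illustrative, and auth/rate-limit handling is omitted):

```typescript
// Hypothetical sketch: fetch PR metadata and the repo's language breakdown from GitHub.
async function getPullRequestInfo(owner: string, repo: string, number: number) {
  const headers = { Accept: 'application/vnd.github.v3+json' };

  const prRes = await fetch(
    `https://api.github.com/repos/${owner}/${repo}/pulls/${number}`,
    { headers }
  );
  const pr = await prRes.json();

  const langRes = await fetch(
    `https://api.github.com/repos/${owner}/${repo}/languages`,
    { headers }
  );
  const languages = await langRes.json();

  return {
    merged: pr.merged,       // was it merged?
    author: pr.user?.login,  // who wrote the code?
    additions: pr.additions, // lines added
    deletions: pr.deletions, // lines removed
    languages,               // e.g. { JavaScript: 12345, "C++": 678 }
  };
}
```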

humphd commented 2 years ago

Getting this data from GitHub overlaps with our dependency service, see #2827, and we might be able to connect these somehow?

humphd commented 2 years ago

Now that we have Supabase, it probably makes sense to store this info in a postgres table, since it won't change very often.
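
A rough sketch of what that could look like with supabase-js; the `github_urls` table, its columns, and the env var names are hypothetical, just to show the shape:

```typescript
// Hypothetical sketch: insert extracted GitHub URL rows into a Postgres table via Supabase.
import { createClient } from '@supabase/supabase-js';

interface ExtractedGitHubUrl {
  type: 'user' | 'repo' | 'issue' | 'pull';
  owner: string;
  repo?: string;
  number?: number;
  href: string;
}

const supabase = createClient(
  process.env.SUPABASE_URL as string,
  process.env.SUPABASE_SERVICE_ROLE_KEY as string
);

async function storeGitHubUrls(postId: string, urls: ExtractedGitHubUrl[]) {
  const rows = urls.map((u) => ({
    post_id: postId,
    type: u.type,
    owner: u.owner,
    repo: u.repo ?? null,
    number: u.number ?? null,
    href: u.href,
  }));

  const { error } = await supabase.from('github_urls').insert(rows);
  if (error) throw error;
}
```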

JerryHue commented 2 years ago

I would like to pitch for Postgres, for two reasons:

However, it is totally valid to use Elasticsearch as the backing store for these relationships, and I am open to other points of view.

Andrewnt219 commented 2 years ago

I'm unassigning myself from this since I don't have time for the research. Sorry for holding up the work. The issue is up for grabs.

humphd commented 2 years ago

No problem @Andrewnt219, understood. If you want to get involved later, you'd be welcome.

@JerryHue I think using Supabase for this is a fine idea. If we want to layer on some specific search use case later, and index things in ES too, we can always add it.

TueeNguyen commented 2 years ago

Closing via #3215

JerryHue commented 2 years ago

Reopening because the extraction of the GitHub information is not being done yet, only the storage.