Seneca-CDOT / telescope

A tool for tracking blogs in orbit around Seneca's open source involvement
https://telescope.cdot.systems
BSD 2-Clause "Simplified" License

Extract and store GitHub URL data in parser service #2831

Open humphd opened 2 years ago

humphd commented 2 years ago

In #2827 we're adding support for dependency information from npm and GitHub. We'll have the ability to query for any data we need about a package we use.

At the same time, we have thousands of old blog posts that include URLs to GitHub projects, Issues, and Pull Requests. Let's extract and index all this GitHub project data in the parser service.

Currently, when we parse a post, we store the data in Redis and also index it in Elasticsearch. Let's also pull all URLs out of a post's text, and then figure out which ones are GitHub related. We can then include this data in our "database." Since we're adding Supabase to the mix, that gives us 3 options for where to put this data:

- Redis
- Elasticsearch
- Supabase (Postgres)

Depending on what we want to do with this data in the future, we can pick the right backend(s).
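
To make the extraction step concrete, here's a rough sketch of what pulling GitHub URLs out of a post's text could look like. This is purely illustrative: the function names, types, and regex approach are assumptions, not something the parser service does today.

```typescript
// Hypothetical sketch: pull all URLs out of a post's text, keep only GitHub ones,
// and classify them as user/org, repo, issue, or pull request.
type GitHubUrlType = 'user' | 'repo' | 'issue' | 'pull';

interface GitHubUrl {
  type: GitHubUrlType;
  owner: string;
  repo?: string;
  number?: number;
  href: string;
}

// Naive URL matcher; a real implementation would more likely walk the post's HTML.
const URL_REGEX = /https?:\/\/[^\s"'<>)]+/g;

function classifyGitHubUrl(href: string): GitHubUrl | null {
  let url: URL;
  try {
    url = new URL(href);
  } catch {
    return null;
  }
  if (url.hostname !== 'github.com') return null;

  const [owner, repo, kind, num] = url.pathname.split('/').filter(Boolean);
  if (!owner) return null;
  if (!repo) return { type: 'user', owner, href };
  if (kind === 'issues' && num) return { type: 'issue', owner, repo, number: Number(num), href };
  if (kind === 'pull' && num) return { type: 'pull', owner, repo, number: Number(num), href };
  return { type: 'repo', owner, repo, href };
}

function extractGitHubUrls(postText: string): GitHubUrl[] {
  const matches = postText.match(URL_REGEX) ?? [];
  return matches
    .map(classifyGitHubUrl)
    .filter((u): u is GitHubUrl => u !== null);
}
```

We could run something like this over each post at parse time and hand the results to whichever backend(s) we pick.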

Ideally, it would be good to gather metrics about which GitHub users/orgs, repos, issues, and PRs our community is discussing, and connect this back to what we know about our own dependencies (it will sometimes overlap, and sometimes not).

I think that searching this data is one way I can imagine using it, so maybe Elasticsearch is the best option? For example, I might want to find blog posts where people worked on, or wrote about, a particular repo by giving its name. Which users/orgs are most popular? Which repos get the most attention?
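
As a rough sketch of that kind of query, assuming we indexed the extracted GitHub data alongside each post (the `posts` index name and `github.repos` keyword field are assumptions, nothing like this exists yet):

```typescript
// Hypothetical sketch: a terms aggregation could answer "which repos get the most attention?"
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: process.env.ELASTICSEARCH_URL || 'http://localhost:9200' });

async function topGitHubRepos(size = 10) {
  const response = await client.search({
    index: 'posts',
    body: {
      size: 0,
      aggs: {
        top_repos: {
          terms: { field: 'github.repos', size },
        },
      },
    },
  });

  // v7 clients wrap the result in `.body`; v8 returns it directly.
  return response.body.aggregations.top_repos.buckets;
}
```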

NOTE: I'm suggesting we only do this in the new parser service, and not bother back-porting to the legacy parser.

humphd commented 2 years ago

Another thing that might be interesting is to pull down data about the project/issue/PR from GitHub and store that too. For example, with a PR: was it merged? Who wrote the code? How much code was added/removed? What language(s) does the project use (JS vs. C++)?
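
A quick sketch of how we might get that from the public GitHub REST API (the `pulls` and `languages` endpoints expose these fields; the function name is illustrative, and auth/rate-limit handling is omitted):

```typescript
// Hypothetical sketch: fetch PR metadata and the repo's language breakdown from GitHub.
async function getPullRequestInfo(owner: string, repo: string, number: number) {
  const headers = { Accept: 'application/vnd.github.v3+json' };

  const prRes = await fetch(
    `https://api.github.com/repos/${owner}/${repo}/pulls/${number}`,
    { headers }
  );
  const pr = await prRes.json();

  const langRes = await fetch(
    `https://api.github.com/repos/${owner}/${repo}/languages`,
    { headers }
  );
  const languages = await langRes.json();

  return {
    merged: pr.merged,       // was it merged?
    author: pr.user?.login,  // who wrote the code?
    additions: pr.additions, // lines added
    deletions: pr.deletions, // lines removed
    languages,               // e.g. { JavaScript: 12345, "C++": 678 }
  };
}
```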

humphd commented 2 years ago

Getting this data from GitHub overlaps with our dependency service, see #2827, and we might be able to connect these somehow?

humphd commented 2 years ago

Now that we have Supabase, it probably makes sense to store this info in a postgres table, since it won't change very often.
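
A rough sketch of what that could look like with supabase-js; the `github_urls` table, its columns, and the env var names are hypothetical, just to show the shape:

```typescript
// Hypothetical sketch: insert extracted GitHub URL rows into a Postgres table via Supabase.
import { createClient } from '@supabase/supabase-js';

interface ExtractedGitHubUrl {
  type: 'user' | 'repo' | 'issue' | 'pull';
  owner: string;
  repo?: string;
  number?: number;
  href: string;
}

const supabase = createClient(
  process.env.SUPABASE_URL as string,
  process.env.SUPABASE_SERVICE_ROLE_KEY as string
);

async function storeGitHubUrls(postId: string, urls: ExtractedGitHubUrl[]) {
  const rows = urls.map((u) => ({
    post_id: postId,
    type: u.type,
    owner: u.owner,
    repo: u.repo ?? null,
    number: u.number ?? null,
    href: u.href,
  }));

  const { error } = await supabase.from('github_urls').insert(rows);
  if (error) throw error;
}
```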

JerryHue commented 2 years ago

I would like to pitch for Postgres, for two reasons:

However, it is totally valid to use Elasticsearch as the backing store for these relationships, and I am open to other points of view.

Andrewnt219 commented 2 years ago

I'm unassigning myself from this since I don't have time for the research. Sorry for holding up the work. The issue is up for grabs.

humphd commented 2 years ago

No problem @Andrewnt219, understood. If you want to get involved later, you'd be welcome.

@JerryHue I think using Supabase for this is a fine idea. If we want to layer on some specific search use case later, and index things in ES too, we can always add it.

TueeNguyen commented 2 years ago

Closing via #3215

JerryHue commented 2 years ago

Reopening because the extraction of the GitHub information is not being done yet, only the storage.