HWTechClub / Roadmap

Future plans of HW Tech Club

GitHub Profile Scraper #7

Open FarazzShaikh opened 3 years ago

FarazzShaikh commented 3 years ago

First and Last Name

Faraz Shaikh

Email

farazzshaikh@gmail.com frzskh@hw.ac.uk

Company/Organization (Ex: Heriot-Watt)

Heriot-Watt

Job Title (Ex: Student)

Student

Project Title

GitHub profile scraper (will think of something more creative later)

Briefly describe the project

See below

What kind of machines and how many do you expect to use?

None

What operating system and networking are you planning to use?

None?

Any other relevant details we should know about?

See below

Additional context


GitHub profile scraper

A self-hosted GitHub profile scraper that can be used as a middleman between your site and GitHub's API.

The problem

The official GitHub API rate limits you to about 60 requests an hour for the core API and 20 for search. Furthermore, some data simply requires API gymnastics to retrieve.
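As a concrete illustration, GitHub's REST API reports your remaining quota in `x-ratelimit-remaining` and `x-ratelimit-reset` response headers, so a client can decide up front whether another call is safe. The helper below is a hypothetical sketch of that check (the function name is an assumption, not part of any project):

```typescript
// Hypothetical helper: decide whether another GitHub REST call is safe,
// based on the rate-limit headers of a previous response.
type RateHeaders = {
  "x-ratelimit-remaining": string; // requests left in the current window
  "x-ratelimit-reset": string;     // Unix time (seconds) when the window resets
};

function canCallApi(headers: RateHeaders, nowSeconds: number): boolean {
  const remaining = parseInt(headers["x-ratelimit-remaining"], 10);
  const reset = parseInt(headers["x-ratelimit-reset"], 10);
  // Either quota is left, or the window has already reset.
  return remaining > 0 || nowSeconds >= reset;
}
```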

Yes, the GraphQL API exists and is better, but do you really want to set up GraphQL for static sites? I don't. Besides, it's a cool little side project to spend a week on.

The Solution

This will use Firebase Cloud Functions to run a function every couple of hours (or whatever interval) and scrape the contents of a GitHub profile, either via good ol' web scraping or the GitHub API itself. After that, it will store all the data as one or two documents in Firebase Realtime Database.
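The core of that scheduled function is just a transform from raw API data into a compact document. Here is a rough sketch of what that step might look like, with the Firebase wiring omitted; the interface only names a few of the fields the real `/users/:user/repos` response carries, and the output field names are illustrative assumptions:

```typescript
// Minimal slice of a repo object from GitHub's /users/:user/repos endpoint.
interface RawRepo {
  name: string;
  stargazers_count: number;
  fork: boolean;
  language: string | null;
}

// Condense a user's repo list into the single document the scraper would
// write to the Realtime Database. All output field names are illustrative.
function buildProfileDoc(login: string, repos: RawRepo[]) {
  const own = repos.filter((r) => !r.fork); // ignore forks in the stats
  return {
    login,
    repoCount: own.length,
    totalStars: own.reduce((sum, r) => sum + r.stargazers_count, 0),
    topLanguages: Array.from(
      new Set(own.map((r) => r.language).filter((l): l is string => l !== null))
    ),
    scrapedAt: Date.now(), // so the site can show how fresh the data is
  };
}
```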

The user can then run another Cloud Function to fetch the data from the database. Something like this:

(architecture diagram)

The Use

You can use this to include "real time" GitHub stats in your whatever. Personally, I will use this to do the same in my portfolio site.

The data that would be useful is things like

FarazzShaikh commented 3 years ago

Just a side note: this is not a replacement for the official API. It's simply a buffer between the API and the user so you don't run into the rate limit.

Generally, when you attempt real-time GitHub stats using the official API, you need to make more than one request to get all the information needed for an appealing UI. For example, to display the latest repository, you first query the search API, then take a URL from the response and query it to get the languages used.
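That two-step flow can be sketched as follows. Repo objects in search responses really do carry a `languages_url` field pointing at the second endpoint; the helper that picks the most recently updated repo is a hypothetical illustration (ISO-8601 timestamps compare correctly as strings, which the reduce below relies on):

```typescript
// Minimal slice of one item in a GitHub search-API response.
interface SearchItem {
  full_name: string;
  updated_at: string;    // ISO-8601, e.g. "2021-06-01T00:00:00Z"
  languages_url: string; // endpoint for request #2
}

// Step 1 returned `items`; pick the most recently updated repo. Its
// languages_url must then be fetched in a *second* request (step 2),
// which is how the request count balloons.
function latestRepoLanguagesUrl(items: SearchItem[]): string | null {
  if (items.length === 0) return null;
  const latest = items.reduce((a, b) => (a.updated_at > b.updated_at ? a : b));
  return latest.languages_url;
}
```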

You'd also typically want to query information about more than one repo, so you can see how quickly the rate limit will be reached, especially if you refresh a couple of times or during development of your site. Once it is reached, a 401 will crash your app or make your UI look ugly unless you provide fallback data.
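The fallback-data idea amounts to wrapping the stats fetch so an error degrades gracefully instead of crashing the UI. A minimal synchronous sketch, assuming made-up `Stats` fields (a real fetch would be async, but the shape of the guard is the same):

```typescript
// Hypothetical stats shape; field names are assumptions for illustration.
interface Stats {
  stars: number;
  repos: number;
}

// Canned data shown when the live fetch fails (e.g. rate limit hit).
const FALLBACK: Stats = { stars: 0, repos: 0 };

// If the fetch throws, swallow the error and return the fallback so the
// UI always has *something* to render.
function statsOrFallback(fetchStats: () => Stats): Stats {
  try {
    return fetchStats();
  } catch {
    return FALLBACK;
  }
}
```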

edit: Yeah, you can increase your limit by providing a key but you can’t really hide your key on static sites.

Akilan1999 commented 3 years ago

I guess for the initial implementation we just want to extract raw data as a module. As part of the tech club, we would be very interested in linking this module to our static website generator.

I don't think we need the scraper to run every few hours. This functionality is only needed if we are doing complex stuff such as tracking commits. Since we are also planning to host it, we should make sure the only things running on our side are the web scraper and the custom website generator module.

FarazzShaikh commented 3 years ago

Yep, you can extend this to whatever you need. We can make the interval and everything else fully configurable with something like environment variables. Since it's self-hosted, the Tech Club can run an instance of this and give it whatever config suits its needs.
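Environment-variable configuration for a self-hosted instance could look like this. The variable names (`SCRAPE_USER`, `SCRAPE_INTERVAL_HOURS`) and defaults are assumptions purely for illustration:

```typescript
// Hypothetical config reader: each self-hosted instance tunes the scraper
// through environment variables. Variable names here are assumptions.
interface ScraperConfig {
  user: string;          // GitHub profile to scrape
  intervalHours: number; // how often the scheduled function runs
}

function loadConfig(env: Record<string, string | undefined>): ScraperConfig {
  return {
    user: env.SCRAPE_USER ?? "octocat",                      // fallback profile
    intervalHours: Number(env.SCRAPE_INTERVAL_HOURS ?? "6"), // default: every 6 hours
  };
}
```

In a real deployment you would call `loadConfig(process.env)` once at startup.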

Akilan1999 commented 3 years ago

Agreed, we can start this project soon. Waiting for opinions from @benjaminjacobreji.

FarazzShaikh commented 3 years ago

In fact, I think it would be very cool if the Tech Club site showed GitHub stats for all its members (with consent, duh). It would incentivise open-source development while also providing some publicity to their projects.

Akilan1999 commented 3 years ago

Agreed

benjaminjacobreji commented 3 years ago

> Agreed we can start this project soon waiting for opinions from @benjaminjacobreji

I think this is a great idea! We should do this

FarazzShaikh commented 3 years ago

Sounds good. @Akilan1999, invite me to the organization; the 2FA thing kicked me out automatically last year.

I can set up the repo and the to-do lists; this should be very simple.

Akilan1999 commented 3 years ago

Sent!

FarazzShaikh commented 3 years ago

Cool, I will create the repo and everything this evening. I will keep this issue open till the project is complete.