Open abitrolly opened 4 years ago
That's a good idea. We were thinking about external datasets. We want to add information about repository licenses, but GHArchive.org does not contain it. As an option, we could cache the data in a DB using the GitHub API, which limits the number of requests (5000 requests per hour). Do you already have some thoughts on how to do it better?
First I would count all repositories that companies have committed to. Maybe there are fewer than 50000, and an external job can gather and publish this data daily. It could even be a service for companies to track license changes.
Another way is to patch GHArchive to parse the data. I haven't yet seen what kind of events it receives, but license is present on repo objects in event responses: https://developer.github.com/v3/activity/events/types/
Yet another way is to add license info to https://release-monitoring.org/ and dump information from there. This one is limited to versioned projects that have enough value to be monitored and packaged.
I can help with that. For example, at the moment we have ~2600457 unique repository IDs for 2020. On average we get information about ~1026296 unique repositories per day, and most of them repeat every day.
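To put these counts against the 5000 requests per hour limit mentioned earlier, here is a rough back-of-the-envelope sketch. It assumes one API request per repository and a single token, and it reads the ~1026296 figure as a per-day count; none of these assumptions is confirmed in this thread.

```python
import math

RATE_LIMIT_PER_HOUR = 5000        # GitHub API limit quoted earlier in this thread
UNIQUE_REPOS_2020 = 2_600_457     # approximate unique repository IDs for 2020
UNIQUE_REPOS_PER_DAY = 1_026_296  # approximate unique repositories seen per day (assumed reading)
GUESSED_COMPANY_REPOS = 50_000    # the "fewer than 50000" guess from the earlier comment


def hours_to_fetch(repo_count, rate_limit=RATE_LIMIT_PER_HOUR):
    """Whole hours needed, assuming one request per repository (an assumption)."""
    return math.ceil(repo_count / rate_limit)


for label, count in [
    ("all 2020 repositories", UNIQUE_REPOS_2020),
    ("one day of repositories", UNIQUE_REPOS_PER_DAY),
    ("guessed company repositories", GUESSED_COMPANY_REPOS),
]:
    print(f"{label}: {hours_to_fetch(count)} hours")
```

Under these assumptions, a full pass over the 2020 IDs takes about 521 hours (roughly 22 days), one day of repositories needs about 206 hours and so cannot be refreshed within 24 hours without a cache, while 50000 repositories would fit in about 10 hours.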
If you fill in the :owner and :repo fields, this link returns a response with the license type: https://api.github.com/repos/:owner/:repo/license (GHArchive itself uses the following link: https://api.github.com/events).
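A minimal sketch of how that license endpoint could be queried and cached-friendly parsed, using only the standard library. The helper names (`license_url`, `parse_license`, `fetch_license`) are hypothetical, and the code assumes the response carries a `license` object with an `spdx_id` field, as in the GitHub REST documentation at the time of writing.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def license_url(owner, repo):
    """Build the license endpoint URL for one repository."""
    return f"{API_ROOT}/repos/{owner}/{repo}/license"


def parse_license(payload):
    """Return the SPDX ID from a decoded license response, or None if absent."""
    info = payload.get("license") or {}
    return info.get("spdx_id")


def fetch_license(owner, repo, token=None):
    """Fetch one repository's license; each call counts against the rate limit.

    A repository without a detectable license answers 404, which
    urllib raises as urllib.error.HTTPError.
    """
    request = urllib.request.Request(
        license_url(owner, repo),
        headers={"Accept": "application/vnd.github+json"},
    )
    if token:  # an authenticated token is assumed for the higher rate limit
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return parse_license(json.load(response))
```

Keeping `parse_license` separate from the network call makes it possible to test the parsing on stored responses and to reuse it if the data later comes from a dump instead of the live API.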
For a first approximation it is easier to start with https://console.cloud.google.com/marketplace/details/github/github-repos (see the Kaggle tutorial) and then https://libraries.io/data. But maybe resorting to http://ghtorrent.org/ is the easiest way.
This thread mixes two features so I would like to separate them. Let's use this issue for the feature of how to find license information and add that into our algorithm. I'll open a separate issue to track the feature for adding location information.
The BigQuery dataset on GCP shows "Last modified 20 Mar 2019, 22:03:20". Just BTW. It can still be useful for some test queries.
@patrickstephens1 maybe it is possible to send a letter to the new owners of GitHub asking if they plan to fix the publishing of this dataset?
@abitrolly I'll dig into it. I can see this dataset on GCP was created by Google back in 2016, when GitHub was an independent company. Now that GitHub is owned by Microsoft, it might be a different situation!
(Housekeeping: I moved the original issue written here by @abitrolly into #8)
Enhance the OSCI algorithm to filter only projects with open-source licenses. This will require some external datasets.
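As a rough illustration of the filtering step, assuming license data has already been gathered as a mapping from repository name to SPDX ID: the allowlist below is illustrative and not exhaustive, the repository names are made up, and this is not the OSCI algorithm itself.

```python
# Illustrative allowlist only; a real one would need to be agreed on separately.
OPEN_SOURCE_SPDX_IDS = frozenset(
    {"MIT", "Apache-2.0", "GPL-3.0", "BSD-3-Clause", "MPL-2.0"}
)


def filter_open_source(repo_licenses, allowed=OPEN_SOURCE_SPDX_IDS):
    """Keep only repositories whose SPDX ID is in the allowlist.

    Repositories with no license (None) or an unrecognized one
    (GitHub reports "NOASSERTION") are dropped.
    """
    return {repo for repo, spdx_id in repo_licenses.items() if spdx_id in allowed}


# Hypothetical sample data for illustration.
sample = {
    "example-org/lib-a": "MIT",
    "example-org/lib-b": None,
    "example-org/lib-c": "NOASSERTION",
    "example-org/lib-d": "Apache-2.0",
}
print(sorted(filter_open_source(sample)))
```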