epam / OSCI

Open Source Contributor Index
https://opensourceindex.io/
GNU General Public License v3.0
161 stars 99 forks

Filter only on projects with open-source licenses #6

Open abitrolly opened 4 years ago

abitrolly commented 4 years ago

(Housekeeping: I moved the original issue written here by @abitrolly into #8.)

Enhance the OSCI algorithm to filter only projects with open-source licenses. This will require some external datasets.
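As a rough illustration of the filtering step, here is a minimal sketch. It assumes license data is already available as a mapping from repository name to SPDX identifier; the allow-list `OPEN_SOURCE_SPDX` and the function `filter_open_source` are hypothetical names, not part of OSCI, and a real list would come from an external dataset.

```python
# Hypothetical allow-list: a handful of SPDX identifiers for OSI-approved
# licenses. This is deliberately incomplete and only for illustration.
OPEN_SOURCE_SPDX = {
    "MIT",
    "Apache-2.0",
    "GPL-3.0",
    "GPL-3.0-only",
    "BSD-3-Clause",
    "MPL-2.0",
}


def filter_open_source(repo_licenses):
    """Keep only repositories whose license is in the allow-list.

    repo_licenses maps "owner/repo" to an SPDX identifier, or to None
    when no license information is known.
    """
    return {
        repo: spdx
        for repo, spdx in repo_licenses.items()
        if spdx in OPEN_SOURCE_SPDX
    }


if __name__ == "__main__":
    sample = {"a/b": "MIT", "c/d": None, "e/f": "NOASSERTION"}
    print(filter_open_source(sample))
```

Repositories with an unknown or unrecognized license are dropped here; whether they should instead be kept pending review is a design choice left open in this thread.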

EmbeddAlex commented 4 years ago

That's a good idea. We were already thinking about external datasets. We want to add information about repository licenses, but GHArchive.org does not contain it. As an option, we can cache the data in a DB using the GitHub API, which limits the number of requests (5000 requests per hour). Do you already have some thoughts on how to do it better?
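The caching idea could look roughly like the sketch below. It assumes SQLite as the DB and a caller-supplied `fetch` function that performs the actual API call; `open_cache`, `cached_license`, and the `licenses` table are hypothetical names chosen for illustration. The `budget` argument stands in for the remaining requests in the current hour.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the license cache. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS licenses "
        "(repo TEXT PRIMARY KEY, spdx_id TEXT)"
    )
    return conn


def cached_license(conn, repo, fetch, budget):
    """Return (spdx_id, remaining_budget) for one repository.

    A cache hit costs nothing. A miss calls fetch(repo), stores the
    result, and spends one request from the budget. If the budget is
    exhausted and the repository is not cached, RuntimeError is raised.
    """
    row = conn.execute(
        "SELECT spdx_id FROM licenses WHERE repo = ?", (repo,)
    ).fetchone()
    if row is not None:
        return row[0], budget
    if budget <= 0:
        raise RuntimeError("hourly request budget exhausted")
    spdx_id = fetch(repo)
    conn.execute(
        "INSERT INTO licenses (repo, spdx_id) VALUES (?, ?)", (repo, spdx_id)
    )
    conn.commit()
    return spdx_id, budget - 1
```

Because repeated repositories are served from the cache, only repositories not seen before count against the hourly limit.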

abitrolly commented 4 years ago

First I would count all repositories that companies have committed to. Maybe there are less than 50000 and an external job can gather and publish this data daily. It could be even a service for companies to track license changes.

abitrolly commented 4 years ago

Another way is to patch GHArchive to parse the data. I haven't seen yet what kind of events it receives, but license is present on repo objects in event responses https://developer.github.com/v3/activity/events/types/

abitrolly commented 4 years ago

Yet another way is to add license info to https://release-monitoring.org/ and dump information from there. This one is limited to versioned projects that have enough value to be monitored and packaged.

EmbeddAlex commented 4 years ago

I can help with that. For example, at the moment we have ~2600457 unique repository IDs for 2020. On average we get information about ~1026296 unique repositories, and most of them repeat every day.

> First I would count all repositories that companies have committed to. Maybe there are less than 50000 and an external job can gather and publish this data daily. It could be even a service for companies to track license changes.
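To put these repository counts against the 5000-requests-per-hour limit mentioned earlier in the thread, here is a back-of-the-envelope calculation. It assumes one API request per repository and a single API token; both are assumptions made only for this estimate.

```python
import math

RATE_LIMIT_PER_HOUR = 5000  # limit quoted earlier in this thread


def hours_needed(repo_count, per_hour=RATE_LIMIT_PER_HOUR):
    """Whole hours needed to query each repository once."""
    return math.ceil(repo_count / per_hour)


print(hours_needed(2_600_457))  # 521 hours, roughly 22 days
print(hours_needed(1_026_296))  # 206 hours
```

Under these assumptions a one-off pass over all 2020 repository IDs takes about three weeks with one token, so a cache that avoids re-querying repeated repositories matters.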

EmbeddAlex commented 4 years ago

> Another way is to patch GHArchive to parse the data. I haven't seen yet what kind of events it receives, but license is present on repo objects in event responses https://developer.github.com/v3/activity/events/types/

If you fill in the :owner/:repo fields, this link returns a response with the license type: https://api.github.com/repos/:owner/:repo/license. GHArchive uses the following link: https://api.github.com/events
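A minimal sketch of reading the license type from that endpoint follows. The helper names `license_url`, `spdx_from_response`, and `fetch_spdx` are hypothetical. The sketch assumes the response body carries a `license` object with an `spdx_id` field, and it sends the request unauthenticated, which is subject to a much lower rate limit than the 5000 per hour available with a token.

```python
import json
import urllib.request


def license_url(owner, repo):
    """Build the license endpoint URL for one repository."""
    return f"https://api.github.com/repos/{owner}/{repo}/license"


def spdx_from_response(payload):
    """Extract the SPDX identifier from a decoded JSON response body.

    Returns None when the body has no usable "license" object.
    """
    license_info = payload.get("license") or {}
    return license_info.get("spdx_id")


def fetch_spdx(owner, repo):
    """Query the endpoint and return the SPDX identifier.

    urllib raises urllib.error.HTTPError on a non-2xx status, for
    example when the API reports no license for the repository.
    """
    with urllib.request.urlopen(license_url(owner, repo)) as resp:
        return spdx_from_response(json.load(resp))
```

For example, `fetch_spdx("epam", "OSCI")` would request https://api.github.com/repos/epam/OSCI/license. Keeping the parsing in `spdx_from_response` separate from the network call makes it possible to reuse it on cached or archived responses.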

abitrolly commented 4 years ago

For a first approximation, it is then easier to start with https://console.cloud.google.com/marketplace/details/github/github-repos (there is a Kaggle tutorial) and then https://libraries.io/data. But maybe resorting to http://ghtorrent.org/ is the easiest way.

patrickstephens1 commented 4 years ago

This thread mixes two features so I would like to separate them. Let's use this issue for the feature of how to find license information and add that into our algorithm. I'll open a separate issue to track the feature for adding location information.

patrickstephens1 commented 4 years ago

The BigQuery dataset on GCP shows "Last modified 20 Mar 2019, 22:03:20", just BTW. It can still be useful for some test queries.

abitrolly commented 4 years ago

@patrickstephens1 maybe it is possible to send a letter to the new owners of GitHub asking whether they plan to fix the publishing of this dataset?

patrickstephens1 commented 4 years ago

@abitrolly I'll dig into it. I can see this dataset on GCP was created by Google back in 2016, when GitHub was an independent company. Now that GitHub is owned by Microsoft, it might be a different situation!