ecosyste-ms / repos

An open API service providing repository metadata for many open source software ecosystems.
https://repos.ecosyste.ms
GNU Affero General Public License v3.0

Limit data collection to public organizations/groups #481

Open bzg opened 7 months ago

bzg commented 7 months ago

https://code.gouv.fr/public/#/repos collects data from repositories of public GitHub organizations and public GitLab groups.

My understanding is that https://repos.ecosyste.ms collects data from all groups, public and private.

Can we configure repos so that it only considers public groups?

If so, can we spare the need for (GitLab) tokens?

bzg commented 7 months ago

cc @simkim

andrew commented 7 months ago

You'll still need a public token for GitLab, as the API doesn't allow read access even for public data without a token, unlike Gitea and GitHub.

andrew commented 7 months ago

Right now repos will try to crawl a whole forge, and if it's given a token that can see private repos I suspect it may find some, although I've not tested that. We can definitely add a check to reject repositories that have a private flag.

Groups/orgs are called "owners" in the repos service.
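For illustration only, here is a minimal Python sketch of what such a check could look like. It is not the code used by the repos service. The helper names `is_private` and `public_only` are hypothetical; the sketch relies on the boolean `private` field found in GitHub and Gitea repository payloads and the `visibility` string found in GitLab project payloads, and it assumes that a payload with neither field should be rejected.

```python
from typing import Any, Iterable, Iterator, Mapping


def is_private(repo: Mapping[str, Any]) -> bool:
    """Return True unless the payload clearly marks the repository as public."""
    # GitHub- and Gitea-style payloads carry a boolean "private" field.
    if "private" in repo:
        return bool(repo["private"])
    # GitLab-style payloads carry a "visibility" string instead;
    # anything other than "public" (e.g. "internal") is treated as private.
    if "visibility" in repo:
        return repo["visibility"] != "public"
    # Assumption: with no flag at all, err on the side of rejecting.
    return True


def public_only(repos: Iterable[Mapping[str, Any]]) -> Iterator[Mapping[str, Any]]:
    """Yield only the repositories that pass the private-flag check."""
    return (repo for repo in repos if not is_private(repo))


if __name__ == "__main__":
    crawled = [
        {"name": "a", "private": False},
        {"name": "b", "private": True},
        {"name": "c", "visibility": "internal"},
        {"name": "d", "visibility": "public"},
    ]
    print([repo["name"] for repo in public_only(crawled)])
```

Rejecting payloads that carry no flag is a conservative choice for a crawler that may have been handed a token with access to private data.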

bzg commented 7 months ago

> Groups/orgs are called "owners" in the repos service.

Good to know, thanks.

> We can definitely add a check to reject repositories that have a private flag.

Yes, that will be useful.

> You'll still need a public token for GitLab, as the API doesn't allow read access even for public data without a token

Are you sure? This script collects metadata from GitLab instances without the need for a token. Or maybe I misunderstand what the token is needed for exactly?

andrew commented 7 months ago

> Are you sure? This script collects metadata from GitLab instances without the need for a token. Or maybe I misunderstand what the token is needed for exactly?

I'm not 100% sure about that and will need to double-check. I remember some GitLab API endpoints needing a token, but I can't recall which ones; it was a little while ago that I set that up.

bzg commented 7 months ago

Okay, thanks.

Because we want to crawl a lot of GitLab forges and because obtaining/renewing tokens can be a chore, we would love to have an option to crawl GitLab forges without tokens, even if it means that we don't get all the data we have when crawling with a token.
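As a rough sketch of what "token optional" could mean on the client side, the Python snippet below builds a request for the GitLab `GET /api/v4/projects` endpoint and only attaches the `PRIVATE-TOKEN` header when a token is supplied. The function name `gitlab_projects_request` and the example host are hypothetical, and the snippet only constructs the request without sending it. Whether a given GitLab instance answers such a request without a token is exactly the open question in this thread, and the sketch does not settle it.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def gitlab_projects_request(
    base_url: str, token: Optional[str] = None, page: int = 1
) -> Request:
    """Build (but do not send) a request listing public GitLab projects."""
    # Ask only for public projects, 100 per page.
    query = urlencode({"visibility": "public", "per_page": 100, "page": page})
    url = f"{base_url.rstrip('/')}/api/v4/projects?{query}"
    headers = {"Accept": "application/json"}
    # The token is optional: the header is added only when one is given.
    if token:
        headers["PRIVATE-TOKEN"] = token
    return Request(url, headers=headers)


if __name__ == "__main__":
    # Hypothetical host, used only to show the resulting URL.
    request = gitlab_projects_request("https://gitlab.example.org")
    print(request.full_url)
```

Passing `visibility=public` in the query keeps the request scoped to public projects even when a token with broader access is supplied.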

simkim commented 6 months ago

Fully agree on the usefulness of ignoring repos flagged as private.

andrew commented 6 months ago

PR to ignore private repos over here: https://github.com/ecosyste-ms/repos/pull/492

andrew commented 6 months ago

I've merged #492 now.