cncf / landscape2

Landscape2 is a tool that generates interactive landscape websites
https://landscape.cncf.io
Apache License 2.0
145 stars, 40 forks

API not returning all repos for given project's GitHub org #715

Closed: jmertic closed this 1 week ago

jmertic commented 2 weeks ago

See the examples in https://lfenergy.landscape2.io/api/projects/all.json: only one repo is returned per project, even when a project org is specified. In that case, all repos in the org should be returned.

tegioz commented 2 weeks ago

Hi @jmertic 👋

That endpoint returns all repositories defined in the landscape data file, which includes the primary repository as defined in the repo_url field, plus any additional ones listed in the additional_repos field.

Please note that landscape v2 (intentionally) neither collects nor processes all repositories in a given GitHub organization. So only repositories explicitly listed in the data file will be returned by this endpoint.
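
As a rough illustration of the behaviour described above, here is a minimal Python sketch that collects only the repositories explicitly listed on an item. The function name `listed_repositories`, the sample item, and the assumption that each `additional_repos` entry is a mapping with its own `repo_url` key are hypothetical; this is not the landscape2 implementation.

```python
from typing import Any


def listed_repositories(item: dict[str, Any]) -> list[str]:
    """Return only the repositories explicitly listed on a landscape item.

    Assumed shape (hypothetical): `repo_url` is a string and
    `additional_repos` is a list of mappings that each carry a `repo_url`.
    """
    repos: list[str] = []

    # The primary repository comes first, if the item defines one.
    primary = item.get("repo_url")
    if primary:
        repos.append(primary)

    # Then any additional repositories, skipping duplicates.
    for extra in item.get("additional_repos") or []:
        url = extra.get("repo_url")
        if url and url not in repos:
            repos.append(url)

    return repos


# Hypothetical item: project_org is present but deliberately ignored,
# so the organization is never expanded into its repositories.
example_item = {
    "name": "Example Project",
    "repo_url": "https://github.com/example-org/core",
    "additional_repos": [{"repo_url": "https://github.com/example-org/cli"}],
    "project_org": "https://github.com/example-org",
}

print(listed_repositories(example_item))
```

Under these assumptions, an item that sets only `repo_url` yields a single repository, which matches the one-repo-per-project result reported in this issue.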

jmertic commented 2 weeks ago

Got it - so maybe could you also return project_org so we know in those cases that all repos in the org are included?

tegioz commented 1 week ago

The project_org field does not actually map to any functionality in landscape v2 at the moment. It doesn't exist in our types (and it's not documented), so we aren't even processing it.

I think this problem could also be solved by the annotations proposal I just shared with you. You could add an annotation to signal this status to your application in the way that fits you better (you could use the same field, or something completely different).

jmertic commented 1 week ago

That is a good potential workaround, but it does require manual maintenance on a landscape to see if a project adds new repos under its existing GH org.

tegioz commented 1 week ago

But it'd be the same as returning the project_org field, only it'd be a different field in a different location.

I would recommend not populating the landscape.yml file in an automated way to add all of an organization's repositories to landscape entries. We intentionally haven't added this feature to the landscapes generator, mainly for sustainability and reliability reasons.

Some organizations are huge, with hundreds and hundreds of repositories, and collecting data for all of them may lead to reaching GitHub rate limits. The landscape build process can also get considerably slower, as we cannot send requests to GitHub too fast or we'll hit the secondary rate limits.
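
To put rough numbers on this, here is a back-of-the-envelope Python sketch. The 5,000 requests per hour figure is GitHub's documented primary REST limit for an authenticated token. The requests-per-repository value and the repository count are purely illustrative assumptions, since the thread does not say how many requests the build issues per repository.

```python
GITHUB_HOURLY_LIMIT = 5000  # primary REST API limit per authenticated token


def requests_needed(repo_count: int, requests_per_repo: int) -> int:
    """Total API requests for one build, under the assumed per-repo cost."""
    return repo_count * requests_per_repo


def quota_fraction(repo_count: int, requests_per_repo: int, tokens: int = 1) -> float:
    """Fraction of the combined hourly quota that one build would consume."""
    total = requests_needed(repo_count, requests_per_repo)
    return total / (GITHUB_HOURLY_LIMIT * tokens)


# Illustrative assumption: one organization with 600 repositories and
# 10 requests per repository (not a measured landscape2 figure).
total = requests_needed(600, 10)
print(total, quota_fraction(600, 10))
```

With those assumed inputs, a single large organization alone would need more requests than one token allows in an hour, before counting any other landscape item.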

In most cases, only one repository, or a few of them, may be relevant for a particular landscape item, so processing an entire organization would be overkill. In other situations, there are multiple landscape items whose repositories are hosted in the same GitHub organization, so we'd end up misleadingly displaying the same stats for each of them. I think it's a feature that, for convenience reasons, can be misused too easily.
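
The shared-organization case can be checked with a short Python sketch that groups items by the organization segment of their `repo_url`. The helper names and the sample items are hypothetical, and the sketch assumes each item is a mapping with `name` and `repo_url` keys.

```python
from collections import defaultdict
from typing import Optional
from urllib.parse import urlparse


def github_org(repo_url: str) -> Optional[str]:
    """Return the first path segment of a GitHub URL (the org or user)."""
    parts = urlparse(repo_url).path.strip("/").split("/")
    return parts[0] if parts and parts[0] else None


def items_sharing_org(items: list[dict]) -> dict[str, list[str]]:
    """Map each organization to its item names when more than one item uses it."""
    by_org: dict[str, list[str]] = defaultdict(list)
    for item in items:
        org = github_org(item["repo_url"])
        if org:
            by_org[org].append(item["name"])
    return {org: names for org, names in by_org.items() if len(names) > 1}


# Hypothetical landscape items: two of them live in the same organization.
sample_items = [
    {"name": "Project A", "repo_url": "https://github.com/shared-org/project-a"},
    {"name": "Project B", "repo_url": "https://github.com/shared-org/project-b"},
    {"name": "Project C", "repo_url": "https://github.com/other-org/project-c"},
]

print(items_sharing_org(sample_items))
```

If stats were aggregated per organization, Project A and Project B in this sample would both report the totals for `shared-org`, which is the misleading result described above.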

IMHO the landscape is probably not the best place to explore all repositories in a GitHub organization, as the GitHub UI handles that much better 😉

Please note that if any of the landscapes we host exhausts our tokens' rate limits for this reason, we may need to pause its build temporarily until the number of repositories is reduced. The GitHub tokens are reused across landscapes, and leaving tokens with no requests available could affect other landscapes' builds.

Will close this one in favor of #716, we'll try to have the annotations ready as soon as possible 🙂

jmertic commented 1 week ago

I can understand where you are coming from on this for sure. The challenge comes from inconsistencies in how stats such as stars are reported relative to landscape v1; to get an accurate view of a project's activity, for example, we would almost need to bring in all the other repos in the org.

Let me chew on an approach for the projects I work with here. Appreciate all your help.

tegioz commented 1 week ago

Sure 👍 No worries, happy to help!