codecov / engineering-team

This is a general repo to use with GH Projects
1 stars 1 forks source link

[Shared] - Consider refactor to graphql endpoint #1322

Closed trent-codecov closed 6 months ago

trent-codecov commented 7 months ago

Rather than N+1 rest calls, we can leverage the graphql API to get the results we need.

Repository has a languages connection Language

Example query

Used with the explorer

query Repos($owner: String!, $cursor: String) {
    repositoryOwner(login: $owner) {
        repositories(
            first: 100
            ownerAffiliations: OWNER
            isFork: false
            isLocked: false
            orderBy: { field: NAME, direction: ASC }
            after: $cursor
        ) {
            totalCount

            pageInfo {
                hasNextPage
                endCursor
            }

            nodes {
                name
                description
                languages(first: 100) {
      totalSize
      edges {
        size
        node {
          name
          id
        }
      }
    }
            }
        }
    }
}
{
  "owner": "Codecov" 
}
adrian-codecov commented 7 months ago

@trent-codecov, I was doing some digging and here's what I've got. I'd like to take some time during our 1:1 or setup some meeting time to quickly go through this, so no need to read this if you don't have time.

The current REST approach loops around N repos that were synced in the sync_repos task and performs the sync_language task, calling the provider N times - these are repos we guaranteed have on our DB. The main difference with the GQL approach is it uses an owner to call M repos from the provider, where N and M could be different here (N being the repos synced in our DB, M being the repos in the provider). Because of that, we'd also have to make M DB calls to ensure the repo we see in the provider is also in our DB, then we update languages. We'd be trading some rate limit requests with some DB ones, imo not terrible but the task/code would be a bit different. Also would be a special case for gh separate from gl/bb. Not opposed to the refactoring, just that it's not a "replace rest w/ gql" change.

Other interesting things, mostly related to the sync_repos task + consequently to the sync_languages task:

TLDR, the sync_languages task currently gets the affected repoids from the sync_repo task. The sync_repo task has 2 variants: one for owners that have integrations (A), one for owners that don't (B). While they both sync_repos, variant B syncs a list of all repos a user has access to to - as an example, I have access to the repos from these owners (this is local data) {('github', 'adrian-codecov'), ('github', 'aviquez96'), ('github', 'codecov'), ('github', 'temp-org-test-123'), ('github', 'terry-codecov'), ('github', 'Turing-Corp')}. If you dig further into list_repos(), you'll see it behaves differently whether you supply a username, e.g. "codecov", which only looks for repos for that org, vs when you don't, which fetches all the repos from all the owners you have available. Now, I've played a little bit locally, and when you specify a username it's substantially faster; our current implementation I believe does what it does cause back in the day we used to have that "all repos" page right, but we've gotten rid of that and always specify an org, but this change didn't really reflect that. I think this is a HUGE opportunity for performance improvement around the sync task in general, and it might be that we just missed it in the echoes.

To be super explicit, when I click resync here, I'm resyncing for all of these even though I have Codecov selected. That imo can be arguably an undesired behavior Screenshot 2024-03-01 at 8 20 45 PM.

Now, maybe syncing "everything" is still a desired functionality, and we can play around that if we want to, but I think this can be an opportunity to optimize calls here. I'm not sure if enterprise customers have only 1 org, in which case this wouldn't help too much, but if they did, that would be $$

Lastly! Referring to variant A of our sync_repo tasks, that syncs based on whether you're using integration, I have reason to believe we're not leveraging that functionality at all. Can expand more on this too, but basically all of the places that call the sync task have using_integration = False afaik