Closed: trent-codecov closed this issue 6 months ago.
@trent-codecov, I was doing some digging and here's what I've got. I'd like to take some time during our 1:1 or set up a quick meeting to go through this, so no need to read this if you don't have time.
The current REST approach loops over the `N` repos that were synced in the `sync_repos` task and performs the `sync_languages` task, calling the provider `N` times. These are repos we're guaranteed to have in our DB. The main difference with the GQL approach is that it uses an owner to fetch `M` repos from the provider, where `N` and `M` can differ (`N` being the repos synced in our DB, `M` being the repos in the provider). Because of that, we'd also have to make `M` DB calls to ensure each repo we see in the provider is also in our DB before updating its languages. We'd be trading some rate-limited requests for some DB ones; imo not terrible, but the task/code would be a bit different. It would also be a special case for GH, separate from GL/BB. I'm not opposed to the refactoring, just noting it's not a simple "replace REST w/ GQL" change.
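To make the `N` vs `M` tradeoff concrete, here's a minimal sketch of the GQL-based flow described above. All names (`sync_languages_via_gql`, `provider_repos`, `db_repo_ids`) are illustrative, not the real worker code:

```python
# Hypothetical sketch: the provider returns M repos for an owner, but only
# repos already in our DB (the N synced ones) should get a language update.

def sync_languages_via_gql(provider_repos, db_repo_ids):
    """Return {service_id: language} updates for repos present in both sets.

    provider_repos: list of dicts from the provider (M entries), e.g.
        {"service_id": "123", "language": "python"}
    db_repo_ids: set of service_ids we already have in the DB (the N repos).
    """
    updates = {}
    for repo in provider_repos:  # M iterations
        sid = repo["service_id"]
        if sid in db_repo_ids:   # the extra "is it in our DB?" check
            updates[sid] = repo["language"]
    return updates

# Example: provider knows 3 repos, DB only has 2 of them.
provider = [
    {"service_id": "1", "language": "python"},
    {"service_id": "2", "language": "go"},
    {"service_id": "3", "language": "rust"},
]
print(sync_languages_via_gql(provider, {"1", "3"}))
# -> {'1': 'python', '3': 'rust'}
```

The membership checks here stand in for the `M` DB lookups; in practice you'd batch them into one query for the owner's repos rather than `M` round trips.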
Other interesting things, mostly related to the `sync_repos` task and, consequently, the `sync_languages` task:
TLDR: the `sync_languages` task currently gets the affected repoids from the `sync_repos` task. The `sync_repos` task has 2 variants: one for owners that have integrations (A), and one for owners that don't (B). While they both sync repos, variant B syncs a list of all repos a user has access to. As an example, I have access to the repos from these owners (this is local data): {('github', 'adrian-codecov'), ('github', 'aviquez96'), ('github', 'codecov'), ('github', 'temp-org-test-123'), ('github', 'terry-codecov'), ('github', 'Turing-Corp')}. If you dig further into `list_repos()`, you'll see it behaves differently depending on whether you supply a username, e.g. "codecov", which only looks for repos in that org, vs. when you don't, which fetches all the repos from all the owners you have available. I've played around a bit locally, and when you specify a username it's substantially faster. I believe our current implementation does what it does because back in the day we had that "all repos" page, but we've since gotten rid of it and always specify an org; the sync code just never reflected that change. I think this is a HUGE opportunity for performance improvement around the sync task in general, and it might be that we just missed it in the echoes.
To be super explicit, when I click resync here, I'm resyncing for all of these owners even though I have Codecov selected. That is arguably undesired behavior.
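As an illustrative sketch (not the real `list_repos` implementation), the behavioral difference looks like this: with a username we hit only that org, and without one we fan out across every owner the user can access. The `OWNERS` data is made up:

```python
# Toy stand-in for provider data: owner -> repos the user can access.
OWNERS = {
    "codecov": ["worker", "gazebo"],
    "adrian-codecov": ["playground"],
    "terry-codecov": ["demo"],
}

def list_repos(username=None):
    """Sketch of the two code paths described above."""
    if username is not None:
        # Targeted path: only this org's repos (fast).
        return [(username, r) for r in OWNERS.get(username, [])]
    # No username: fetch all repos from all available owners (slow path,
    # the legacy "all repos" page behavior).
    return [(o, r) for o, repos in OWNERS.items() for r in repos]

print(len(list_repos("codecov")))  # 2 repos, one org
print(len(list_repos()))           # 4 repos, every owner
```

The resync-while-one-org-is-selected behavior corresponds to always taking the no-username branch even though the targeted branch would suffice.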
Now, maybe syncing "everything" is still desired functionality, and we can play around with that if we want to, but I think this is an opportunity to optimize calls here. I'm not sure if enterprise customers have only 1 org, in which case this wouldn't help much, but if they have several, that would be $$.
Lastly! Referring to variant A of our `sync_repos` task, the one that syncs based on whether you're using an integration: I have reason to believe we're not leveraging that functionality at all. I can expand on this too, but basically all of the places that call the sync task pass `using_integration = False`, afaik.
Rather than `N+1` REST calls, we can leverage the GraphQL API to get the results we need: `Repository` has a `languages` connection of `Language` nodes, and an example query can be tried in the GraphQL Explorer.
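A sketch of such a query against the GitHub GraphQL API (the `owner`/`name` values are placeholders; `Repository.languages` is a real connection of `Language` nodes):

```graphql
query RepoLanguages {
  repository(owner: "codecov", name: "worker") {
    languages(first: 10, orderBy: { field: SIZE, direction: DESC }) {
      edges {
        size        # bytes of code in this language
        node {
          name      # e.g. "Python"
        }
      }
    }
  }
}
```

A single owner-level query could instead page through `repositoryOwner { repositories { ... } }` with a nested `languages` selection, which is what makes the one-call-for-`M`-repos approach above possible.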