datopian / metastore-lib

🗄️ Library for storing dataset metadata, with versioning support and pluggable backends including GitHub.
https://tech.datopian.com/versioning/
MIT License

[epic] Speed of Github API Calls #17

Open pdelboca opened 4 years ago

pdelboca commented 4 years ago

Subtasks created post-analysis

Specific tasks created from this analysis (along with related tasks created previously), in order of fix priority:

Original Bug Report

I'm using the extension with the GitHub backend, and the time it takes to do the GitHub API calls to fetch/create/update is really long, which is not good for the overall UX of the extension (I might say it's quite unusable).

The delay only happens when using GitHub as the backend, since the filesystem backend doesn't have this time overhead. For now I'm running it locally, but I don't see many reasons for this to be better when deployed to hosted services.

For package_update: (profiling screenshot attached)

For package_show: (profiling screenshot attached)

shevron commented 4 years ago

This has been a concern of mine for some time. I think it will be slightly better when deployed to Azure, because the round-trip time of HTTP requests to GitHub will most likely be shorter than from your machine, but it is still a problem.

We may be able to optimize a bit by caching some of the requests made to GitHub, at the ckanext-versioning level or the metastore-lib level.

We may be able to optimize a bit by switching some of the calls, especially for requests that make multiple API calls to fetch related data, to the newer GitHub GraphQL API (v4) (see https://github.com/datopian/metastore-lib/issues/3 for example).

But ultimately, calling the GitHub API to fetch data will be slower than calling a local DB or the file system.

shevron commented 4 years ago

I have created this: https://github.com/datopian/metastore-lib/issues/16 as a possible quick win.

rufuspollock commented 4 years ago

@shevron where's the profiling you were planning to do?

shevron commented 4 years ago

I'm attaching some cProfile dumps. They can be viewed nicely with several tools; I use snakeviz, which is pip-installable and runs everywhere.

I recommend zooming in on the 1st call immediately after t.py in the call stack (which should be the 1st call made from storage.py) and starting any analysis from there. What we want to look at is the part between this function (the top-level API action called in storage.py, e.g. create, fetch, etc.) and requestJsonAndCheck in Requester.py, which is where every HTTP request done by PyGithub is performed.
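If you want to reproduce or inspect dumps like these, here is a minimal sketch (the profiled callable is a hypothetical stand-in for whatever t.py actually drives):

```python
import cProfile
import pstats

def some_storage_action():
    """Hypothetical stand-in for the create/fetch call driven by t.py."""

# Produce a dump like the attached ones:
cProfile.run("some_storage_action()", "create.prof")

# Quick text-mode look at the hot path between storage.py and Requester.py:
pstats.Stats("create.prof").sort_stats("cumulative").print_stats("storage|Requester", 20)

# Interactive view: pip install snakeviz && snakeviz create.prof
```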

metastore-profiles.tar.gz

This file includes two profiles (I had to tarball them, otherwise GitHub won't allow the upload): one for a create action and one for a fetch action.

I will add some insight into this soon.

shevron commented 4 years ago

I also recommend using the "sunburst" view if using snakeviz, I find it much easier to work with.

shevron commented 4 years ago

Calling create is quite slow from my machine (almost 7 seconds in this profile). It may be less critical than fetch, which is much faster, but it is still quite slow.

What we can notice in the profile:

  1. The call to _get_owner is 0.8 seconds. The owner rarely changes in some use cases (especially the ckanext-versioning case), so we can probably cache it as suggested in #16 to reduce this time to close to 0 (it will still happen on the 1st use, depending on what caching backend we use); see the caching sketch after this list.

  2. Most of the time is spent in _create_commit, which is our own code calling 4 different GitHub endpoints (create commit, create tree, get ref for master, edit ref for master). IIRC we have to do all of them, but perhaps this can be optimized (maybe there is some kind of "shortcut API", or maybe GraphQL can do it in one mutation, or...). I need to study this further to understand; a rough PyGithub sketch of this sequence follows the call graph below.

  3. In 2nd place is create_repo, at about 25% of the time, which we simply have to call; there is no way around it.

  4. GitHub requires you to call the high-level API create_file on the repo after creating it and before calling any low-level Git APIs such as create_commit. I'm not sure why this is the case, but I have not been able to work around it; they say the repo can't be empty if you want to access the Git API (?). I'm using this to create a stub README.md file, then creating the actual first commit of datapackage.json.

  5. We're also calling get_branch to get the sha of the tip of the default branch before creating the next commit. This is about 11% of the time and might be dropped if there is a way to work around the need for it; I will need to dig some more in the GitHub API.
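To make item 1 concrete, here is a minimal caching sketch (the class layout is illustrative, not the actual metastore-lib code):

```python
class GitHubStorage:
    """Illustrative skeleton, not the real metastore-lib class."""

    def __init__(self, github_client, organization=None):
        self._gh = github_client   # a github.Github instance
        self._org = organization
        self._owner = None         # memoized owner object

    def _get_owner(self):
        # One API round trip on first use, then ~0 cost for the lifetime
        # of this instance; the owner rarely changes in practice.
        if self._owner is None:
            if self._org:
                self._owner = self._gh.get_organization(self._org)
            else:
                self._owner = self._gh.get_user()
        return self._owner
```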

Here is a PNG of the call graph for create:

github-create-callgraph
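For reference on item 2, the four-endpoint sequence maps roughly onto PyGithub's low-level Git API like this (a sketch based on my reading of the profile, not the actual metastore-lib code; note the extra get_git_commit fetch needed to obtain the parent tree):

```python
import github

def create_commit_sketch(repo, content, message, branch="master"):
    # 1. get ref for master (this also gives us the tip commit sha)
    ref = repo.get_git_ref("heads/" + branch)
    parent = repo.get_git_commit(ref.object.sha)  # extra fetch, for the base tree
    # 2. create tree holding the new datapackage.json
    tree = repo.create_git_tree(
        [github.InputGitTreeElement("datapackage.json", "100644", "blob",
                                    content=content)],
        base_tree=parent.tree,
    )
    # 3. create commit
    commit = repo.create_git_commit(message, tree, [parent])
    # 4. edit ref for master to point at the new commit
    ref.edit(commit.sha)
    return commit
```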

shevron commented 4 years ago

There are also 2 places where an HTTP request seems to be triggered from PyGithub's _completeIfNeeded. I'm guessing this is only called if some attribute was not provided but has a sensible default, e.g. the default branch name. Maybe we can shave this off (~11% of the time) if we provide some values explicitly.

rufuspollock commented 4 years ago

@shevron 👏 great analysis. We have 3 areas I guess:

My understanding is this refers to Create Dataset. Overall, I'm actually not that concerned if Create Dataset is slow, since it happens rarely and we could always give the user feedback to improve the UX (e.g. a spinner and updates like "creating dataset", "adding metadata", ...). If Update Dataset is slow, that may be a bigger issue.

> The call to _get_owner is 0.8 seconds. The owner rarely changes in some use cases (especially the ckanext-versioning case), so we can probably cache it as suggested in #16 to reduce this time to close to 0 (it will still happen on the 1st use, depending on what caching backend we use)

I'm not sure what _get_owner does. Rather than caching (which adds complexity), could this be explicitly provided by the instantiator?

shevron commented 4 years ago

> My understanding is this refers to Create Dataset. Overall, I'm actually not that concerned if Create Dataset is slow, since it happens rarely and we could always give the user feedback to improve the UX (e.g. a spinner and updates like "creating dataset", "adding metadata", ...). If Update Dataset is slow, that may be a bigger issue.

Correct, this was just a first dump of data (I am adding "fetch" now); I just had to leave, so I wanted to save it here before I do. Agree about UX, but a lot of what is done for create also happens for update, so this analysis is still mostly relevant.

> I'm not sure what _get_owner does. Rather than caching (which adds complexity), could this be explicitly provided by the instantiator?

Sorry, you are right, it is a bit vague... _get_owner() is a private method that fetches either an Organization object or an AuthenticatedUser object from GitHub, depending on whether the owner is the current user (the owner of the access token) or an organization. This object, regardless of its type, is then used solely for the purpose of calling get_repo(name) to get a PyGithub Repository object describing the repository. This is needed for pretty much every action we do.

Perhaps we don't really need to fetch the repo object from GitHub, but can somehow "generate" it. It is one of the biggest classes in PyGithub and has a lot of properties, and all except the URL (I think) could be lazy-loaded when needed. Maybe we can create this object and manually set some properties we need and it will just work, but I need to investigate this further.
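For what it's worth, PyGithub's get_repo already accepts a lazy flag that builds the Repository object locally and defers HTTP requests until a missing attribute is actually accessed; whether we can rely on it everywhere is exactly what needs investigating. A sketch:

```python
from github import Github

gh = Github("<access token>")

# lazy=True constructs the Repository from its full name without an API
# call; PyGithub only issues a request later (via _completeIfNeeded) when
# an attribute that wasn't pre-set is accessed.
repo = gh.get_repo("my-org/my-dataset", lazy=True)  # no HTTP request yet

# Method calls hit their own endpoints directly, so the up-front fetch of
# the big Repository object is skipped entirely:
contents = repo.get_contents("datapackage.json")
```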

shevron commented 4 years ago

Analysis for fetch:

github-fetch-callgraph

Fetch is actually quite fast on my machine compared to create: under 1 second. A lot of the time is spent in _get_owner, which as discussed may be optimized out or cached. It is possible that get_repo may also be worked around (another ~15% of the time) using the same strategy discussed above. I don't think the other calls (get_git_ref or get_tag to get the revision sha, get_commit to get the git commit, and get_contents to get the contents of datapackage.json from that commit) can be avoided, but using the GraphQL API could perhaps allow us to send only one or two requests instead of 3 (rough sketch below).
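Here is a rough sketch of collapsing those three calls into one GraphQL request (using plain requests, since PyGithub doesn't speak GraphQL; the owner/repo/ref/path values are placeholders, and an annotated tag would need an extra `... on Tag` inline fragment):

```python
import requests

# One POST to the GraphQL endpoint instead of get_git_ref + get_commit +
# get_contents. All names below are placeholders.
QUERY = """
query($owner: String!, $name: String!, $ref: String!, $path: String!) {
  repository(owner: $owner, name: $name) {
    ref(qualifiedName: $ref) {
      target {
        ... on Commit {
          oid                                 # the revision sha
          file(path: $path) {
            object { ... on Blob { text } }   # datapackage.json contents
          }
        }
      }
    }
  }
}
"""

resp = requests.post(
    "https://api.github.com/graphql",
    json={"query": QUERY,
          "variables": {"owner": "my-org", "name": "my-dataset",
                        "ref": "refs/tags/v1.0", "path": "datapackage.json"}},
    headers={"Authorization": "bearer <access token>"},
)
target = resp.json()["data"]["repository"]["ref"]["target"]
```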

shevron commented 4 years ago

I noticed that in the OP, @pdelboca reported that package_show took ~7 seconds. I need to analyze this further, because from what I can tell metastore-lib's fetch is not the culprit; maybe package_show is doing other things, or we just got lucky / unlucky with a single test.

shevron commented 4 years ago

Analysis for update (non-partial, and not referencing a base commit, as performed by ckanext-versioning):

github-update-callgraph

  1. This calls _get_owner / get_repo twice: once when fetching and then again when updating. Surely in this case we can just fetch the repo once and pass it around / cache it at the instance level instead. That may shave off about 9% right off the bat.

  2. We call fetch (37%) even though this is not a partial update and we do not need the data. This we can simply not do 🤦 (see the sketch after this list; and if we don't call fetch, we don't really need the previous fix, although it would still be nice to do anyway, as it would also speed up partial updates a bit).

  3. We call _create_commit (40%), which works similarly to create, so the same analysis as above applies.

  4. We call get_branch to get the tip commit sha of the default branch. We could save this somewhere in the CKAN DB and pass it in, but I really don't think this is worth optimizing (less than 5% of the time), and it could create a lot of consistency headaches.
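A minimal sketch of the item 2 fix (the signature and helper names are hypothetical; the point is just to guard the fetch):

```python
def update(self, package_id, metadata, partial=False):
    # Hypothetical update flow: only pay for fetch's extra API round
    # trips when the caller sends a partial patch that must be merged.
    if partial:
        existing = self.fetch(package_id)
        metadata = {**existing, **metadata}
    return self._create_commit(package_id, metadata)
```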

shevron commented 4 years ago

@rufuspollock I think there is enough here to create some concrete tasks and get fixing. Any thoughts on how to proceed, or should I just create a few additional tickets so we can prioritize them?

pdelboca commented 4 years ago

@shevron when I reported this with the screenshots, I was testing with a dataset that had 4 or 5 versions created, so the ~7 secs in package_show may be related to #3 (although I'm not sure).

shevron commented 4 years ago

> @shevron when I reported this with the screenshots, I was testing with a dataset that had 4 or 5 versions created, so the ~7 secs in package_show may be related to #3 (although I'm not sure).

This is highly likely, as we know tag listing (aka "versions") slows down linearly with the number of tags (sketched below).
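A sketch of why it's linear with PyGithub's REST wrappers (illustrative; assumes repo is a PyGithub Repository object — the tags endpoint only returns each tag's name and commit sha, so any per-tag detail costs one extra request):

```python
# Listing tags is paginated (one request per ~30 tags), but the payload
# only carries each tag's name and its commit sha...
for tag in repo.get_tags():
    sha = tag.commit.sha  # available straight from the list payload
    # ...so touching anything deeper, e.g. the commit date for display,
    # triggers PyGithub's lazy completion: +1 request per tag.
    created = tag.commit.commit.author.date
```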

rufuspollock commented 4 years ago

@shevron one question here: I thought that revision tags are loaded in JavaScript (and hence async) and should not block page load, so being slow is not great but should not block the page (?)

shevron commented 4 years ago

@rufuspollock I thought so too, but I will verify.

shevron commented 4 years ago

@rufuspollock actually no - the version list is rendered on the server. I think it was AJAX at some point, but that hasn't been the case for a long time.

shevron commented 4 years ago

> @rufuspollock actually no - the version list is rendered on the server. I think it was AJAX at some point, but that hasn't been the case for a long time.

I created this ticket to make it async: https://github.com/datopian/ckanext-versioning/issues/38

rufuspollock commented 4 years ago

@shevron ok great re async rendering. I think that will help solve a lot of the issue. I note that GitHub only shows the number of releases/revision tags and that the actual list is rendered on a separate page (I think we may want to go in that direction, frankly).

shevron commented 4 years ago

Specific tasks created from this analysis (along with related tasks created previously):

rufuspollock commented 4 years ago

@shevron great - can you move this to the top of the issue description and give an assessment next to each (rough payoff and rough cost in days)?