Closed jennybc closed 6 years ago
That's a good question.
This is not a particularly strong opinion, but all else being equal, I would prefer to let pagination be handled downstream. There are a few reasons for this:
1. `rgithub` has a very strong one-to-one equivalence between one call to `rgithub` and one HTTP request. This has the nice benefit that the call rate is very predictable, and it's much less likely that you'll hit rate limits by accident.
2. `per_page` works uniformly across GitHub's API, which means that (because of 1) `per_page` behaves uniformly in `rgithub` as well.

I would have assumed that GitHub itself wouldn't be a fan of auto-pagination, because it hits their servers more often than necessary, but octokit is written by GitHubbers, so I'm clearly wrong there.
So let's look at it from the other side, and assume we want to implement auto-pagination. What would it look like in the API?
Option 1) We could make it a named parameter that is interpreted in a special way. This would mean that code which calls `rgithub` functions that return paginated results would change very little. At the same time, I like that the base implementation of `rgithub` is very uniform: I don't want to have to add anything new to `github.R`; a general `.api.request` seems to be the right way to do it. So the solution is to find all the HTTP entry points that return paginated results and write a wrapper, something like `.paginated.api.get.request`, and then have `get.my.repositories`, etc. all call `.paginated.api.get.request` instead of `.api.get.request`.
Option 2) For all entry points that return paginated results, create a new entry point that does auto-pagination. So you'd have `get.my.repositories` and `get.my.autopaginated.repositories` (bear with my terrible naming skills for now).
Which would you prefer?
This is not a proper response, but I'm adding another link where the GitHubbers really dig into the details of traversing with pagination. It seems relevant to the discussion regardless.
I haven't done enough of this manually yet to say how deeply I want this integrated into `rgithub` itself. So why don't I do that first? I appreciate the discussion. Even if you go with your instinct to keep this out of your package, maybe we could collaborate on a vignette that shows some idioms for, e.g., traversing all pages.
Your mention of the idiom made me think of a general "pagination wrapper" call which uses "computing on the language" (to use Hadley's parlance) to do the right thing. Something like this: `get.my.repositories()` is the non-paginated version, but `auto.page(get.my.repositories())` actually does the right thing for paginating the inner call.
I think that should be doable, looks elegant, and wouldn’t require changing the base code at all.
What do you think?
I think that is a great idea! I will try to turn my evolving "manual" approach into that. I have already written helpers to parse the link headers, which will be generally useful for the `auto.page()` function you propose.
Maybe this is a simple question, but I can't figure it out. How do I implement a "manual" approach? I'm trying to get a list of all files changed by a specific pull request:

```r
pull <- get.pull.request.files(owner = "rails", repo = "rails", id = 572, ctx = get.github.context())
```

however, it seems to stop at 30 entries/files or so. I've tried e.g.

```r
pull <- get.pull.request.files(owner = "rails", repo = "rails", id = 572, ctx = get.github.context(), page = 3, per_page = 30)
```

but it gives `Error in get.pull.request.files(owner = "rails", repo = "rails", id = 572, : unused arguments (page = 3, per_page = 30)`. How can I access them all?
You will have to do all of this "manually". I have written some functions to facilitate this but this is all in my own local fork of this package and is very rough. Here's the sketch of what I do. In my case, I was trying to list repositories, but I would imagine the same thing works for pull requests.
First, max out the number of repos per page, i.e. set it to 100, and see if that gets everything.
```r
repos <- get.organization.repositories(org = "STAT545-UBC", per_page = 100)
```
In the `repos` result, see if the headers contain a link field:

```r
repos$headers$link
```
If it does not, you have gotten everything! YAY! You're done.
If it does, you absolutely must traverse pages. Boo. We continue.
I get the total number of repositories by asking for all repositories, setting `per_page` to 1:

```r
repos <- get.organization.repositories(org = "STAT545-UBC", per_page = 1)
```
Then you must parse the information in the `link` field of the header. Here's where I've written functions to turn this into a tidy data.frame of links, one per row. You extract the page number of the link labelled "last" and you've learned how many repos there are.
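As a concrete illustration, here's a minimal base-R sketch of such a parser. The helper name is my own invention, and this is deliberately simplified; the actual version linked later in this thread uses `plyr`, `dplyr`, and `stringr`.

```r
# Hypothetical helper: parse a raw GitHub Link header into a data.frame
# with one row per link relation ("next", "last", etc.).
parse_link_header <- function(link) {
  pieces <- strsplit(link, ",\\s*")[[1]]
  url  <- sub("^<(.*)>.*$", "\\1", pieces)               # the URL between < and >
  rel  <- sub('^.*rel="([^"]+)".*$', "\\1", pieces)      # the relation label
  page <- as.integer(sub("^.*[?&]page=([0-9]+).*$", "\\1", url))  # page number, if numeric
  data.frame(rel = rel, page = page, url = url, stringsAsFactors = FALSE)
}

links <- parse_link_header(
  '<https://api.github.com/repositories?page=2>; rel="next", <https://api.github.com/repositories?page=23>; rel="last"'
)
links$page[links$rel == "last"]  # 23: the total number of pages
```

The page number on the "last" row is exactly the count Jenny describes extracting above.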
You go back to asking for 100 repos per page. Determine how many pages the results will be broken into. Then, inside some iterative structure, request the repos with `per_page` set to 100, parse the link field of the headers to get the URL for the next page, and repeat until you're done. Then you also have to glue all the results back together manually, which was my motivation for this question I put out on Twitter.
In short, this is possible but a total pain in the butt.
I don't think the `per_page` argument works for all API calls; for example:

```r
pull <- get.pull.request.files(owner = "rails", repo = "rails", id = 572, ctx = get.github.context(), per_page = 1)
```

returns the error `unused argument (per_page = 1)`.
@jennybc, would you mind pointing to where in your forked repo you have the workaround?
@aronlindberg I haven't pushed that stuff to my fork on GitHub. It's in a truly embarrassing state.
If you look at the header of the example of listing pull requests via the API, you'll see there are indeed the usual "next" and "last" links:
https://developer.github.com/v3/pulls/#list-pull-requests
so obviously pagination happens, as of course you have already discovered.
Looking at the relevant function in this package, I see there is no way for the user to pass `per_page`:
https://github.com/cscheid/rgithub/blob/153bde1466252952e21c1fdb2ff0b452b9c8de99/R/pulls.R#L84-L85
Contrast that to the function for requesting a list of organization repositories, which uses the `...` argument:
Best case, the definition of `github::get.pull.request.files()` just needs to be modified to pass `...` correctly, and then the manual workflow outlined above is possible. Anecdotally, from my own usage, I think there are other functions that have the same problem.
I tried adding `params = list(...)` to `get.pull.request.files` so that it reads:

```r
get.pull.request.files <- function(owner, repo, id, ctx = get.github.context())
  .api.get.request(ctx, c("repos", owner, repo, "pulls", id, "files"), params = list(...))
```
However, calling

```r
get.pull.request.files(owner = "rails", repo = "rails", id = 572, ctx = get.github.context(), page = 1)
```

still generates `unused argument (page = 1)`. I'm new at this; can anyone point me in the right direction for how to solve this? I looked at:
https://developer.github.com/v3/pulls/#list-pull-requests-files
and it looks like I should be able to pass the `page` parameter there...
Isn't it `per_page`?
Neither works... Shouldn't I be able to pass both of them once the `...` parameter is implemented correctly?
I'm going to try changing the def'n of `get.pull.request.files()` and see if I succeed…
OK that worked. I'll make a PR to @cscheid in a moment. Here you can prove it to yourself:
```r
library(github)

# I store my access token in an environment variable
ctx <- create.github.context(access_token = Sys.getenv("GITHUB_TOKEN"))

## using rgithub "as is"
req <- get.pull.request.files(owner = "rails", repo = "rails", id = 572)
length(req$content) # by default, 30 items are retrieved
req$headers$link    # I can see 30 items per page will imply we need 3 pages

## new definition of get.pull.request.files()
## I will make a PR to cscheid/rgithub momentarily
jfun <- function(owner, repo, id, ..., ctx = get.github.context())
  github:::.api.get.request(ctx, c("repos", owner, repo, "pulls", id, "files"),
                            params = list(...))

jreq <- jfun(owner = "rails", repo = "rails", id = 572, per_page = 3)
length(jreq$content) # yep, it's honoring my request for 3 per page
jreq$headers$link    # at this rate, we'll need 23 pages
```
Cool! =) So I was using:

```r
ctx = interactive.login("client_id", "client_secret")
```

while you are using

```r
ctx <- create.github.context(access_token = Sys.getenv("GITHUB_TOKEN"))
```

Is it then that the token allows you to pass `params = list(...)` but not client_id/client_secret?
No, I don't think this has anything to do with our different approaches to authentication.
When you redefined `get.pull.request.files()` above, the definition doesn't actually include `...` in the formal arguments. So when you added `params = list(...)` to the call to `.api.get.request()`, R complains that it doesn't know what to do with an additional argument such as `per_page`.
Aaaah! I see, yes, it works with my authentication too! The dots were just sooooo small, so I missed them. =) Thanks!
@aronlindberg I've made a PR now (#39). This change is reflected in my fork, in case you're in a huge hurry. You could install from my fork to get the functionality and switch back once it's merged.
@jennybc - thanks, I was able to implement the same change in my own fork. Thanks for having a dialogue about it; it helps me better understand how these functions are written. Now I can start trying to construct a manual pagination approach for the script I'm writing.
@aronlindberg Here's the function from my "private and embarrassing" branch (the one I haven't pushed) that's a good start on digesting link headers for the depagination operation :smile:
https://gist.github.com/jennybc/862a01dc9243118d83c9#file-digest_header_links-r
As you can see, I use some functions from `plyr` and `dplyr`, in addition to `stringr`, which this package already imports. This data.frame is helpful for determining how many pages there are and getting the URL for the next page.
Thanks @jennybc, that's helpful! It seems that since `...` can be passed, when I run

```r
get.pull.request.files(owner = "django", repo = "django", id = i, ctx = get.github.context(), per_page = 1000)
```

I can get all the results that I want. Do you know if there is an absolute limit to the `per_page` argument?
I thought it was 100. Are you actually getting 1000 items in one request?
Whoops. I think it's pretty clear that the limit is 100: https://developer.github.com/guides/traversing-with-pagination/ I'll have to start digging into your manual mechanism...
A rather messy example (mine) of how to implement @jennybc 's function can be found here: https://gist.github.com/aronlindberg/2a9e9802579b2d239655
Glad you are making progress @aronlindberg! I think that will work for your specific application (pull request files).
However, to build something into this package, you actually have to rely on the "next" links, rather than requesting specific pages via `page = i`. Here's the relevant excerpt from the official docs re: pagination:

> Keep in mind that you should always rely on these link relations provided to you. Don't try to guess or construct your own URL. Some API calls, like listing commits on a repository, use pagination results that are based on SHA values, not numbers.
So, using a `while` loop and creeping from one page to the next looks virtually unavoidable. In the long term and in the general case, @cscheid's inclination to create a general pagination wrapper seems promising.
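To make the "next"-link approach concrete, here's a minimal sketch. Everything in it is hypothetical scaffolding: `fetch_page()` stands in for whatever actually issues the HTTP GET (in `rgithub` terms, something built on `.api.get.request`), faked locally with three pages so the control flow is visible, and each response is assumed to already carry a parsed `links` list.

```r
# fetch_page() is a fake stand-in for a real HTTP GET: it serves three
# canned "pages", each with content and a parsed links list.
fetch_page <- function(url) {
  pages <- list(
    p1 = list(content = list(1, 2), links = list("next" = "p2")),
    p2 = list(content = list(3, 4), links = list("next" = "p3")),
    p3 = list(content = list(5),    links = list())  # no "next": last page
  )
  pages[[url]]
}

# Depaginate by following "next" links, never constructing page URLs ourselves.
fetch_all <- function(first_url) {
  out <- list()
  url <- first_url
  while (!is.null(url)) {
    page <- fetch_page(url)
    out[[length(out) + 1]] <- page$content
    url <- page$links[["next"]]  # NULL when there is no further page
  }
  unlist(out, recursive = FALSE)  # glue the pages back together
}

fetch_all("p1")  # a list of 5 items, gathered across 3 pages
```

Because the loop only ever dereferences the server-provided "next" link, it also works for SHA-based pagination, where page numbers don't exist.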
Just skimming, but I don't think you should need computing on the language to do this - if the response object was a bit richer, I think you should be able to do autopagination by inspecting the request metadata.
I am now enjoying the page traversal @gaborcsardi put into `gh`:
https://github.com/gaborcsardi/gh/blob/master/R/pagination.R
Oh wait, any reason we're using `rgithub` instead of jumping ship to `gh`?
EDIT: That one is a little too low-level for my taste. So we'll live with `rgithub` and its crumminess for now.
Maybe `rgithub` could rely on `gh` for the low-level stuff?
Just wanted to chime in. I did create a simple paginator for `rgithub` along the lines of what @hadley seems to be describing (one that inspects the request results). I intended to submit it as a pull request, but given that this discussion has already started, it may make sense to introduce it right now.
I have used it a lot over the last few months and it seems to work reliably.
It could be used as follows:

```r
# without paginator
first_30_forks <- get.repository.forks(owner_name, repo_name)

# with paginator
all_forks <- auto.page(get.repository.forks(owner_name, repo_name))
```
Here is the code of that future pull request (given that it is short, I am typing it here as a comment as well):
```r
# Make automated paging till response is empty
auto.page <- function(f) {
  f_call <- substitute(f)
  stopifnot(is.call(f_call))

  i <- 1
  req <- list()
  result_lst <- list()

  repeat {
    # Specify the page to download
    f_call$page <- i
    req <- eval(f_call, parent.frame())

    # Last page has empty content
    if (length(req$content) <= 0) break

    result_lst[[i]] <- req$content
    i <- i + 1
  }

  result_req <- req
  result_req$content <- unlist(result_lst, recursive = FALSE)
  result_req
}
```
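To see the `substitute`/`eval` trick in action without hitting the API, here's a toy demonstration. `fake.endpoint` is entirely made up for illustration: a purely local "endpoint" that serves two items per page out of five, so page 3 is partial and page 4 comes back empty. (`auto.page` is repeated verbatim from above so the snippet is self-contained.)

```r
# auto.page() as posted above, repeated so this sketch runs on its own.
auto.page <- function(f) {
  f_call <- substitute(f)
  stopifnot(is.call(f_call))
  i <- 1
  req <- list()
  result_lst <- list()
  repeat {
    f_call$page <- i                      # rewrite the captured call: add page = i
    req <- eval(f_call, parent.frame())
    if (length(req$content) <= 0) break   # an empty page means we're done
    result_lst[[i]] <- req$content
    i <- i + 1
  }
  result_req <- req
  result_req$content <- unlist(result_lst, recursive = FALSE)
  result_req
}

# Toy, purely local "endpoint" (made up for illustration): 2 items per page.
fake.endpoint <- function(page = 1) {
  all_items <- as.list(letters[1:5])
  idx <- ((page - 1) * 2 + 1):(page * 2)
  idx <- idx[idx <= length(all_items)]    # page 4 yields an empty content list
  list(content = all_items[idx])
}

res <- auto.page(fake.endpoint())
length(res$content)  # 5: all three pages glued back together
```

Note that `auto.page` never sees `fake.endpoint` run; it captures the unevaluated call, injects `page = i`, and evaluates it fresh on each iteration.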
@akhmed1, that looks great! @jennybc, @hadley: does that look sufficiently in style? I'm always wary of `substitute`, `eval`, and `parent.frame`, so I'd prefer to get some expert reviewers to go over it :)
@cscheid great! Would love to hear the feedback! One word of caution: some functions in `rgithub` do not allow a `page` argument to be passed to GitHub, while other functions do. `get.organization.teams` is one example that does not. It is currently defined as:

```r
get.organization.teams <- function(org, ctx = get.github.context())
  .api.get.request(ctx, c("orgs", org, "teams"))
```
However, `auto.page` needs the ability to pass a `page` argument to the API (otherwise, pagination can't work). The good news: this is a trivial change to the current function declarations, such as:

```r
get.organization.teams2 <- function(org, ctx = get.github.context(), ...)
  .api.get.request(ctx, c("orgs", org, "teams"), params = list(...))
```

The latter would work perfectly with `auto.page`.
@akhmed1, that's issue #40 :)
One observation from the pagination workaround I was using with this package: once you deal with the #40 problem, you can make an initial request with `per_page = 1`, then dig the total number of objects out of the link header for the "last" page. This way there's no mystery about how many things you'll get. You can pick your own page size and make the real request(s) without having to grow objects or test for empty content. This may not really matter, but I liked it better this way.
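The arithmetic behind that trick, as a sketch (the numbers here are made up for illustration): with `per_page = 1`, the page number on the "last" link equals the total object count, and the number of full-size requests follows directly.

```r
# Suppose the per_page = 1 probe's "last" link pointed at page 267; that
# means 267 objects in total (a made-up count, for illustration).
n_items  <- 267
per_page <- 100                          # the real page size we'll request
n_pages  <- ceiling(n_items / per_page)  # 3 requests will be needed
results  <- vector("list", n_pages)      # preallocate instead of growing
n_pages
```

Knowing `n_pages` up front is what lets you preallocate and skip the empty-content test entirely.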
@jennybc Your points are well taken! Unfortunately, that approach did not work for me (we are using GitHub Enterprise): the `last` or `next` page links were not reliable; sometimes empty strings were returned when other pages did exist. Not sure if GitHub.com has the same issues, though.
That's weird! No, I haven't had that problem on GitHub.com.
In Traversing with Pagination they try to scare us away from asking for specific pages by number:

> Keep in mind that you should always rely on these link relations provided to you. Don't try to guess or construct your own URL. Some API calls, like listing commits on a repository, use pagination results that are based on SHA values, not numbers.

but ... I've never accessed an exotic endpoint with non-numeric pagination. Sounds like you have something that works 😃.
Hello, is it just me, or does pagination not work with `GET /repos/:owner/:repo/stats/contributors`? It looks like it is limited to 100 even on the repository graph page, and the argument `page=2` doesn't return the expected result.
Yes, this is an issue. As you can see, I've been unable to work on this for a long time now. I would really welcome a pull request if you have one. Sorry about that.
Closed (wontfix) by #70.
Have you thought about pagination at all?
https://developer.github.com/v3/#pagination
I'm not sure what the conventions are for API wrappers and assisting the user to fetch multiple or all pages. But the Ruby wrapper for the GitHub API talks about this in its README:
I'm thinking about this because I'm using `get.my.repositories()` and have realized that by default I'm getting only the first 30. Before I start playing around with explicit requests for specific pages, I'm wondering if you are contemplating adding any auto-pagination to your package…