cscheid / rgithub

R bindings for the github API
MIT License

Pagination #30

Closed jennybc closed 6 years ago

jennybc commented 9 years ago

Have you thought about pagination at all?

Requests that return multiple items will be paginated to 30 items by default.

https://developer.github.com/v3/#pagination

I'm not sure what the conventions are for API wrappers and assisting the user to fetch multiple or all pages. But the Ruby wrapper for the GitHub API talks about this in its README:

For smallish resource lists, Octokit provides auto pagination. When this is enabled, calls for paginated resources will fetch and concatenate the results from every page into a single array.

I'm thinking about this because I'm using get.my.repositories() and have realized that by default I'm getting only the first 30. Before I start playing around with explicit requests for specific pages, I'm wondering if you are contemplating adding any auto pagination to your package…

cscheid commented 9 years ago

That's a good question.

This is not a particularly strong opinion, but all else being equal, I would prefer to let pagination be handled downstream. There are a few reasons for this:

  1. Right now rgithub has a very strong one-to-one equivalence between one call to rgithub and one HTTP request. This has the nice benefit that call rate is very predictable, and it's much less likely that you'll hit rate limits by accident.
  2. It's easy to increase page size. I think per_page works uniformly across GitHub's API, which means that (because of 1) per_page behaves uniformly in rgithub as well.
  3. It's more work :)

I would have assumed that GitHub itself wouldn't be a fan of auto-pagination, because it hits their servers more often than necessary, but octokit is written by GitHubbers, so I'm clearly wrong there.

So let's look at it from the other side, and assume we want to implement auto-pagination. What would it look like in the API?

Option 1) We could make it a named parameter that is interpreted in a special way. This would mean that code which calls rgithub functions that return paginated results would change very little.

At the same time, I like that the base implementation of rgithub is very uniform: I don't want to have to add anything new to github.R: a general .api.request seems to be the right way to do it.

So the solution is to find all the HTTP entry points that return paginated results, and write a wrapper. Something like .paginated.api.get.request, and then have get.my.repositories, etc. all call .paginated.api.get.request instead of .api.get.request.
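
A minimal sketch of what that wrapper might look like (names are placeholders; it assumes .api.get.request takes a params list, as the entry points elsewhere in this thread do, and that an empty page marks the end):

.paginated.api.get.request <- function(ctx, path, params = list()) {
  page <- 1
  results <- list()
  repeat {
    # request one page at a time until a page comes back empty
    params$page <- page
    req <- .api.get.request(ctx, path, params = params)
    if (length(req$content) == 0) break
    results[[page]] <- req$content
    page <- page + 1
  }
  # hand back the last response, with all pages' content concatenated
  req$content <- unlist(results, recursive = FALSE)
  req
}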

Option 2) For all entry points that return paginated results, create a new entry point that does auto-pagination. So you'd have get.my.repositories and get.my.autopaginated.repositories (bear with my terrible naming skills for now).

Which would you prefer?

jennybc commented 9 years ago

This is not a proper response, but I'm adding another link, where the GitHubbers really dig into the details: Traversing with pagination. It seems relevant to the discussion regardless.

jennybc commented 9 years ago

I haven't done enough of this manually yet to say how deeply I want this integrated into rgithub itself. So why don't I do that first? I appreciate the discussion. Even if you go with your instinct to keep this out of your package, maybe we could collaborate on a vignette that shows some idioms for, e.g., traversing all pages.

cscheid commented 9 years ago

Your mention of the idiom made me think of a general “pagination wrapper” call which uses “computing-on-the-language” (to use Hadley’s parlance) to do the right thing. Something like this:

get.my.repositories() is the nonpaginated version, but

auto.page(get.my.repositories()) actually does the right thing for paginating the inner call.

I think that should be doable, looks elegant, and wouldn’t require changing the base code at all.

What do you think?

jennybc commented 9 years ago

I think that is a great idea! I will try to turn my evolving "manual" approach into that. I have already written helpers to parse the link headers, which will be generally useful for the auto.page() function you propose.

aronlindberg commented 9 years ago

Maybe this is a simple question, but I can't figure it out. How do I implement a "manual" approach? I'm trying to get a list of all files changed by a specific pull request:

pull <- get.pull.request.files(owner = "rails", repo = "rails", id = 572, ctx = get.github.context())

However, it seems to stop at 30 entries/files or so. I've tried e.g.

pull <- get.pull.request.files(owner = "rails", repo = "rails", id = 572, ctx = get.github.context(), page = 3, per_page = 30)

but it gives:

Error in get.pull.request.files(owner = "rails", repo = "rails", id = 572, : unused arguments (page = 3, per_page = 30)

How can I access them all?

jennybc commented 9 years ago

You will have to do all of this "manually". I have written some functions to facilitate this but this is all in my own local fork of this package and is very rough. Here's the sketch of what I do. In my case, I was trying to list repositories, but I would imagine the same thing works for pull requests.

First, max out the number of repos per page, i.e. set it to 100, and see if that gets everything.

repos  <- get.organization.repositories(org = "STAT545-UBC", per_page = 100)

In the repos result, see if the headers contain a link field:

repos$headers$link

If it does not, you have gotten everything! YAY! You're done.

If it does, you absolutely must traverse pages. Boo. We continue.

I get the total number of repositories by asking for all repositories, setting per_page to 1:

repos  <- get.organization.repositories(org = "STAT545-UBC", per_page = 1)

Then you must parse the information in the link field of the header. Here's where I've written functions to turn this into a tidy data.frame of links, one per row. You extract the page number of the link labelled "last" and you've learned how many repos there are.

You go back to asking for 100 repos per page. Determine how many pages the results will be broken into. Do this inside some iterative structure: request the repos with per_page set to 100, parse the link field of the headers to get the URL for the next page, and repeat until you're done. Then you also have to glue all the results together again manually, which was my motivation for this question I put out on Twitter.

In short, this is possible but a total pain in the butt.
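
For concreteness, a rough sketch of that traversal (get_next_link() is a hypothetical stand-in for my link-parsing helpers, returning NULL once there is no "next" link; httr does the raw GETs on the "next" URLs):

library(httr)

req <- get.organization.repositories(org = "STAT545-UBC", per_page = 100)
all_content <- req$content
next_url <- get_next_link(req$headers$link)
while (!is.null(next_url)) {
  resp <- GET(next_url,
              add_headers(Authorization = paste("token", Sys.getenv("GITHUB_TOKEN"))))
  all_content <- c(all_content, content(resp))
  next_url <- get_next_link(headers(resp)$link)
}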

aronlindberg commented 9 years ago

I don't think the per_page argument works for all API calls. For example, pull <- get.pull.request.files(owner = "rails", repo = "rails", id = 572, ctx = get.github.context(), per_page = 1) returns the error: unused argument (per_page = 1).

@jennybc, would you mind pointing to where in your forked repo you have the workaround?

jennybc commented 9 years ago

@aronlindberg I haven't pushed that stuff to my fork on GitHub. It's in a truly embarrassing state.

If you look at the header of the example of listing pull requests via the API, you'll see there are indeed the usual "next" and "last" links:

https://developer.github.com/v3/pulls/#list-pull-requests

so obviously pagination happens, as of course you have already discovered.

Looking at the relevant function in this package, I see there is no way for the user to pass per_page:

https://github.com/cscheid/rgithub/blob/153bde1466252952e21c1fdb2ff0b452b9c8de99/R/pulls.R#L84-L85

Contrast that to the function for requesting a list of organization repositories, which uses the argument:

https://github.com/cscheid/rgithub/blob/153bde1466252952e21c1fdb2ff0b452b9c8de99/R/repositories.R#L35-L36

Best case, the definition of github::get.pull.request.files() just needs to be modified to pass ... correctly, and then the manual workflow outlined above is possible. Anecdotally, from my own usage, I think there are other functions that have the same problem.

aronlindberg commented 9 years ago

I tried adding params=list(...) to get.pull.request.files so that it reads:

get.pull.request.files <- function(owner, repo, id, ctx = get.github.context()) .api.get.request(ctx, c("repos", owner, repo, "pulls", id, "files"), params=list(...))

However, calling get.pull.request.files(owner = "rails", repo = "rails", id = 572, ctx = get.github.context(), page = 1) still generates unused argument (page = 1). I'm new at this - can anyone point me in the right direction for how to solve this? I looked at:

https://developer.github.com/v3/pulls/#list-pull-requests-files

and it looks like I should be able to pass the page parameter there...

jennybc commented 9 years ago

Isn't it per_page?

aronlindberg commented 9 years ago

Neither works... Shouldn't I be able to pass both of them once the ... parameter is implemented correctly?

jennybc commented 9 years ago

I'm going to try changing the def'n of get.pull.request.files() and see if I succeed…

jennybc commented 9 years ago

OK that worked. I'll make a PR to @cscheid in a moment. Here you can prove it to yourself:

library(github)

# I store my access token in an environment variable
ctx <- create.github.context(access_token = Sys.getenv("GITHUB_TOKEN"))

## using rgithub "as is"
req <- get.pull.request.files(owner = "rails", repo = "rails", id = 572)
length(req$content) # by default, 30 items are retrieved
req$headers$link # I can see 30 items per page will imply we need 3 pages

## new definition of get.pull.request.files()
## I will make a PR to cscheid/rgithub momentarily
jfun <- function(owner, repo, id, ..., ctx = get.github.context())
  github:::.api.get.request(ctx, c("repos", owner, repo, "pulls", id, "files"),
                   params=list(...))

jreq <- jfun(owner = "rails", repo = "rails", id = 572, per_page = 3)
length(jreq$content) # yep, it's honoring my request for 3 per page
jreq$headers$link # at this rate, we'll need 23 pages

aronlindberg commented 9 years ago

Cool! =) So I was using: ctx = interactive.login("client_id", "client_secret")

while you are using ctx <- create.github.context(access_token = Sys.getenv("GITHUB_TOKEN"))

Is it then that the token allows you to pass params=list(...) but not client_id/client_secret?

jennybc commented 9 years ago

No, I don't think this has anything to do with our different approaches to authentication.

When you redefined get.pull.request.files() above, the definition doesn't actually include ... in the formal arguments. So when you added params=list(...) to the call to .api.get.request(), R complains that it doesn't know what to do with an additional argument, such as per_page.

aronlindberg commented 9 years ago

Aaaah! I see, yes, it works with my authentication too! The dots were just sooooo small, so I missed them. =) Thanks!

jennybc commented 9 years ago

@aronlindberg I've made a PR now (#39). This change is reflected in my fork, in case you're in a huge hurry. You could install from my fork to get the functionality and switch back once it's merged.

aronlindberg commented 9 years ago

@jennybc - thanks, I was able to implement the same change in my own fork. Thanks for having a dialogue about it; it helps me better understand how these functions are written. Now I can start trying to construct a manual pagination approach for the script I'm writing.

jennybc commented 9 years ago

@aronlindberg Here's the function from my "private and embarrassing" branch that I haven't pushed; it's a good start on digesting link headers for the depagination operation :smile:

https://gist.github.com/jennybc/862a01dc9243118d83c9#file-digest_header_links-r

As you can see, I use some functions from plyr and dplyr, in addition to stringr, which this package already imports. This data.frame is helpful for determining how many pages there are and getting the URL for the next page.
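
For flavor, a minimal base-R sketch of the same idea (this is not the gist's actual code): turn a raw link header into a data.frame with one link per row.

parse_link_header <- function(link) {
  # link looks like: <url1>; rel="next", <url2>; rel="last"
  parts <- strsplit(link, ",\\s*")[[1]]
  data.frame(rel = sub('.*rel="([^"]+)".*', "\\1", parts),
             url = sub(".*<([^>]+)>.*", "\\1", parts),
             stringsAsFactors = FALSE)
}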

aronlindberg commented 9 years ago

Thanks @jennybc, that's helpful! It seems that since ... can now be passed, when I run get.pull.request.files(owner = "django", repo = "django", id = i, ctx = get.github.context(), per_page=1000) I can get all the results that I want. Do you know if there is an absolute limit to the per_page argument?

jennybc commented 9 years ago

I thought it was 100. Are you actually getting 1000 items in one request?

aronlindberg commented 9 years ago

Whoops. I think it's pretty clear that the limit is 100 (https://developer.github.com/guides/traversing-with-pagination/). I'll have to start digging into your manual mechanism...

aronlindberg commented 9 years ago

A rather messy example (mine) of how to implement @jennybc's function can be found here: https://gist.github.com/aronlindberg/2a9e9802579b2d239655

jennybc commented 9 years ago

Glad you are making progress @aronlindberg! I think that will work for your specific application (pull request files).

However, to build something into this package, you actually have to rely on the "next" links, rather than requesting specific pages via page = i. Here's the relevant excerpt from the official docs re pagination:

Keep in mind that you should always rely on these link relations provided to you. Don’t try to guess or construct your own URL. Some API calls, like listing commits on a repository, use pagination results that are based on SHA values, not numbers.

So, using a while loop and creeping from one page to the next looks virtually unavoidable. In the long term and in the general case, @cscheid's inclination to create a general pagination wrapper seems promising.

hadley commented 9 years ago

Just skimming, but I don't think you should need computing on the language to do this - if the response object was a bit richer, I think you should be able to do autopagination by inspecting the request metadata.

jennybc commented 9 years ago

I am now enjoying the page traversal @gaborcsardi put into gh.

https://github.com/gaborcsardi/gh/blob/master/R/pagination.R

cscheid commented 9 years ago

Oh wait, any reason we're using rgithub instead of jumping ship to gh?

EDIT: That one is a little too low-level for my taste. So we'll live with rgithub and its crumminess for now.

hadley commented 9 years ago

Maybe rgithub could rely on gh for the lowlevel stuff?

realAkhmed commented 9 years ago

Just wanted to chime in. I did create a simple paginator for rgithub along the lines of what @hadley seems to be describing (one that inspects the request results). I intended to submit it as a pull request as a proposal, but given that this discussion has already started, it may make sense to introduce it right now.

I have used it a lot over the last few months and it seems to work reliably.

It could be used as follows:

# without paginator
first_30_forks <- get.repository.forks(owner_name, repo_name)

# with paginator
all_forks <- auto.page( get.repository.forks(owner_name, repo_name) )

Here is the code of that future pull request (given that it is short, I am typing it here as a comment as well):

# Make automated paging till response is empty
auto.page <- function(f) {
  f_call <- substitute(f)
  stopifnot(is.call(f_call))

  i <- 1
  req <- list()
  result_lst <- list()

  repeat {
    # Specify the page to download
    f_call$page <- i

    req <- eval(f_call, parent.frame())

    # Last page has empty content
    if (length(req$content)<=0) break

    result_lst[[i]] <- req$content
    i <- i+1
  }

  result_req <- req
  result_req$content <- unlist(result_lst, recursive = FALSE)

  (result_req)
}

cscheid commented 9 years ago

@akhmed1, that looks great! @jennybc, @hadley: does that look sufficiently in style? I'm always wary of substitute, eval, and parent.frame, so I'd prefer to get some expert reviewers to go over it :)

realAkhmed commented 9 years ago

@cscheid great! Would love to hear the feedback! One word of caution: auto.page only works for entry points that can pass a page argument through to the API, and not all of them currently can. get.organization.teams is one example that does not. It is currently defined as:

get.organization.teams <- function(org, ctx = get.github.context())
  .api.get.request(ctx, c("orgs", org, "teams"))

However, auto.page needs the ability to pass the page argument to the API (otherwise, pagination can't work). The good news: this is a trivial change to the current function declarations, such as:

get.organization.teams2 <- function(org, ctx = get.github.context(), ...)
  .api.get.request(ctx, c("orgs", org, "teams"), params = list(...))

The latter would work perfectly with auto.page.

cscheid commented 9 years ago

@akhmed1, that's issue #40 :)

jennybc commented 9 years ago

One observation from the pagination workaround I was using with this package:

Once you deal with the #40 problem, you can make an initial request with per_page = 1, then dig the total number of objects out of the link header for the "last" page. This way there's no mystery about how many things you'll get: you can pick your own page size and make the real request(s) w/o having to grow objects or test for empty content. This may not really matter, but I liked it better this way.
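
A sketch of that count-first variant (it assumes the #40 fix is in, so page and per_page pass through, and last_page() is a hypothetical helper that pulls the page number out of the rel="last" link):

probe <- get.organization.repositories(org = "STAT545-UBC", per_page = 1)
n_items <- last_page(probe$headers$link)  # with per_page = 1, the last page number equals the item count
n_pages <- ceiling(n_items / 100)
pages <- lapply(seq_len(n_pages), function(i)
  get.organization.repositories(org = "STAT545-UBC", per_page = 100, page = i)$content)
all_content <- unlist(pages, recursive = FALSE)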

realAkhmed commented 9 years ago

@jennybc Your points are well-taken! Unfortunately, that approach did not work for me (we are using GitHub Enterprise): the "last" and "next" page links were not reliable; sometimes empty strings were returned when other pages did exist. Not sure if GitHub.com has the same issues, though.

jennybc commented 9 years ago

That's weird! No, I haven't had that problem on GitHub.com.

In Traversing with Pagination, they try to scare us away from asking for specific pages by number:

Keep in mind that you should always rely on these link relations provided to you. Don’t try to guess or construct your own URL. Some API calls, like listing commits on a repository, use pagination results that are based on SHA values, not numbers.

but ... I've never accessed an exotic endpoint w/ non-numeric pagination. Sounds like you have something that works 😃.

Kinzowa commented 7 years ago

Hello, is it just me, or does pagination not work with GET /repos/:owner/:repo/stats/contributors? It looks like it is limited to 100 items even on the repository graph page, and the argument page=2 doesn't return the expected result.

https://developer.github.com/v3/repos/statistics/

cscheid commented 7 years ago

Yes, this is an issue. As you can see, I've been unable to work on this for a long time now. I would really welcome a pull request if you have one. Sorry about that.

cscheid commented 6 years ago

Closed (wontfix) by #70.