aws / aws-sdk-go

AWS SDK for the Go programming language.
http://aws.amazon.com/sdk-for-go/
Apache License 2.0
8.63k stars 2.07k forks source link

Filtering paginated results is awkward. #667

Closed insasho closed 7 years ago

insasho commented 8 years ago

The pagination methods are thin helpers around the underlying Input/Output types and it invokes the callback once per "page" of results rather than providing a stream of unknown length. This differs from Boto3's approach of providing a richer collection model for paginated results. The richer collection model allows many problems (especially filtering) to be handled in much fewer lines of code. aws-sdk-go could benefit from having a similar filterable-stream based API for results that can be paginated. This is useful for API calls that don't natively support filtering parameters but do support pagination (EC2 DescribeInstances does, KMS ListAliases does not, etc).

Some ideas:

  1. Using Channels (async):

    // SDK generated code
    type AliasListEntryStream struct {
     // caller reads results from this channel 
     entries chan AliasListEntry
     // client writes to this channel when it is done consuming from the channel
     done chan struct{}
    }
    func (c *KMS) ListAliasesStream(input *ListAliasesInput) AliasListEntryStream {
    ...
    }
    ...
    
    // Developer code
    aliases := ListAliasesStream(nil)
    defer aliases.done <- struct{}{}
    var match AliasListEntry
    for entry := range aliases.entries {
    if entry.AliasName == expected { match = entry; break }
    }
  2. Using Iterators: http://ewencp.org/blog/golang-iterators/ is a nice summary of options.
  3. Filter builder: Use the API metadata to build a field-aware filter API:

    listAliasesStream, err :=  kms.ListAliases(nil)
    listAliasesStream = listAliasesStream.Where(Equals(AliasListEntry.Fields.AliasName, myAliasName))
  4. Using something similar to JMES filter expressions: JMES both projects and filters and is especially powerful on the command line. Its type-looseness doesn't align well with Go, but a string-based filter language could be concise and effective.
jasdel commented 8 years ago

Thanks for contacting us @insasho with this feature request. I think adding the ability to stream paginations would be helpful to have.

I think the ideas you proposed in 1,2 would most likely flow pretty well with how the SDK would implement them. 4 takes advantage of a DSL, but like you mentioned its string based, which could cause a significant lose of type safety the other two methods would preserve. For idea 1 we'd need to figure out how to ensure ensure goroutines and channels would not be easily leaked. Because of that option 2 might be the strongest. Also thanks for the iterators link, I'll read that over

Idea 3 could actually build separate on top of 1 or 2 fairly straightforwardly I think. Providing both a streamer and an optional filter functionalities.

We're also always glad to review PRs and discuss designs for how this could be implemented.

insasho commented 8 years ago

As your team has probably already done, I've experimented with various approaches to using async channels for paginating results. I'll share some findings here.

The Golang blog post on Pipelines and Cancellation proffers the minimum requirements for correct application of channels to the creation of a pipeline. Applying that model to the KMS's ListAliases call results in something like this:

type AliasListEntryStream chan *kms.AliasListEntry

func (c *kms.KMS) ListAliasesStream(listAliasesInput *kms.ListAliasesInput, done <-chan struct{}) (AliasListEntryStream, <-chan error) {
    out := make(chan *kms.AliasListEntry)
    errc := make(chan error, 1)
    go func() {
        defer close(out)
        errc <- c.ListAliasesPages(listAliasesInput, func(p *kms.ListAliasesOutput, _ bool) bool {
            for _, v := range p.Aliases {
                v := v
                select {
                case out <- v:
                case <-done:
                    return false
                }
            }
            return true
        })
    }()
    return out, errc
}

This pattern is also compatible with the golang.org/x/net/context pattern as the done acts as a cancellation signal which would allow the generalized timeout behavior.

However, this approach (whether using context or not), requires the user to write a few more lines of code and has higher cognitive burden than the existing pagination method which may not be worth the extra typing except in complex scenarios. Example:

done := make(chan struct{})  // user must create and may close this channel
defer close(done)  // Pages method will block until done is closed, which could hold server resources
out, errc := client.ListAliasesStream(done)  // user must pass done channel to all pipeline methods
for entry := range out {
  if *entry.AliasName == "aws/aws/acm" {
    return true  // deferred method signals generated code to stop iterating
  }
}
if err := <-errc; err != nil {  // complex error handling
  panic(err)
}
return false

This pattern does allow for some cute uses of channels (such as filters and tees and methods to materialize the collection). However, KMS' ListAliases may be similar to most of the paginated methods in that the size of the result set is small enough where the streaming nature does not yield results commensurate to the maintenance burden of including these methods in the core SDK. cloudwatchlogs.GetLogEvents may be the counter-example due to it possibly returning millions of entries, but is atypical of AWS APIs in my experience.

I experimented with implementing this pipeline-style method in the SDK code generator, which revealed a few more issues:

jasdel commented 8 years ago

Thanks for information @insasho, and looking into this further. The AWS SDK for Go will use the paginators-1.json file to determine how to generate paginators for their associated service client APIs. Currently the SDK does not use the result_key but may need to if streaming pagination where to be used. Only the AWS CLI and AWS SDK for Python currently use this field. result_key is complicated because it is not required to be a single field of the response. There are a few conditions where APIs' paginators have multiple fields are grouped or merged together. In the SDK this may require generating custom shapes for these multi field result_key.

I think streamed paginators in the SDK probably would like very similar to what you described.

For the results_key fields, correct the field is the name in the context of the output type of the API operation. For input/output tokens the SDK uses a bit of reflection to search for this field when processing if the request should be paged again. To do this for generated code with the results_key the SDK would have to search the associated API's output type for the fields. This is generally available via a ShapeRef's Shape.MemberRefs field, which enumerates all fields of a struct. Context is a good pattern to follow, but the actual net.Context won't be introduced until 1.7 and if the SDK were to implement streams we'd want it to be compatible with at least 1.5 and 1.6 also.

There are several APIs where streams make a lot of sense, like the ones you mentioned, and others like dynamodb queries, and listing S3 objects.

jasdel commented 7 years ago

Hi In PR #1132 I merged a change that adds a request.Pagination type to the SDK. This type allows you to control pagination much more easily your self in a pattern similar to bufio.Scanner.