aws-amplify / amplify-cli

The AWS Amplify CLI is a toolchain for simplifying serverless web and mobile development.
Apache License 2.0

RFC: `@searchable` directive Enhancements #7546

Closed cjihrig closed 1 year ago

cjihrig commented 3 years ago

This is a Request For Comments (RFC). RFCs are intended to elicit feedback regarding a proposed change to the Amplify Framework. Please feel free to post comments or questions here.

This document outlines a number of new features and improvements to the @searchable directive in the Amplify CLI. The goal is to address outstanding enhancement requests and bug reports.

Proposal 1: Backfilling improvements

Exposing the streaming function:

Parameterize BatchSize and MaximumBatchingWindowInSeconds

Github Issues

The EventSourceMapping of the Lambda function that streams data from the DynamoDB table to AWS ElasticSearch currently assumes a batch size of one. This lets the Lambda update documents in AWS ElasticSearch as soon as possible, but it is slow when migrating a large amount of data. Moving forward, BatchSize and MaximumBatchingWindowInSeconds will be parameterized so they can be overridden as desired. The default values will remain the same as the current values.

        "Searchable<ModelName>LambdaMapping": {
            "Type": "AWS::Lambda::EventSourceMapping",
            "Properties": {
             +  "BatchSize": {
             +      "Ref": "ElasticSearchStreamBatchSize" 
             +   },
             +  "MaximumBatchingWindowInSeconds": { 
             +      "Ref": "ElasticSearchStreamMaximumBatchingWindowInSeconds"
             +   },
                "Enabled": true,
                "EventSourceArn": {
                    "Fn::ImportValue": {
                        "Fn::Join": [
                            ":",
                            [
                                {
                                    "Ref": "AppSyncApiId"
                                },
                                "GetAtt",
                                "<ModelName>Table",
                                "StreamArn"
                            ]
                        ]
                    }
                },
                "FunctionName": {
                    "Fn::GetAtt": [
                        "ElasticSearchStreamingLambdaFunction",
                        "Arn"
                    ]
                },
                "StartingPosition": "LATEST"
            },
            "DependsOn": [
                "ElasticSearchStreamingLambdaFunction"
            ]
        },
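The two new template parameters could be declared alongside the existing stack parameters. A hypothetical sketch is shown below; the parameter names come from the snippet above, the BatchSize default of 1 matches the current behavior described earlier, and the batching window default of 0 seconds matches the Lambda service default:

```json
"Parameters": {
    "ElasticSearchStreamBatchSize": {
        "Type": "Number",
        "Default": "1",
        "Description": "Batch size for the DynamoDB-to-ElasticSearch streaming Lambda"
    },
    "ElasticSearchStreamMaximumBatchingWindowInSeconds": {
        "Type": "Number",
        "Default": "0",
        "Description": "Maximum batching window in seconds for the streaming Lambda"
    }
}
```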

Proposal 2: Querying enhancements

Sorting & Pagination

Github Issues

In the current schema generation, the Amplify CLI generates the sort input as an object, which only supports sorting on a single field. It is currently not possible to add a second 'tie-breaker' field, which makes sorting unreliable when the sort field is not unique.

To overcome this limitation, this RFC proposes using an array for sorting. The nextToken field can remain a string that is encoded and decoded to an array as needed. The following example demonstrates how this will work. This is a breaking change.

type Query {
  searchTodos(
    filter: SearchableTodoFilterInput,
    sort: [SearchableTodoSortInput],
    limit: Int,
    nextToken: String,
    from: Int
  ): SearchableTodoConnection
}

input SearchableTodoSortInput {
  field: SearchableTodoSortableFields
  direction: SearchableSortDirection
}

type SearchableTodoConnection {
  items: [Todo]
  nextToken: String
  total: Int
}

An example query is shown below:

query MyQuery {
  searchTodos(
    limit: 10
    sort: [
      {direction: desc, field: name}
      {direction: desc, field: id}
    ]
    nextToken: "YWJjfDEyMzQ="
  ) {
    items {
      name
      id
    }
    nextToken
  }
}

The corresponding AWS ElasticSearch DSL query is shown below. Note that the nextToken string is decoded into the array of sort values that search_after expects:

{
  "size": 10,
  "query": {
    ...
  },
  "search_after": ["abc", "1234"],
  "sort": [
    { "name": "desc" },
    { "id": "desc" }
  ]
}
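As a sketch of how the opaque nextToken could round-trip to a search_after array, here is a hypothetical encoding that joins the sort values with a delimiter and base64-encodes them (the token in the example above decodes exactly this way; the function names are illustrative, not the CLI's actual implementation):

```python
import base64

def encode_next_token(sort_values):
    # Join the sort values of the last returned hit and base64-encode
    # them into an opaque cursor string for the client.
    return base64.b64encode("|".join(str(v) for v in sort_values).encode()).decode()

def decode_next_token(token):
    # Decode the cursor back into the list of values for search_after.
    return base64.b64decode(token).decode().split("|")

token = encode_next_token(["abc", 1234])
print(token)                     # YWJjfDEyMzQ=
print(decode_next_token(token))  # ['abc', '1234']
```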

Aggregates

Github Issues

Search queries will also include an aggregates input to run aggregations on the data. Each supported aggregation will require different GraphQL types in order to accommodate various aggregation output signatures. Initially, only term, avg, max, min, and sum will be supported, but other aggregations can be added in the future. An example searchable schema for term aggregation will look like this:

enum SearchableAggregationType {
  TERM
  AVG
  MAX
  MIN
  SUM
}

type Query {
  searchTodos(
    filter: SearchableTodoFilterInput, 
    sort: [SearchableTodoSortInput],
    aggregates: [SearchableToDoAggregationInput], 
    limit: Int, 
    nextToken: String,
    from: Int
  ): SearchableTodoConnection
}

type SearchableTodoConnection {
  items: [Todo]
  nextToken: String
  total: Int
  aggregateItems: [SearchableToDoAggregateBucket]
}

type SearchableToDoAggregateBucket {
  aggregateName: String!
  aggregates: [SearchableToDoAggregateResult]
  nextToken: String
}

union SearchableToDoAggregateResult = SearchableToDoScalarAggregate | SearchableToDoTermAggregate

type SearchableToDoScalarAggregate {
  value: Float!
}

type SearchableToDoTermAggregate {
  key: String!
  count: Int!
}

input SearchableToDoAggregationInput {
  aggregateName: String!
  aggregateType: SearchableAggregationType!
  fieldName: String!
  limit: Int
  missingInt: Int
  missingFloat: Float
}

An example search query is shown below:

query MyQuery {
  searchTodos(
    limit: 1
    sort: [
      {direction: desc, field: name}
      {direction: desc, field: id}
    ]
    aggregates: [
      { aggregateName: "nameAgg", aggregateType: TERM, fieldName: "name" }
    ]
    nextToken: "YWJjfDEyMzQ="
  ) {
    items {
      name
      id
    }
    aggregateItems {
      aggregateName
      aggregates {
        ... on SearchableToDoTermAggregate {
          key
          count
        }
      }
    }
    nextToken
  }
}

An example query response is shown below:

{
  items: [
    ...
  ],
  nextToken: "YWJjfDEyMzQ=",
  aggregateItems: [
    {
      aggregateName: "nameAgg",
      aggregates: [
        {
          key: "Get Milk",
          count: 4
        },
        {
          key: "take out trash",
          count: 7
        }
      ]
    }
  ]
}
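To illustrate how the aggregates input could translate to the ElasticSearch DSL, here is a hypothetical resolver-side sketch. The type_map keys mirror the SearchableAggregationType enum, and terms, avg, max, min, and sum are the corresponding ElasticSearch aggregation names; the function name and dict-based input shape are assumptions for illustration:

```python
def build_aggs(aggregates):
    # Map each GraphQL aggregation type to its ElasticSearch aggregation name.
    type_map = {"TERM": "terms", "AVG": "avg", "MAX": "max", "MIN": "min", "SUM": "sum"}
    aggs = {}
    for agg in aggregates:
        body = {"field": agg["fieldName"]}
        if agg["aggregateType"] == "TERM" and agg.get("limit") is not None:
            body["size"] = agg["limit"]  # terms aggregations support a bucket limit
        aggs[agg["aggregateName"]] = {type_map[agg["aggregateType"]]: body}
    return aggs

print(build_aggs([
    {"aggregateName": "nameAgg", "aggregateType": "TERM", "fieldName": "name"}
]))
# {'nameAgg': {'terms': {'field': 'name'}}}
```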

Proposal 3: Upgrade AWS ElasticSearch version to 7.10 for new projects

Github Issues

The AWS ElasticSearch version will be set to 7.10 for all new Amplify projects.

Proposal 4: AWS ElasticSearch field updates

Support for GraphQL enums

Github Issues

The Amplify CLI does not currently support searching on enum types. Going forward, these fields will be written as strings and searched as keyword fields.

All AppSync scalar types are streamed to AWS ElasticSearch

Github Issues

There are some gaps that need to be closed to complete support for AppSync scalar types. Search support will be added for the following field types.

Date/Time Types

AWSDate AWSTime AWSDateTime

The searchable input will use strings, but will support range queries using the string formats defined by ISO 8601. The field will be dynamically mapped as a date type.

input ModelDateTimeInput {
  ne: String
  eq: String
  le: String
  lt: String
  ge: String
  gt: String
  between: [String]
  attributeExists: Boolean
  attributeType: ModelAttributeTypes
}
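As a sketch, a resolver could translate this input into an ElasticSearch range query, renaming the ge/le keys to the gte/lte keys that ElasticSearch expects (the function name and the createdAt field are illustrative assumptions):

```python
def date_range_clause(field, flt):
    # GraphQL comparison keys -> ElasticSearch range query keys.
    key_map = {"gt": "gt", "ge": "gte", "lt": "lt", "le": "lte"}
    rng = {key_map[k]: v for k, v in flt.items() if k in key_map}
    if "between" in flt:
        # between: [lower, upper] becomes an inclusive range.
        rng["gte"], rng["lte"] = flt["between"]
    return {"range": {field: rng}}

print(date_range_clause("createdAt", {"ge": "2021-01-01T00:00:00Z", "lt": "2021-06-01T00:00:00Z"}))
# {'range': {'createdAt': {'gte': '2021-01-01T00:00:00Z', 'lt': '2021-06-01T00:00:00Z'}}}
```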

String Types

AWSEmail AWSPhone AWSURL

Since these data types are typed as keyword fields, the generated GraphQL input will be identical to the String input.

input ModelStringInput {
  ne: String
  eq: String
  le: String
  lt: String
  ge: String
  gt: String
  contains: String
  notContains: String
  between: [String]
  beginsWith: String
  attributeExists: Boolean
  attributeType: ModelAttributeTypes
  size: ModelSizeInput
}

Numeric Types

AWSTimestamp

AWSTimestamp values will be mapped to AWS ElasticSearch dates representing the number of seconds since the epoch.

input ModelIntInput {
  ne: Int
  eq: Int
  le: Int
  lt: Int
  ge: Int
  gt: Int
  between: [Int]
  attributeExists: Boolean
  attributeType: ModelAttributeTypes
}
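For example, an explicit mapping for a hypothetical createdAt AWSTimestamp field could use ElasticSearch's epoch_second date format, which interprets numeric values as seconds since the epoch:

```json
{
  "properties": {
    "createdAt": { "type": "date", "format": "epoch_second" }
  }
}
```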

IP Type

AWSIPAddress

AWSIPAddress inputs will be converted to IP types in AWS ElasticSearch.

input ModelAWSIPAddressInput {
  ne: String
  eq: String
  le: String
  lt: String
  ge: String
  gt: String
  between: [String]
  attributeExists: Boolean
  attributeType: ModelAttributeTypes
}
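A corresponding explicit mapping for a hypothetical ipAddress field would use the ElasticSearch ip type, which supports both IPv4 and IPv6 values as well as range queries:

```json
{
  "properties": {
    "ipAddress": { "type": "ip" }
  }
}
```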

Other Types

AWSJson

These fields will be treated as text data in AWS ElasticSearch and strings in GraphQL.

input ModelStringInput {
  ne: String
  eq: String
  le: String
  lt: String
  ge: String
  gt: String
  contains: String
  notContains: String
  between: [String]
  beginsWith: String
  attributeExists: Boolean
  attributeType: ModelAttributeTypes
  size: ModelSizeInput
}

Proposal 5: Custom mapping support

In some cases it is desirable to customize the way GraphQL fields are mapped to AWS ElasticSearch. For example, sensitive information may need to be redacted, or a field may need to be mapped as a different data type. This RFC proposes a field-level directive to modify the mapping behavior, as shown below. In this example, the name field would not be indexed by AWS ElasticSearch, while the email field would (redundantly in this case) be mapped as a string.

type Post @searchable {
  name: String @searchableField(index: false)
  email: AWSEmail @searchableField(type: string)
  address: String
}

Note: The name @searchableField and its API are still under discussion and subject to change.

Proposal 6: Local mocking

Github Issues

There are three primary hurdles to mocking @searchable:

  1. Ensuring that the customer has ElasticSearch running locally. The mock functionality will require a locally running ElasticSearch instance. The port number will be configurable, defaulting to 9200. If ElasticSearch is unreachable, an error will be displayed to the customer. The mock server will create a new index on startup and remove the index on shutdown.
  2. Updating ElasticSearch to reflect write operations in the GraphQL API. The GraphQL mock server will forward all write operations (mutations) to the local streaming Lambda function. DynamoDB Local supports streaming; however, there is no way to trigger a local Lambda function from a stream, so a locally running component is needed to intercept the streamed DDB data and invoke the Lambda. This component will be invisible to customers. During mocking, the Lambda function will be configured so that its writes to ElasticSearch go to the customer's local ElasticSearch instance.
  3. Querying data from ElasticSearch. The GraphQL mock server will forward search operations to the customer's local ElasticSearch instance. Additional VTL functions, such as $util.transform.toElasticsearchQueryDSL(), will need to be implemented as well.
sacrampton commented 3 years ago

Proposal 1 - Parameterize BatchSize and MaximumBatchingWindowInSeconds

At the moment I have to manually edit these values every time I push an update, so anything that parameterizes them is a welcome change.

Proposal 2 - Querying enhancements

A few years back I implemented my own solution to this - sharing in case it's helpful. I needed to modify the nextToken handling to use search_after, and I also made sort an array. In addition, I put in a string type so that the resolver could sort keyword or other data types on a field-by-field basis.

I also implement dynamic authorization within the request resolver rather than relying on the dynamic auth in @auth.

type Query {
  searchAssetsX(
    searchAfter: [String],
    limit: Int,
    sort: [SearchableAssetSortInput],
    from: Int
  ): SearchableAssetConnection
}

input SearchableAssetSortInput {
  field: SearchableAssetSortableFields
  direction: SearchableSortDirection
  string: SearchableStringSort
}

enum SearchableStringSort {
  keyword
  none
}

Proposal 3 - Upgrade AWS ElasticSearch version to 7.10 for new projects

What do we do with existing projects? Will they be upgraded to 7.10 automatically, or should we do that from the ElasticSearch console for existing projects?

Proposal 4: AWS ElasticSearch field updates

Not sure this is the right place to put it, but a major gap is that the schema is not enforced in ElasticSearch. For example, the default field _lastChangedAt on a conflict-detection-enabled database is an Int in the schema/DynamoDB but gets created as a Float in ElasticSearch. Another example: if a String field in the schema has a value of "1.234" in the first record we sync, it is created as a Float, so a second record with the string "one point two three four" fails because it is of the wrong type.

The index mapping needs to be explicitly defined in ElasticSearch, not inferred.

PUT /asset/_mapping/doc
{
    "properties": {
        "_lastChangedAt": {"type": "integer"}
    }
}
yaquawa commented 3 years ago

Hi, thank you for creating this RFC. I thought the project was nearly dead, so I'm happy to see progress on this.

Please consider adding this one to the RFC: https://github.com/aws-amplify/amplify-cli/issues/5121#issuecomment-675568032

Currently, the value of the total aggregation is definitely useless.

mgarabedian commented 3 years ago

I am happy there is work continuing on searchable. I haven't really been able to use it because I cannot mock with it, so hopefully that will be addressed in some fashion. That being said, hopefully there could be some warnings if you accidentally comment out @searchable from the schema and deploy... which I have sadly done more than once.

yaquawa commented 3 years ago

What other mapping parameters are you going to support in @searchable? I'd like to see parameters such as boost, because currently you can't specify this in the query.

houmark commented 3 years ago

I feel like the count of total items returned by a search query should also be part of this RFC. It's a very large limitation and is constantly requested.

PeteDuncanson commented 3 years ago

Proposal 1: Backfilling - the title sounds like it's going to allow automatically populating the index when @searchable is added to an existing schema with data already in it, but the text makes no mention of that. Can you confirm? The current method for doing this is very manual and easily overlooked. I would love an option to reindex everything when @searchable is added via the CLI.

Proposal 6: Mocking - Allow for an opt-out of @searchable when mocking. When mocking locally I don't always need the ES stuff; often I'm working on a Lambda, which is where I use mocking the most. If I've got @searchable in my schema, would I be required to have a local ES setup to mock at all? The way @searchable currently won't let you mock at all without commenting it out of your schema prevents speedy local mocking. I'd rather there was a way to say "ignore @searchable stuff locally" and still mock the rest. This is what was requested originally in https://github.com/aws-amplify/amplify-cli/issues/5981. I think getting @searchable mocking working is a very good addition; I just want a way to skip it when I don't need it, for development speed, rather than yet another thing to keep updated and learn about.

sacrampton commented 3 years ago

Hi Everyone - I thought I'd share this comment on another related issue showing usage of custom searchable resolvers:

aws-amplify/amplify-category-api#405

majirosstefan commented 2 years ago

It would be nice if @searchable and the Amplify CLI allowed devs to use external search services such as Algolia, because of AWS Elastic/OpenSearch pricing (https://github.com/aws-amplify/amplify-cli/issues/3860, closed for some reason) and its non-serverless nature.

A very high-level overview of what I mean:

E.g. during amplify init or amplify update api, the CLI would ask if we want to edit search logic manually - if yes, it would give you a couple of options:

  1. Ignore @searchable during mocking on the local machine
  2. Use @searchable during mocking on the local machine (requires a local ES/OpenSearch setup)
  3. Use your own search service (2 empty Lambda functions will be generated)
  4. Override/specify the ElasticSearch name/resource that should be used for the current env (e.g. by using an ARN?)

Especially 3 is important for me.

  1. When chosen, it would go through all models annotated with @searchable.
  2. For each model, it would:
     a. generate a new Lambda function attached as a DynamoDB trigger for the given model's table, called on update and create operations in the corresponding table
     b. generate a second empty Lambda function that would be executed on the search query

These two Lambda functions would be created by running the same flow as 'amplify add function'.

I am already doing this manually, but if @searchable were implemented like this, it would require less effort and time.

So if @searchable finally added support for non-AWS services in general and finally fixed local mocking (which should have been fixed on the day @searchable was introduced), it would be nice, I guess.

BTW, there are some forks that modify searchable with custom logic to target non-AWS endpoints: https://github.com/starpebble/amplify-cli/wiki

bcs-gbl commented 1 year ago

Elastic/Open search pricing: https://github.com/aws-amplify/amplify-cli/issues/3860 (closed for some reason) and its non-serverless nature.

Just a side note: a serverless OpenSearch service is coming (it's now in preview).

jeffking54 commented 1 year ago

Is serverless OpenSearch support coming any time soon?

majirosstefan commented 1 year ago

It's already available, I think: https://aws.amazon.com/opensearch-service/pricing/#Amazon_OpenSearch_Serverless

But:

You will be billed for a minimum of 4 OCUs (2x indexing includes primary and standby, and 2x search includes one replica for HA) for the first collection in an account.

Here are more opinions on this: https://www.reddit.com/r/aws/comments/zh2z4o/serverless_opensearch_seems_like_a_huge_deal_but/
