aws-amplify / amplify-cli

The AWS Amplify CLI is a toolchain for simplifying serverless web and mobile development.
Apache License 2.0
2.81k stars 821 forks source link

Appsync api _version with Elastic Search @searchable directive (trigger fails). #3796

Closed lenarmazitov closed 4 years ago

lenarmazitov commented 4 years ago

Describe the bug When I add document with _version property to annotated @model+@searchable type of Graphql schema, trigger DbToEs fails with following error:

{
    "index": {
        "_index": "entity",
        "_type": "doc",
        "_id": "Google_123123123123123123|User",
        "status": 400,
        "error": {
            "type": "illegal_argument_exception",
            "reason": "Field [_version] is defined twice in [doc]"
        }
    }
}

Without _version property there are no errors.

ammarkarachi commented 4 years ago

@lenarmazitov Can you send us the schema we can reproduce the issue with?

lenarmazitov commented 4 years ago

@ammarkarachi

type Entity @searchable
    @model(mutations: null, subscriptions: {level: off})
    @key(fields: ["pk", "sk"])
    @auth(rules: [
        {allow: owner, ownerField: "owner", operations: [read]}
        {allow: owner, ownerField: "readers", operations: [read]}
        {allow: groups, groupsField: "groups", operations: [read]}
    ]) {
    pk: ID!
    sk: String!
    owner: ID!
    readers: [String]
    groups: [String]
    createdAt: Int 
    updatedAt: Int 
    user: Entity @connection(fields: ["owner"])
    children: [EntityItem] @connection(fields: ["pk", "sk"])
}

type EntityItem @model(queries: null, subscriptions: {level: off})
    @key(fields: ["pk", "sk"])
    @auth(rules: [
        {allow: owner, operations: [create, update, delete]},
        {allow: groups, groups: ["ALL"], operations: [read]}
    ]) {
    pk: ID! 
    sk: String!
    createdAt: AWSDateTime
    updatedAt: AWSDateTime
    owner: ID!
    user: Entity! @connection(fields: ["owner"])
}

This schema will autogenerate input types with property _version, this is ok, but if I try to insert some data to DynamoDb with _version property, the DbToEs trigger fails

lenarmazitov commented 4 years ago

Update I think this is problem with Elastic Search, I've just tried to add document with _version property and got same error. May be _version is some meta property. So currently it appears that to use @searchable directive with DeltaSync is impossible.

sacrampton commented 4 years ago

I opened an issue aws-amplify/amplify-cli#3818 which I now believe is identical to this one.

I believe that _version is some meta property of ElasticSearch and that it therefore not syncing.

Given that I've enabled Conflict Resolution with amplify update api, I have no idea how I can disable Conflict Resolution given that for now it does work with ElasticSearch and that ElasticSearch is core to our solution.

If _version is reserved in ElasticSearch, can we change the mapping in the Lambda function so that it maybe goes to version instead of _version in ElasticSearch?

ammarkarachi commented 4 years ago

@sacrampton You are right @lenarmazitov Elasticsearch already has a meta field that it uses to version documents as a work around you could exclude it from the root object.

sacrampton commented 4 years ago

@ammarkarachi - my app relies on searchable resolvers, so at this point I'm dead in the water and desperately need a work around (even if that is to back out/remove conflict detection). So can you give me specific advice on how to exclude the _version field from the root object in ElasticSearch???

sacrampton commented 4 years ago

@ammarkarachi - I really need a workaround here that allows DDB to stream to ES. Anything you can provide would be gratefully appreciated. Everything I see on the link you sent doesn't mention anything about excluding _version - talks about using external versioning which is more about how the lambda function needs to send its request.

ammarkarachi commented 4 years ago

@sacrampton I am working on it with the Appsync team I will definitely loop you in if there is a workaround we can offer

sacrampton commented 4 years ago

@ammarkarachi - thank you. Really need the work around - even if it is simply a process to temporarily remove conflict detection so I can get back to previous state where searchable worked.

ammarkarachi commented 4 years ago

@lenarmazitov @sacrampton Can you temporarily remove conflict detection? I am trying to get this resolved with Appsync

sacrampton commented 4 years ago

@ammarkarachi - I am more than happy to do that - but I can't see how to do that

kaustavghosh06 commented 4 years ago

@sacrampton In the file amplify/backend/api/<api-name>/transform.conf remove the "ResolverConfig" key and run amplify api gql-compile -> amplify push.

sacrampton commented 4 years ago

Thanks so much @kaustavghosh06 - that worked perfectly and I can now stream between DDB & ES

Look forward to seeing the fix to allow AppSync to be compatible with searchable. Good luck with it.

sacrampton commented 4 years ago

Hi @kaustavghosh06 / @ammarkarachi - can I get an update on this critical bug. It means that anyone that is using ElasticSearch cannot properly implement DataStore which is a HUGE issue and has my progress toward implementing DataStore and releasing my product into the market is dead in its tracks.

There are issues, then there are critical issues - its hard to get a more critical issue than killing ElasticSearch - so I'd be grateful if you could get me an update on this.

By removing conflict detection I could return my functionality to be non-DataStore enabled - but I really do need to be able to implement DataStore to get off-line first capability properly working.

At a minimum, if a fix to this is not imminent I suggest you have a big warning on https://aws-amplify.github.io/docs/js/datastore#conflict-resolution to tell people that this is NOT compatible with ElasticSearch and not to proceed with implementing it until it is fixed.

Thanks - appreciate any input you can provide.

undefobj commented 4 years ago

@sacrampton DataStore works with DynamoDB, not Elasticsearch. This is not a bug. We are looking at local FTS in the future but there's no timeline on this. It would be non-trivial to add another data source since the conflict detection and resolution system is built into AppSync as outlined in the documentation: https://docs.aws.amazon.com/appsync/latest/devguide/conflict-detection-and-sync.html

We can however in the meantime look at improving the workflows to the CLI to prevent (or flag a warning) you from using @searchable in your schema when you're using DataStore, as well as address the _version key conflict but there are some customers that want to use DataStore and API categories together so we wouldn't want to prevent them from that. Perhaps you could go into more details about your use case for using Full Text Search in an offline scenario and we could plan this for a future feature. Thanks.

renebrandel commented 4 years ago

@sacrampton - we've also added an additional note about this in our new documentation https://github.com/aws-amplify/docs/pull/1611

sacrampton commented 4 years ago

@undefobj - thanks for your email Richard. In this email I'll try to explain why your approach doesn't work for me, and how I got to this point. I have an app that allows people to collect asset data, and typically those places where they collect assets have intermittent network access. They generally don't have no access, and every year the quality of the network improves. In 6 years of doing this, clients have generally always been able to get on the network at least once a day, and often multiple times a day, in some of the remotest places on earth (mining sites, oil rigs, etc.). I would say even in my extreme cases clients have access to a network greater than 90% of the time, but we need to provide a solution that works 100% of the time. But its OK if the functionality in that 10% of the time is slightly lesser than when online as long as a base level of functionality is provided.

The main problem for AppSync is that you must have pre-cached a search for it to work off-line. So if I walk up to Pump P101A and search for "P101A" when offline I'll find nothing if it hasn't been searched for before I went off line - even if I've pre-cached that record. I created an issue in September last year (https://github.com/aws-amplify/amplify-js/issues/4085) which in essence said that I need to be able to some basic capability to do some basic searches of the off-line cache, not just rely on pre-cached searches. The answer for this was eventually "Data Store".

My App deals with geospatially distributed assets - so I have a capability built on ElasticSearch in the app to find all assets within a distance of where I'm standing. In fact I logged an issue related to this in Data Store (https://github.com/aws-amplify/amplify-cli/issues/3006) - so when I'm on-line (90% of the time) I can do geospatial queries. If you kill ElasticSearch my Geospatial Queries don't work.

My app deals with guys with fireproof gloves using an a phone encased in an explosion proof cover in extremely noisy environments. I want minimal key in and expect mistakes. So when I'm on-line I like to use fuzzy searching (ie. someone keys in "awger" and it returns all the "auger" records. If you kill ElasticSearch I lose this when on-line.

But the biggest problem for me is on the back-end. On the backend we make extensive use of ElasticSearch in building our web app. The native DynamoDB resolvers are just too primitive and inefficient to built our web solution around. For example, if I want to sort my records by manufacturer, model and serial number - trivial in ElasticSearch. I use dynamic authorization - in the DynamoDB resolvers all of that filtering happens in the response template due to the limited search capability of DynamoDB. Even something as simple as being able to return a total number of records that meet a query condition so I can decide whether or not I want to refine my query is trivial in ElasticSearch. I could go on and one with many scenarios for why we need ElasticSearch on the back end to deal with data captured on the mobile. Suffice to say that if ElasticSearch wasn't fulfilling a valuable function then it wouldn't exist so tightly coupled with DynamoDB. And if I use DataStore then I lose the ability to use ElasticSearch everywhere - not just mobile, but also on the web.

Seems ElasticSearch has the ability to use external versioning - but it just has to be enabled in your streaming function and then this problem is solved. Saying you can't use DataStore and ElasticSearch just doesn't work in my humble opinion.

sacrampton commented 4 years ago

@undefobj - this issue is related only to conflict detection. Can you definitively state whether DataStore will work & sync with AppSync if conflict detection is NOT enabled. If we can use DataStore without conflict detection enabled then we may still have a solution. But if conflict detection is mandatory then we are probably dead in the water with DataStore for the foreseeable future.

undefobj commented 4 years ago

@sacrampton Thanks for the detailed outline of your use case. You're correct in outlining some of the issues with caches. You have to run them in advance and they're static references, so you end up with a slew of other problems related to updates and referential integrity.

It's important to understand the functionality of DataStore. Caching solutions will run the cloud operation as you note and then pull the result set down which is what results in the impedance mismatch. Caches are not databases/stores. DataStore is an on-device repository which syncs with the cloud, meaning when you run query operations including the predicate filters it happens on the device. This means the mental model is you're just operating on the data and changes flow through the system, not thinking about local vs. remote.

You can do the sorting and filtering capabilities in DynamoDB with @key however I do acknowledge that's not built into the base sync features of DataStore yet. We're tracking this feature here and working on this now. So unless you're doing some specific Geo-search or other Fuzzy search features, you should be able to use filtering and sorting in DynamoDB and have those result sets replicated down to the clients at which point you can apply the local DataStore operations over the result set.

To do this and maintain consistency in the system we need to version objects, which also means that we need to control writes which happens through AppSync. It's explained in the following documentation and video:https://docs.amplify.aws/lib/datastore/how-it-works?platform=js Controlling these operations across different databases simultaneously is fraught with errors and you could end up with inconsistent data. This is also why Elasticsearch is a search engine and not a database/source of truth. It's great for results when you're online and pulling results down but for data syncing to your local device it's not appropriate. Just enabling a version system in Elasticsearch does not simply solve the problem as the versioning controls are not sequenced in a consistent manner. You will end up with incorrect data.

To then pull this back to your question, yes Conflict Detection and Resolution are required for DataStore which is why they were developed. Without a sync enabled endpoint there's no way for us to guarantee correctness in the system and track objects. I'm not sure why you see this as "dead in the water" as you can still continue to use your existing solutions from the past including the API category and implement your own persistence mechanism for now. I do understand you wanted to use DataStore and have the persistence automatically with Elasticsearch but that's not possible at this time. You should even be able to use API and DataStore categories in the same application, you just will need to implement offline persistence yourself for the API category results[1].

Circling back to future feature enhancements, my questions would be:

  1. Are you doing anything that needs specific Elasticsearch querying capabilities such as Geo-search or can you use @key sorting & filtering of DynamoDB?
  2. If you are leveraging specific full text search functionality, do they need to happen when the system is offline? We might be able to add this in the future but need to understand it better.
  3. If you're looking to just have the results populated back down to the client from Elasticsearch as offline persistence, would it be ok if it's "read only"? Meaning Elasticsearch results could come down to clients with a DataStore.query but you cannot update data. This might be a possible roadmap item.

[1] Another option could be out of band analytics against your Elasticsearch domain which writes back views to DynamoDB which you use for DataStore. This is a common architecture pattern. You'll need to just include the proper field information on the AppSync operations as outlined here: https://docs.amplify.aws/lib/datastore/how-it-works?platform=js#writing-data-from-the-appsync-console

undefobj commented 4 years ago

One note - I am looking into one possible solution that might allow for a write-through from DynamoDB to Elasticsearch, so the versions are controlled in DynamoDB. Then the mutations could still happen in DataStore offline and query results from Elasticsearch but we would need to translate some of the results. I don't want to totally dismiss this but it might be possible as a feature in the future.

sacrampton commented 4 years ago

@undefobj - thanks for your response Richard. Its obviously a complex issue, but lets see if I can make some headway in trying to clarify my issues. I do understand what you are saying about queries being on the DataStore with DataStore specific query commands. My issue however is that my solution is more than just mobile.

Mobile is just one part of my solution. The mobile app collects the raw data then I do all the processing and enhancement of that data on our web solution. Its really important that both the mobile and the web conform to the same schema and all access to DynamoDB from either environment is thus via AppSync - and that is working really well for me and we have it in production now.

As I said, to demonstrate this, let's forget about DataStore (mobile) for the minute and only look at our Web solution. On the web I make extensive use of ElasticSearch - essentially 100% of writes are to DynamoDB and 100% of reads are from ElasticSearch - and again this is working really well for me. DynamoDB searching is really limited whereas we've extensively customized the default search resolvers to deliver very powerful and flexible queries that are not even close to being possible on DynamoDB. All our reads and writes are through AppSync.

If we switch on conflict detection this dies because you are sending a field called version to ElasticSearch with the streaming function and this field conflicts with the ElasticSearch field of the same name. This causes the streaming function to fail and no data is streamed to ElasticSearch. The really easy fix is to simply change the streaming function between DynamoDB and ElasticSearch to tell ElasticSearch not to use its own versioning system - that it should use the DynamoDB versioning system and simply get the version number from there (https://www.elastic.co/blog/elasticsearch-versioning-support) - you wouldn't have to change anything in DataStore or any other mechanism. "To tell Elasticssearch to use external versioning, add a version_type parameter along with the version parameter in every request that changes data._ ".

If I decide to use DataStore on mobile then I'm not able to use ElasticSearch for my Web Solution (which I have already implemented on AppSync). Use of DataStore shouldn't preclude me also using ElasticSearch in other parts of my solution - that is what I mean by we are dead in the water today if we go the DataStore route.

undefobj commented 4 years ago

@sacrampton yes understood, the use of DataStore with Elasticsearch aside there is an issue with _version being a reserved key in the ES schema. I just spoke with an engineer about this and we can introduce a fix in the Python streaming function that when a syncable endpoint is enabled we strip this out of the payload sent from DynamoDB. We're going to look at making this point fix or something similar.

@brene @SwaySway @ammarkarachi let's please track this point fix so that customers can use the combined mobile/web use case, understanding that ES results do not sync down to DataStore clients.

sacrampton commented 4 years ago

@undefobj - in your separate message "...I am looking into one possible solution that might allow for a write-through from DynamoDB to Elasticsearch, so the versions are controlled in DynamoDB..."

That is precisely what I've been saying all along - it would appear that ElasticSearch allows this, provided your streaming function tells ElasticSearch this is the case. This to me should be a simple fix to this problem (I hope).

Today we hydrate the cache by doing ElasticSearch queries (ie. get all assets in a facility downloaded to the cache). It would be nice to allow ElasticSearch to hydrate a DataStore database - but that's not critical. What is critical is to simply stop the streaming lambda function crashing because of the _version field.

If you can get just get it where conflict detection is not killing ElasticSearch that will be a huge step forward.

undefobj commented 4 years ago

@undefobj - in your separate message "...I am looking into one possible solution that might allow for a write-through from DynamoDB to Elasticsearch, so the versions are controlled in DynamoDB..."

That is precisely what I've been saying all along - it would appear that ElasticSearch allows this, provided your streaming function tells ElasticSearch this is the case. This to me should be a simple fix to this problem (I hope).

Today we hydrate the cache by doing ElasticSearch queries (ie. get all assets in a facility downloaded to the cache). It would be nice to allow ElasticSearch to hydrate a DataStore database - but that's not critical. What is critical is to simply stop the streaming lambda function crashing because of the _version field.

If you can get just get it where conflict detection is not killing ElasticSearch that will be a huge step forward.

It's really not a simple fix. As noted we'll look into this in the future but this is a distributed storage system and we cannot lose customer data or have inconsistencies in the system.

sacrampton commented 4 years ago

@undefobj - didn't mean to diminish the size of the effort here - you are right, I'm not qualified to make that assessment.

Rather than strip out the _version field, would it be possible to enable external versioning in ElasticSearch when your function sends the data? That way the _version field in both DynamoDB and ElasticSearch would be the same.

If stripping out the _version is the only option at least it keeps it working - but it would be awesome if we could keep the _version the same in ES & DDB.

undefobj commented 4 years ago

Yes this might be possible, we assumed you didn't want that but if you do then we can look at making that the preferred mechanism.

sacrampton commented 4 years ago

@undefobj - just taking a moment to answer your specific questions:

Are you doing anything that needs specific Elasticsearch querying capabilities such as Geo-search or can you use @key sorting & filtering of DynamoDB?

=> The primary thing that we need on the app is Geo-search. If I'm online I can search for the ID's of records that are within a distance of where I'm standing in my app. I can then locate those ID's in DataStore for editing - and the edits on DataStore should sync back to DynamoDB.

=> I am not looking to do any ElasticSearch type queries when offline. However, if I did want to do a powerful ElasticSearch query when online I could similarly get that ID and locate that record ID in datastore for further editing.

If you are leveraging specific full text search functionality, do they need to happen when the system is offline? We might be able to add this in the future but need to understand it better.

=> I don't need this capability to happen offline. I don't have any expectation that this sort of capability - including geo queries - is available offline. Of course it would be nice if it did, but I don't expect that and can most certainly live without it.

If you're looking to just have the results populated back down to the client from Elasticsearch as offline persistence, would it be ok if it's "read only"? Meaning Elasticsearch results could come down to clients with a DataStore.query but you cannot update data. This might be a possible roadmap item.

=> I'm not overly fussed on how we populate/hydrate the DataStore cache. At the moment we hydrate our offline cache with @searchable queries and connected DynamoDB gets/lists (ie. search for all assets of a particular type at a particular facility and get connected photos, defects, classes & characteristics, documents, etc. for those searched assets.

It would be nice to hydrate DataStore from an ElasticSearch query. We never write back to ElasticSearch - it always goes back to DynamoDB and the lambda function sends it to ElasticSearch. But that is not a high priority. Hydrating the cache from ElasticSearch just gives you more flexibility to select items to go into the cache. My only priority is to stop the _version conflict stopping streaming from DynamoDB to ElasticSearch.

yuth commented 4 years ago

@sacrampton you could use the following workaround to get the ElasticSearch working with cloudformation. This would require you to edit the Elastic Search streaming lambda that populates the Elastic Search cluster from DynamoDB and update to request and response templates of search* query. The necessary changes are in this gist

You will need to create the an new zip file named ElasticSearchStreamingLambdaFunction.zip in amplify/backend/api/<api-name>/functions/ by zipping the content of python_streaming_function.py. The notable changes in the file to support version are in line 211-219, which updates the record to use external versioning.

The request template for search<Type> needs to be updated to include an additional param to ensure that the document version is returned in the response and the response template needs an update to ensure the elastic search returned version is used for _version field in the response

sacrampton commented 4 years ago

@yuth - thanks for your workaround. I started down this route too, but given how core this is I decided it was best to wait for a fix from AWS as this is critical capability for my solution and I want to make sure I go forward with the official, supported streaming function. I have no doubt that your approach works well and I thank you for sharing it.

yuth commented 4 years ago

@sacrampton Understood. This might take a couple of weeks to bake this into the CLI as this work needs to be added to our sprint.

edwardfoyle commented 4 years ago

Also note, if you go the route of implementing @yuth's solution, you will need to change the first part of the path on this line to be your ElasticSearch index name

sacrampton commented 4 years ago

I just wanted to update one of my comments (https://github.com/aws-amplify/amplify-cli/issues/3796#issuecomment-616931639) above related to caching data from ElasticSearch into DataStore. I've had a chance to consider this a little further and I'd like to outline a scenario.

Lets say my client is a Utility company that has 10 million assets spread across a country. The assets have photos, defects, characteristics, etc. - so I don't want to download 10 million assets to the cache of my device. What I do is define a work package, and say these 5,000 assets and their related objects in a specific geographic area where I'm planning to work should be downloaded to the cache. Imagine this is a suburb or a city in state that the Utility services.

Now if I'm at the border of suburbs, and I have network coverage, I can do an ElasticSearch query to find assets within (say) 1 mile of where I'm standing. Some would be in the suburb I previously cached, and some would be in the neighboring suburb that I have not cached. I am only editing records in DataStore, so what I'd ideally want is to have the items I just found in the neighboring suburb added to the cache so I can start editing them.

We do this in AppSync today - before they go to a suburb we do a query for the work package so they AppSync cache is hydrated - and any ad-hoc ElasticSearch queries out in the field are dynamically added to the local cache.

ElasticSearch is the best way to search large databases - DynamoDB is very limited - ideally these ElasticSearch searches are dynamically added to the DataStore cache. Then all edits happen on DataStore, which flows back to DynamoDB and then flows to ElasticSearch. I think this is a higher priority request when considering caching sections of large databases and needing to augment the cache seamlessly whilst in the field when network is active.

sacrampton commented 4 years ago

@edwardfoyle @yuth - is this fix available yet? Looks to be getting close.

edwardfoyle commented 4 years ago

@sacrampton The fix will be available in the next release

edwardfoyle commented 4 years ago

This fix has been released in Amplify CLI v 4.19.0. With both DataStore and @searchable enabled in your project, you'll need to run amplify api gql-compile then amplify push to have the streaming function and search resolvers updated to work with ES external versioning.

sacrampton commented 4 years ago

Hi @edwardfoyle - I have tested this and you have made some progress

I tested this with a deleted record - in which case _deleted is set to true and _ttl is added. But it fails when you try to stream a record with a _ttl field - see CloudWatch error. So at the moment we can't deal with deleted items in ElasticSearch.

            "status": 400,
            "error": {
                "type": "mapper_parsing_exception",
                "reason": "Field [_ttl] is a metadata field and cannot be added inside a document. Use the index API request parameters."
            }

Then once we get the above fixed we need to be able to query in list & search to be able to filter for _deleted and then be able to "undelete" by setting _deleted & _ttl = null

github-actions[bot] commented 3 years ago

This issue has been automatically locked since there hasn't been any recent activity after it was closed. Please open a new issue for related bugs.

Looking for a help forum? We recommend joining the Amplify Community Discord server *-help channels for those types of questions.