aws-amplify / amplify-category-api

The AWS Amplify CLI is a toolchain for simplifying serverless web and mobile development. This plugin provides functionality for the API category, allowing for the creation and management of GraphQL- and REST-based backends for your Amplify project.
https://docs.amplify.aws/
Apache License 2.0

Elasticsearch Lost All Data #165

Open Rafcin opened 3 years ago

Rafcin commented 3 years ago

Before opening, please confirm:

How did you install the Amplify CLI?

No response

If applicable, what version of Node.js are you using?

No response

Amplify CLI Version

5.4.0

What operating system are you using?

Ubuntu 20

Amplify Categories

api

Amplify Commands

Not applicable

Describe the bug

Quick and simple issue. It's 10pm and my production ES service created by Amplify decides to spike to red health and crash, seemingly at random. I hadn't touched anything all day on my project; the most I did was modify some CSS. ES was down for almost 30 minutes, and once it finally cleaned itself up and the 502 error went away, I logged into Kibana, which, funnily enough, told me all my streamed data from DynamoDB was gone.

I am now left with no explanation as to why ES crashed and why Kibana lost all data.

I went into CloudWatch and scoured all of AWS, and I couldn't find a problem. CPU usage spiked a bit with high traffic, but that's it.

Also, how should I go about putting all my data back? I have from now till 6am, or however long my brain lasts.

Expected behavior

ES should never crash.

Reproduction steps

NA

GraphQL schema(s)

NA

Log output

NA

Additional information

No response

lazpavel commented 3 years ago

Hello @Rafcin, really sorry to hear this.

In addition to the information above, were there any additional customizations to the ES resources beyond what Amplify generated? Was there a recent deployment?

Rafcin commented 3 years ago

There was no additional modification; I used the ES resource as it was. Sorry for not adding any more information; this occurred late at night and I was focused on writing a script to put all the data back into ES.

I'll log into the console and pull up all the logs for this issue.

A quick summary of the project, however: ever since I created the ES instances, they've been stuck at yellow health, and I have no idea why. Apart from that, I've been using ES for a year now and this is the first time this has occurred. The version it's currently set to is R20210426-P2, which should be the latest; the last major update was a month ago, I believe.

I also noticed a small CPU spike while sifting through the logs, but it seemed normal, as it does that from time to time.

Whatever information you need for this issue, let me know; I'll be online as long as needed. Hopefully no one else experiences this issue 🙏.

houmark commented 3 years ago

This exact same incident happened to us in the early hours of Monday, UTC.

We reached out to AWS Elasticsearch support, and it turns out that the t2.small.elasticsearch instance that Amplify auto-created when adding the @searchable directive is CPU limited: when the instance uses too much CPU over a period of time, it is shut down because its CPU credits have been exhausted. If you only have one node, data will be lost, although Elasticsearch does create a new node. This is at least according to the investigation done by the AWS Elasticsearch support engineer, and we believe it is the correct root cause.
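
One way to get early warning of this (not an official mitigation, just a monitoring idea) is to watch the domain's CPUCreditBalance metric in CloudWatch, which is only emitted for burstable t2/t3 instance types. A minimal boto3 sketch, with the domain name and account id as placeholders:

    import boto3
    from datetime import datetime, timedelta, timezone

    cloudwatch = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)

    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ES",              # Amazon ES/OpenSearch Service domain metrics
        MetricName="CPUCreditBalance",   # only emitted for t2/t3 data nodes
        Dimensions=[
            {"Name": "DomainName", "Value": "my-amplify-domain"},  # placeholder
            {"Name": "ClientId", "Value": "123456789012"},         # placeholder account id
        ],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=["Minimum"],
    )

    if any(p["Minimum"] < 50 for p in resp["Datapoints"]):  # threshold is arbitrary
        print("CPU credit balance is low; the node may soon be throttled or recycled")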

There are many concerns in relation to this. I intend to file a bug report with questions related to this soon, as AWS Amplify support was not really helpful in clarifying some questions we had about increasing capacity. Changing to multiple nodes and larger instances will result in some serious cost increases, and it does not seem possible to configure a smaller Elasticsearch single-node instance for a development environment and a larger multi-node instance for production. In addition, Reserved Instances could lower costs through a longer (upfront-paid) commitment, but it is also unclear whether Reserved Instances can be used with Amplify-provisioned Elasticsearch instances.

Rafcin commented 3 years ago

@houmark Thank you for mentioning this!! That worries me a lot now that my site will start gaining more traffic and users will be adding more data. I looked around, and since I unfortunately had error logs disabled, I couldn't find anything useful. I did notice the CPU spike during the crash, so that makes sense. This has me pretty worried now that I'm deployed in production; I've had 500-something users in the last day. Fine, I have a script now to reinsert the data (apparently I also could have just updated any attribute and it would have run the ddb-to-es Lambda; I was just an idiot at 10pm), but that's not exactly something I should have to monitor and fix in a deployed app.

That being said, I have full confidence that once the Amplify team takes a look at this, they will introduce a solution fairly quickly; they haven't let me down thus far.

lazpavel commented 3 years ago

Hello @Rafcin

We are investigating the issue. In the meantime, can you try the script here to recover your data?

Rafcin commented 3 years ago

@lazpavel Much appreciated!! I ended up writing a new script that updates each DB item's last-updated timestamp, which automatically fires the Lambda and reinserts the item into ES; it does this in batches of 50 at a time. On the bright side, should it happen again, I can just run the script. It's not ideal, though: I also use a custom mapping, so each time I have to go into Kibana and reinsert the mapping for geo_point and things like that.
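
Roughly, the script does something like this (a simplified sketch; the table name, key schema, and touched attribute are placeholders, since Amplify's real table names include the API id and environment):

    import time
    from datetime import datetime, timezone

    import boto3

    table = boto3.resource("dynamodb").Table("Post-abc123-prod")  # placeholder table name

    touched = 0
    start_key = None
    while True:
        kwargs = {"ProjectionExpression": "id"}  # assumes 'id' is the partition key
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = table.scan(**kwargs)
        for item in page["Items"]:
            # Any write produces a MODIFY stream record, which the @searchable
            # Lambda picks up and re-indexes into Elasticsearch.
            table.update_item(
                Key={"id": item["id"]},
                UpdateExpression="SET #t = :now",
                ExpressionAttributeNames={"#t": "_reindexedAt"},  # hypothetical attribute
                ExpressionAttributeValues={":now": datetime.now(timezone.utc).isoformat()},
            )
            touched += 1
            if touched % 50 == 0:  # batches of 50, with a crude pause between them
                time.sleep(1)
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            break

    print(f"touched {touched} items")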

Please keep me updated on a fix. I'm terrified it might happen again, but during the hours people actually use the site, and I really would like to put a fix in place to make sure it doesn't happen again 😭.

lazpavel commented 3 years ago

Can you provide us with cluster details (account number and domain name)? The engineers would need that to get more information. Please send it to us via amplify-cli@amazon.com.

Rafcin commented 3 years ago

@lazpavel I sent over the details. Also, I'm wondering: how will this issue be handled in the future?

Thinking that ES won't crash again just because it moves to a larger instance is naive, so I'm wondering whether any tools will be added to deal with an issue like this going forward. My second method of fixing the data was simple: just update all the values and the Lambda fixes the issue. But I work on my current project as a one-man team and can't always be on call to fix this when it happens, so I'm curious: will a tool be added to automatically reinsert the data should the ES instance crash?

I also have a theoretical question: say I gain a ton of users for my site and, just hypothetically, accumulate say 1TB of data. If this issue happens again, what do I do? That's 1TB of data; I can only imagine the bill for reinserting all of it.

And one more question, slightly related to this: say I plan to shift away from Dynamo and go to Aurora. I haven't checked in a bit, but can I use ES with Aurora? I don't plan to at this time, but it's just nice to know.

lazpavel commented 3 years ago

Hi @Rafcin, thank you for the information provided and for your patience.

I communicated it further to the ES team, including your concerns. I will get back to you as soon as I have updates.

Rafcin commented 3 years ago

Thank you!

lazpavel commented 3 years ago

Hello @Rafcin,

The team is still investigating the root cause. In the meantime, they advise not using T2/T3 small instances, per the best practices here: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp.html and https://aws.amazon.com/premiumsupport/knowledge-center/opensearch-node-crash/

Also, automated snapshots are taken for data backup, so the data can be recovered from them. Details can be found at https://www.elastic.co/guide/en/elasticsearch/reference/7.14/snapshot-restore.html
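
For reference, a rough sketch of restoring from those automated snapshots with Python. The endpoint and index names are placeholders; cs-automated is the repository name used for the service's automated snapshots (cs-automated-enc on encrypted domains; verify with GET /_snapshot), and this assumes SigV4 request signing via the requests-aws4auth package:

    import boto3
    import requests
    from requests_aws4auth import AWS4Auth

    host = "https://search-mydomain-abc123.us-east-1.es.amazonaws.com"  # placeholder
    creds = boto3.Session().get_credentials()
    auth = AWS4Auth(creds.access_key, creds.secret_key, "us-east-1", "es",
                    session_token=creds.token)

    # List the automated snapshots and pick the newest one.
    snaps = requests.get(f"{host}/_snapshot/cs-automated/_all", auth=auth).json()
    latest = max(snaps["snapshots"], key=lambda s: s["start_time_in_millis"])

    # Delete (or close) the affected index first; a restore cannot overwrite an open index.
    requests.delete(f"{host}/my-index", auth=auth)  # placeholder index name
    requests.post(
        f"{host}/_snapshot/cs-automated/{latest['snapshot']}/_restore",
        auth=auth,
        json={"indices": "my-index"},
    )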

parvusville commented 3 years ago

It would be really nice to have configuration settings for each environment. Upgrading to a bigger instance and having that happen in dev and staging as well feels unnecessary and expensive. It seems my small instance crashed as well, between 09/12 and 09/13. My CPU utilization rose from a maximum of 20% to 50% at crash time and for 5 days after that.

lazpavel commented 3 years ago

Hi @parvusville, it is possible to configure the instance type per environment; we are currently working on updating the documentation. Here are the steps:

  1. Run amplify env add to create a new environment (e.g. "prod")
  2. Edit the amplify/team-provider-info.json file and set ElasticSearchInstanceType to the instance type that works for your application

     {
       "dev": {
         "categories": {
           "api": {
             "<your-api-name>": {
               "ElasticSearchInstanceType": "t2.small.elasticsearch"
             }
           }
         }
       },
       "prod": {
         "categories": {
           "api": {
             "<your-api-name>": {
               "ElasticSearchInstanceType": "t2.medium.elasticsearch"
             }
           }
         }
       }
     }
  3. Deploy your changes with amplify push

You can check the Amazon OpenSearch Service instance types here.

julien-tamade commented 3 years ago

This just happened to me as well. Is there any way to prevent this in the future? I'm assuming the correct solution is to update the instance type to something larger and then find a way to backfill the records?

And what CLI version does this per-environment instance type trick work on?

Rafcin commented 3 years ago

Building on what @julien-tamade said: if I move to a larger instance, will I have to populate all my data again? Also, I was under the impression that the instance would automatically scale based on use; is this not the case?

lazpavel commented 3 years ago

Hi @Rafcin, @julien-tamade, you can change the ElasticSearchInstanceType without data loss, so you don't need to repopulate the data. The ElasticSearchInstanceType does not change automatically.

https://aws.amazon.com/premiumsupport/knowledge-center/opensearch-scale-up/

julien-tamade commented 3 years ago

@lazpavel

Thanks for the response. I tried setting the instance type in the team-provider file, but when I went to push, it detected no changes, so I pushed with --force; afterward, though, there were no changes in OpenSearch. How can I get it to actually create a new search instance?

houmark commented 3 years ago

@lazpavel Happy to see some focus on this. As mentioned, we have also had downtime and data loss recently, so this may be affecting more people than it used to.

A few questions and improvement requests in relation to this:

GeorgeBellTMH commented 3 years ago

lazpavel commented 3 years ago

Hi @julien-tamade, can you please send the following to amplify-cli@amazon.com:

  1. the output of amplify env list
  2. the contents of amplify/team-provider-info.json
  3. the output for amplify push --force

Thank you

lazpavel commented 3 years ago

Hi @houmark and @GeorgeBellTMH,

Thank you for your feedback; we are looking into it and will get back to you once we have updates.

julien-tamade commented 3 years ago

@lazpavel

Email sent, thanks!

julien-tamade commented 3 years ago

Update:

The trick of specifying different instance types in team-provider-info.json worked; I had provided an incorrect API name. Using amplify push --force worked.

AlessandroVol23 commented 3 years ago

@julien-tamade thanks for clarifying! Is the correct API name the folder name in amplify/backend/api/api-name? Is it also the one you see in the summary of amplify status?

Rafcin commented 3 years ago

@lazpavel It's happening again...

I have a production instance on a t2.medium and my dev on a t2.small. The medium seems fine, but the t2.small is falling over and has lost all its data. Are there any updates on a fix for this? You mentioned we should avoid the small instances, and as much as I'd like to, I can't afford to run two medium OpenSearch instances...

AlessandroVol23 commented 3 years ago

I've had the same issue with the t2.small several times, but I don't think a fix will come for that, since those instances are burstable. I'd use the medium instance (for €60/month 😤) everywhere the data should persist.

warrenmcquinn commented 2 years ago

This thread was invaluable in recovering from a surprise ES/OS purge. We upgraded the production instance to medium, and the backfill Python script saved the day. Thanks to all who posted comments above.

Rafcin commented 2 years ago

The docs should have a warning set up as well. I think the CLI warns you that a small instance may have performance issues, but the possibility of data loss should be added to that warning too.

Rafcin commented 2 years ago

@lazpavel Update: I switched to a medium instance, and even though I don't have many records (< 200), the Elasticsearch instance killed itself again.

@warrenmcquinn Medium is also unstable, it seems...

warrenmcquinn commented 2 years ago

@Rafcin our medium instance just lost all data too, so it doesn't seem to be a cure-all.

Rafcin commented 2 years ago

@warrenmcquinn

That's a bummer, I'm so sorry :(

I'm experimenting with solutions for this; it's driving me insane as well. I was thinking a Lambda could be set up to check every X hours whether the data is lost, although that's expensive and clunky.

The OpenSearch issue seems to stem from the last ES version AWS forked; before that, it wasn't an issue, so the bug presumably came from Elastic.

I've been thinking about this a lot. OS is a core feature of my site, and when OS goes down, my search goes down, certain pages go down, and it's a nightmare. I can't figure out a good way to set up (a) a fallback, or (b) automatic reindexing of the data.

There's no cheap, efficient solution to reindex the data. In all cases you'd need a Lambda or something that triggers when data is lost, but even then you either need to update the DB to trigger reindexing or run the reindex script, and both are expensive to run for a large dataset.

Also, as far as I'm aware, OpenSearch won't scale up as your data scales, so if you launch a product that has a lot of data, it will eventually run into the same crash issue unless you upgrade to a larger instance.

Edit: I guess you could have a Lambda check whether the data is there and then use the snapshot feature to restore it.

Rafcin commented 2 years ago

@warrenmcquinn Maybe the solution to the problem, at least until the issue is resolved, would be to add a new Lambda, created along with the searchable-transformer resources, to handle automatic snapshots and restores.

I think a viable, although most likely inefficient, temporary solution would be a Lambda that queries OS just to see whether the instance is up and has data, and if not, restores a snapshot. If the data is there and everything is working fine, it just creates a snapshot.
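
As a rough sketch of what I mean, something like this scheduled Lambda (endpoint, index, repository, and threshold are all placeholders, and request signing is left out for brevity):

    import requests

    HOST = "https://search-mydomain-abc123.us-east-1.es.amazonaws.com"  # placeholder
    INDEX = "post"           # placeholder @searchable index
    REPO = "cs-automated"    # AWS-managed automated snapshot repository
    MIN_EXPECTED_DOCS = 100  # tune to roughly match your table size

    def handler(event, context):
        count = requests.get(f"{HOST}/{INDEX}/_count").json().get("count", 0)
        if count >= MIN_EXPECTED_DOCS:
            # Healthy. Taking your own snapshots here would additionally require
            # a manually registered snapshot repository (e.g. an S3 bucket).
            return {"status": "healthy", "docs": count}

        # Data looks gone: kick off a restore of the newest automated snapshot.
        snaps = requests.get(f"{HOST}/_snapshot/{REPO}/_all").json()["snapshots"]
        latest = max(snaps, key=lambda s: s["start_time_in_millis"])
        requests.post(f"{HOST}/_snapshot/{REPO}/{latest['snapshot']}/_restore",
                      json={"indices": INDEX})
        return {"status": "restore-started", "snapshot": latest["snapshot"]}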

parvusville commented 2 years ago

Yep, lost data on medium too. :/

man517 commented 2 years ago

Any updates on a robust solution to this issue? My Elasticsearch instance just lost all of its data, and my mobile app relies heavily on Elasticsearch functioning properly. As others have mentioned earlier, having to reindex all of my data is quite inconvenient for both me and my users.

houmark commented 2 years ago

Until the AWS devs come up with a better solution, the only safe way to avoid full loss and reindexing is to upgrade your instances to c4.large.elasticsearch and increase the instance count to 2. Yes, this setup is $220+/mo for one environment, but you can at least configure Amplify to use a different instance type and count for a development environment to save on cost, accepting that the search instance can get shut down in that environment. That's expensive, but we did this months back and have had no issues since. We do feel we are overscaled with this setup, but even t2.medium.elasticsearch can get shut down due to too much CPU usage over a short period, resulting in data loss; a new instance is created automatically, but it then needs to reindex all data.
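
For reference, a per-environment override along those lines in team-provider-info.json might look like the following. Note that ElasticSearchInstanceCount is my assumption for the parameter name; verify it against the CloudFormation template Amplify generates for your searchable stack:

    {
      "dev": {
        "categories": {
          "api": {
            "<your-api-name>": {
              "ElasticSearchInstanceType": "t3.small.elasticsearch"
            }
          }
        }
      },
      "prod": {
        "categories": {
          "api": {
            "<your-api-name>": {
              "ElasticSearchInstanceType": "c4.large.elasticsearch",
              "ElasticSearchInstanceCount": 2
            }
          }
        }
      }
    }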

We've observed that some event seems to happen at random times where the instance increases CPU usage for a longer period (maybe an instance upgrade, or something else unrelated to normal search usage), and this event can trigger the excessive CPU usage that then shuts down the instance.

wrsulliv commented 2 years ago

Over the past several months I've been losing OpenSearch data. About a month ago, all my data was lost, and I did not notice until just today.

Automated snapshots only go back 14 days for my version, so I don't have a backup. Is there any way to recover this data?

Is there any recommended workaround for this? Do newer versions of OpenSearch have the same problem?

I'm considering switching to Dynamo, but I don't understand what's causing these issues with OpenSearch, and I'm skeptical of related products.

kylekirkby commented 2 years ago

@josefaidt - what is the recommended solution from the Amplify team? 🙏🏼

It's not really ideal to have your data dropped and not automatically re-indexed.

It would also be beneficial to update the Amplify docs on this issue.

Rafcin commented 2 years ago

@kylekirkby The solution is really just to upgrade your instance or use the OpenSearch/Elastic backup-and-restore feature. You don't really have many other options. Upgrading your instance and setting up clusters is the ideal solution, although it will cost you more.

DerekFei commented 2 years ago

I have had this issue since 2020, and it's unbelievable that AWS has not been able to resolve it after 2+ years. DynamoDB is pretty bad at search, and using OpenSearch is the only way out. I'm switching entirely to MongoDB Atlas as the alternative (it supports native text search); it's way cheaper to use Atlas than even two t2 OpenSearch instances for prod and dev.

DerekFei commented 2 years ago

@alharris-at Can you provide more information about the potential fix for this issue? It's making @searchable basically useless for non-enterprise projects that don't have enough traffic to justify multiple large instances. It's also hurting the adoption of DynamoDB, since most projects require some sort of search feature. I see you changed the label from P1 to P3; I was very surprised at the deprioritization of such a critical issue.

GeorgeBellTMH commented 2 years ago

So I wonder if someone could develop a lambda that did some sort of comparison to validate that all the data matches. Some ideas:

  1. Get a SHA-2 of the entire table in Dynamo and a SHA-2 of the table in OpenSearch; if they don't match, wipe and resync (may be expensive).
  2. Do the same as 1, but instead of checking the entire table, check X% of the rows and resync if something doesn't match (potentially reduces cost for big tables).
  3. Compare SHA-2 hashes of rows by index; if they don't match, resync just that row (this would be good if you know just a few items are out of sync).
  4. Force a resync at specific times or in specific cases (i.e. if you notice the tables in OpenSearch are empty when they should have content).

Essentially this would be an OpenSearch doctor lambda, repairing stuff in the background when there are problems.
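
A rough sketch of the simplest variant (idea 4, a bare count comparison; all names are placeholders, and a real check would also sample and hash rows as in ideas 1-3):

    import boto3
    import requests

    TABLE = "Post-abc123-prod"  # placeholder Amplify-generated table name
    HOST = "https://search-mydomain-abc123.us-east-1.es.amazonaws.com"  # placeholder
    INDEX = "post"              # placeholder @searchable index

    def handler(event, context):
        # Note: ItemCount is only refreshed about every six hours, so this is
        # a coarse signal, good mainly for catching a total wipe.
        ddb = boto3.client("dynamodb").describe_table(TableName=TABLE)
        ddb_count = ddb["Table"]["ItemCount"]
        os_count = requests.get(f"{HOST}/{INDEX}/_count").json().get("count", 0)

        if ddb_count > 0 and os_count == 0:
            trigger_full_resync()  # hypothetical hook: touch-update script or snapshot restore
        return {"dynamodb": ddb_count, "opensearch": os_count}

    def trigger_full_resync():
        ...  # e.g. the backfill/touch-update approach described earlier in the thread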

DerekFei commented 2 years ago

@GeorgeBellTMH I like your idea. However, it will still have availability issues during those resync processes; imagine search being constantly down for the project. People choose DynamoDB because of its availability and on-demand pricing. Now AWS is essentially telling people: hey, your search will have terrible availability, and if you want to solve that, get the largest instance and pay us $$$$$. It defeats the whole purpose of DynamoDB.

GeorgeBellTMH commented 2 years ago

Yes, I think for a full table resync you'd have to do something where you targeted a temporary table and then, when it was done, swapped over to using it. Maybe there could be a scoring system or threshold for when a full sync happens (with some obviously simple checks, like row count, or ensuring the numbers of inserts/deletes/updates match).

GeorgeBellTMH commented 2 years ago

I guess the other solution is figuring out some of the problems with the current sync and resolving them. I'm not sure whether the current implementation uses queues, or what causes the data loss; I assume it's something to do with the availability of all the services, and with recovering well when they come back online.

Rafcin commented 2 years ago

Is there really a solution, though? Even with a t3.small OpenSearch/Elasticsearch instance, OS/ES is fundamentally constrained: it simply cannot run reliably on small instances. Also, for anyone who decides to run instances like that, the usual recommendation is to have backup nodes in case something like this happens. Unfortunately, this does make adopting DynamoDB for smaller projects quite difficult. For smaller projects, it's probably worth trying something like MariaDB, or that Apache database Discord uses.

At this point, OpenSearch really only makes sense for large-scale projects, something like Airbnb. Airbnb switched over to OpenSearch, but Airbnb is Airbnb, and the number of servers they have spun up for it all but ensures they won't run into issues, although I bet it costs them a ton.

DerekFei commented 2 years ago

Couldn't agree more, @Rafcin! May the force be with Airbnb so that they don't lose all of their search results when demand surges!

hisham commented 2 years ago

Just started to use searchable in my project and agree with all the sentiments here.

  1. My "Cluster Health" is already Yellow and I barely started. I filed https://github.com/aws-amplify/amplify-category-api/issues/905 for this. AWS Supports tells me to increase node count to greater than 1 but how do I do that through Amplify?
  2. Adjusting instance size in team-provider-info.json does not trigger any outstanding update in amplify status. I have to do amplify push --force.
  3. The 3 second sync lambda timeout is way too small and causes the backfill script to not work silently. It needs to be increased to 10 seconds or so.
  4. Rebuilds of OpenSearch take a long time in case you need to increase instance size or make other configuration changes (e.g. enabling logging alone requires a rebuild and a 10-15 min wait)
  5. For a developer new to it, OpenSearch seems to need a lot of babysitting and has a learning curve associated with it (nodes, shards, instance sizes, cluster health, on-going costs, etc...). Very different from dynamodb serverless experience.

For all these reasons, I think usage of the @searchable directive should be minimized in production. The whole setup just feels too brittle.

Rafcin commented 2 years ago

@hisham, I think you hit all the big points.

This is tough: if you use DynamoDB, OpenSearch is the only simple existing solution; you don't have many options. Currently, it's recommended to use a t3.medium or above for prod. In my prod setup, I use a t3.medium with 2 nodes. The problem with t3.small is that even with a few documents you get a memory issue, the instance crashes, and then you have to reboot and repropagate the data. The problem with scaling up is cost; it's just so expensive. The big issue is that this isn't an Amplify issue; it's the OpenSearch team's issue, and to be specific, it's Elasticsearch's fault in the first place. It runs poorly because it was never intended to be used on such small instances. It works for sites like Airbnb because they have large instances set up, and so do most prominent businesses. Hopefully, the OpenSearch devs at AWS will look into this and make it more accessible for smaller devs.

Regarding the team provider file, I thought they fixed it so you could update the settings from the file, but it's probably better to do it through the console anyway; Amplify isn't going to manage anything when it comes to OpenSearch scaling. (I think...)

I don't think you can do much about the speed of upgrades and settings changes. That's just how it is. Updates take 10-20min as well.

csvalavan commented 1 year ago

I noticed an announcement from Amazon about serverless OpenSearch (https://aws.amazon.com/about-aws/whats-new/2022/11/announcing-amazon-opensearch-serverless-preview/).

Is there a plan to implement a serverless option for @searchable?

Rafcin commented 1 year ago

@csvalavan I can't speak for the Amplify team, but I brought this up and they said they're discussing it internally. There are a lot of considerations to be made in terms of pricing.

Pricing as of now is:

OpenSearch Compute Unit (OCU) - Indexing: $0.24 per OCU per hour

OpenSearch Compute Unit (OCU) - Search and Query: $0.24 per OCU per hour

Managed Storage: $0.024 per GB per month

And a standard t3 for comparison: t3.medium.search (2 vCPU, 4 GiB memory, EBS-only storage) at $0.073/hr.
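
For a rough back-of-the-envelope comparison (my math, not an official figure): one OCU running continuously is about $0.24 × 730 ≈ $175/month, while a t3.medium.search node is about $0.073 × 730 ≈ $53/month, so whether serverless comes out cheaper depends heavily on how low its minimum OCU footprint can go.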