Rafcin opened this issue 3 years ago
Hello @Rafcin, really sorry to hear this.
In addition to the information above, were there any additional customizations to the ES resources beyond what Amplify had generated? Was there a recent deployment?
There was no additional modification; I used the ES resource as it was. Sorry for not adding any more information; this occurred late at night and I was focused on writing a script to get all the data back into ES.
I'll log into the console and pull up all the logs for this issue.
A quick summary of the project, however: ever since I created the ES instances, they've been stuck at yellow health; why, I have no idea. Apart from that, I've been using ES for a year now and this is the first time this has occurred. The version it's currently set to is R20210426-P2, which should be the latest; the last major update was a month ago, I believe.
I also noticed a small CPU spike while sifting through the logs; however, it seemed normal, as it does that from time to time.
Whatever information you need for this issue, let me know; I'll be online as long as needed. Hopefully no one else experiences this issue 🙏.
This exact same incident happened to us in the early hours of Monday (UTC).
We reached out to AWS Elasticsearch support, and it turns out that the t2.small.elasticsearch instance that Amplify auto-created when adding the @searchable directive is CPU limited: when the instance uses too much CPU over a period of time, it is shut down because its CPU credits have been exhausted. If you only have one node, data will be lost, but Elasticsearch does create a new node. This is at least according to the investigation done by the AWS Elasticsearch support engineer, and we believe it is the right cause.
There are many concerns related to this. I intend to file a bug report with questions about it soon, as AWS Amplify support was not really helpful in clarifying some questions we had about increasing capacity. Changing to multiple nodes and larger instances will result in serious cost increases, and it does not seem possible to configure a smaller Elasticsearch single-node instance for a development environment and a larger multi-node instance for production. In addition, Reserved Instances could lower costs through a longer (upfront-paid) commitment, but it is also unclear whether Reserved Instances can be used with Amplify-provisioned Elasticsearch instances.
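For anyone who wants to check whether their own domain is burning through credits, here is a minimal sketch (the domain name and account ID are placeholders you would replace with your own) that reads the `CPUCreditBalance` metric that Amazon ES domains publish to CloudWatch:

```python
import boto3
from datetime import datetime, timedelta

# Placeholders -- substitute your own domain name and AWS account ID.
DOMAIN_NAME = "my-amplify-es-domain"
ACCOUNT_ID = "123456789012"

cloudwatch = boto3.client("cloudwatch")

# T2/T3 nodes burn CPU credits under sustained load; a balance that
# trends toward zero is the condition described above, right before
# the node gets throttled or shut down.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ES",
    MetricName="CPUCreditBalance",
    Dimensions=[
        {"Name": "DomainName", "Value": DOMAIN_NAME},
        {"Name": "ClientId", "Value": ACCOUNT_ID},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Minimum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Minimum"])
```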
@houmark Thank you for mentioning this!! That worries me a lot now that my site is starting to gain more traffic and users are adding more data. I looked around, and unfortunately I had error logs disabled and couldn't find anything useful. I did notice the CPU spike during the crash, so that makes sense. This kind of has me shitting my pants now that I'm deployed in production; I've had 500-something users in the last day. Fine, I have a script now to reinsert the data (apparently I also could have just updated any attribute and it would have run the ddb-to-es Lambda as well; I was just an idiot at 10am), but that's not exactly something I should have to monitor and fix in a deployed app.
That being said, I have full confidence that once the Amplify team takes a look at this they will introduce a solution fairly quickly; they haven't let me down thus far.
Hello @Rafcin
We are investigating the issue. In the meantime, can you give this script here a try in order to recover your data?
@lazpavel Much appreciated!! I ended up writing a new script that updates the timestamp of the last time each DB item was updated, which automatically fires the Lambda and reinserts the item into ES, in batches of 50 at a time. On the bright side, should it happen again I can just run the script; however, it's not ideal. I also use a custom mapping, so I have to go into Kibana each time and reinsert the mapping for Geopoint and things like that.
If you can, please keep me updated on a fix. I'm terrified it might happen again, this time during the hours people actually use the site, and I would really like a fix in place to make sure it doesn't happen again 😭.
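For anyone else in the same spot, here is a minimal sketch of that touch-update approach (the table name and attribute names are assumptions based on Amplify's defaults; adjust to your own schema). Touching `updatedAt` makes the DynamoDB stream fire, which re-runs the streaming Lambda and re-indexes each record:

```python
import boto3
from datetime import datetime, timezone

# Hypothetical table name -- Amplify names tables like <Model>-<apiId>-<env>.
TABLE_NAME = "Post-abc123xyz-dev"

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(TABLE_NAME)

def touch_all_items(batch_size=50):
    """Re-write updatedAt on every item so the DynamoDB stream re-fires
    and the streaming Lambda re-indexes each record into ES."""
    scan_kwargs = {"ProjectionExpression": "id"}
    touched = 0
    while True:
        page = table.scan(**scan_kwargs)
        for item in page["Items"]:
            table.update_item(
                Key={"id": item["id"]},
                UpdateExpression="SET updatedAt = :now",
                ExpressionAttributeValues={
                    ":now": datetime.now(timezone.utc).isoformat()
                },
            )
            touched += 1
            if touched % batch_size == 0:
                print(f"touched {touched} items")
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    print(f"done, touched {touched} items")

if __name__ == "__main__":
    touch_all_items()
```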
Can you provide us with the cluster details (account number and domain name)? The engineers would need those to get more information. Please send them to us via amplify-cli@amazon.com.
@lazpavel I sent over the details. I'm also wondering how this issue will be handled in the future.
With ES, for example, thinking it won't crash again just because it moves to a larger instance is naive, so I'm wondering whether any tools will be added to fix an issue like this. My second method of fixing the data was simple: just update all the values and the Lambda fixes the issue. I work on my current project as a one-man team and I can't always be on call to fix the issue when it happens, so I'm curious: will a tool be added to automatically reinsert the data should the ES instance crash?
I also have a theoretical question. Say I gain a ton of users for my site and, hypothetically, accumulate 1 TB of data. If this issue happens again, what do I do? That's 1 TB of data; I can only imagine the bill when all of it needs to be reinserted.
And one more question, slightly related to this: say I plan to shift away from Dynamo and go to Aurora. I haven't checked in a while, but can I use ES with Aurora? I don't plan to at this time, but it's nice to know.
Hi @Rafcin, thank you for the information provided and for your patience.
I communicated it further to the ES team, including your concerns. I will get back to you as soon as I have updates.
Thank you!
Hello @Rafcin,
The team is still investigating the root cause. In the meantime, they advise not using T2/T3 small instances, as per the best practices here: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp.html and https://aws.amazon.com/premiumsupport/knowledge-center/opensearch-node-crash/
Also, there are automated snapshots for data backup, which can be used to recover the data. Details can be found in https://www.elastic.co/guide/en/elasticsearch/reference/7.14/snapshot-restore.html
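As a rough illustration of the restore flow against the `cs-automated` repository that managed domains keep their automated snapshots in (the endpoint and index name here are placeholders, and in practice requests to a managed domain must be SigV4-signed or permitted by the domain access policy):

```python
import requests

# Placeholder endpoint -- use your own domain's endpoint.
ES_ENDPOINT = "https://search-my-domain.us-east-1.es.amazonaws.com"

# List the snapshots in the automated repository ("cs-automated", or
# "cs-automated-enc" on domains with encryption at rest).
snapshots = requests.get(f"{ES_ENDPOINT}/_snapshot/cs-automated/_all").json()
latest = snapshots["snapshots"][-1]["snapshot"]

# You cannot restore over an open index, so delete (or close) it first...
requests.delete(f"{ES_ENDPOINT}/my-index")

# ...then restore just that index from the latest automated snapshot.
requests.post(
    f"{ES_ENDPOINT}/_snapshot/cs-automated/{latest}/_restore",
    json={"indices": "my-index"},
)
```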
It would be really nice to have configuration settings for each environment. Upgrading to a bigger instance and having that happen in dev and staging too feels unnecessary and expensive. It seems my small instance crashed as well between 09/12 and 09/13; my CPU utilization rose from a maximum of 20% to 50% at crash time and for the 5 days after that.
Hi @parvusville, it is possible to configure the instance type per environment; we are currently working on updating the documentation. Here are the steps:
1. Run `amplify env add` to create a new environment (e.g. "prod")
2. Open the `amplify/team-provider-info.json` file and set `ElasticSearchInstanceType` to the instance type that works for your application:
```json
{
  "dev": {
    "categories": {
      "api": {
        "<your-api-name>": {
          "ElasticSearchInstanceType": "t2.small.elasticsearch"
        }
      }
    }
  },
  "prod": {
    "categories": {
      "api": {
        "<your-api-name>": {
          "ElasticSearchInstanceType": "t2.medium.elasticsearch"
        }
      }
    }
  }
}
```
3. Run `amplify push`
You can check the Amazon OpenSearch Service instance types here.
This just happened to me as well. Is there any way to prevent this in the future? I assume the correct solution is to update the instance type to something larger and then find a way to backfill the records?
And what CLI version does this per-environment instance type trick work on?
Building on what @julien-tamade said: if I move to a larger instance, will I have to repopulate all my data? Also, I was under the impression that the instance would automatically scale based on use; is this not the case?
Hi @Rafcin, @julien-tamade
You can change the `ElasticSearchInstanceType` without data loss, so you don't need to repopulate the data. The `ElasticSearchInstanceType` does not change automatically.
https://aws.amazon.com/premiumsupport/knowledge-center/opensearch-scale-up/
@lazpavel
Thanks for the response. I tried setting the instance type in the team-provider file, but when I went to push it detected no changes, so I did a push with --force; however, there are no changes in OpenSearch afterwards. How can I get it to actually create a new search instance?
@lazpavel Happy to see some focus on this. As mentioned, we have also had downtime and data loss recently, so this may be affecting more people than before.
A few questions and improvement requests in relation to this:
Changing `ElasticSearchInstanceCount` in parameters.json (according to the official docs) will affect all environments. Can this also be changed in team-provider-info.json per environment, or is it only the instance type you can change there? We are considering 1 or 2 instances for our dev environment and 2 or 3 for our production environment, and they would be different instance types, to keep costs reasonable.

Hi @julien-tamade, can you please send the following to amplify-cli@amazon.com:
- the output of `amplify env list`
- your `amplify/team-provider-info.json`
- the output of `amplify push --force`
Thank you
Hi @houmark and @GeorgeBellTMH,
Thank you for your feedback; we are looking into it and will get back to you once we have updates.
@lazpavel
email sent. thanks!
Update: the trick of specifying different instance types in team-provider-info.json worked; I had provided an incorrect API name. Using `amplify push --force` worked.
@julien-tamade thanks for clarifying! Is the correct API name the one the folder is named after in `amplify/backend/api/api-name`? Is it also the one you see in the summary of `amplify status`?
@lazpavel It's happening again...
I have a production instance on a t2.medium and my dev on a t2.small. The medium seems fine, but the t2.small is shitting the bed and has lost all its data. Are there any updates on a fix for this? You mentioned we should avoid the small instances, and as much as I would like to, I also can't afford to run 2 medium instances with OpenSearch...
I had the same issue with the t2.small several times, but I don't think a fix will come for that, since those instances are burstable. I'd use the medium instance (for 60 € / month 😤) everywhere the data should persist.
This thread was invaluable in recovering from a surprise ES/OS purge. Production instance upgraded to medium, and the backfill Python script saved the day. Thanks to all who posted comments above.
The docs should have a warning as well. I think the CLI warns you that a small instance may have performance issues, but the possibility of data loss should be added to that warning too.
@lazpavel Update: I switched to a medium instance, and even though I don't have too many records (< 200), the Elasticsearch instance killed itself again.
@warrenmcquinn Medium is also unstable, it seems...
@Rafcin our medium instance just lost all data too, so it doesn't seem to be a cure-all.
@warrenmcquinn
That's a bummer, I'm so sorry :(
I'm experimenting with solutions for this; it's driving me insane as well. I was thinking a Lambda could be set up to check whether the data is lost every X hours, although that's expensive and stupid.
The OpenSearch issue seems to have stemmed from the last, forked version of ES that AWS used. Before that it wasn't an issue, so the bug came from Elastic.
I've been thinking about this a lot. OS is a core feature of my site, and when OS goes down, my search goes down, certain pages go down, and it's a nightmare. I can't figure out a good way to set up (a) a fallback or (b) automatic reindexing of the data.
There's no cheap, efficient solution to reindex the data. In all cases you'd need a Lambda or something that triggers when data is lost, but even then you either need to update the DB to reindex or run the reindex script, and both are expensive to run on a large dataset.
Also, as far as I'm aware, OpenSearch won't scale up as your data scales, so if you launch a product with a lot of data, it will eventually run into the same crash issue unless you upgrade to a larger instance.
Edit: I guess you could have a Lambda check that the data is there and then use the snapshot feature to restore it.
@warrenmcquinn Maybe the solution, at least until the issue is resolved, would be to add a new Lambda, created alongside the searchable-transformer resources, to handle automatic snapshots and restorations.
I think a viable, though most likely inefficient, temporary solution would be a Lambda that queries OS just to see whether the instance has data and is up; if not, restore a snapshot. If the data is there and everything is working fine, create a snapshot.
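A sketch of what that could look like, for illustration (the endpoint, index, and repository names are all made up, and real requests to a managed domain would need SigV4 signing or a permissive access policy):

```python
import requests

# All names below are hypothetical placeholders.
ES_ENDPOINT = "https://search-my-domain.us-east-1.es.amazonaws.com"
INDEX = "post"
REPO = "my-manual-snapshots"  # a manually registered S3 snapshot repository

def handler(event, context):
    """Scheduled check: snapshot the index while it is healthy,
    restore the latest snapshot if the data has vanished."""
    count = requests.get(f"{ES_ENDPOINT}/{INDEX}/_count").json()["count"]

    if count > 0:
        # Healthy: take a fresh snapshot named after this invocation.
        name = f"auto-{context.aws_request_id}"
        requests.put(
            f"{ES_ENDPOINT}/_snapshot/{REPO}/{name}",
            json={"indices": INDEX},
        )
        return

    snaps = requests.get(f"{ES_ENDPOINT}/_snapshot/{REPO}/_all").json()["snapshots"]
    if snaps:
        # Empty index: delete the (empty) index, then restore the newest snapshot.
        latest = snaps[-1]["snapshot"]
        requests.delete(f"{ES_ENDPOINT}/{INDEX}")
        requests.post(
            f"{ES_ENDPOINT}/_snapshot/{REPO}/{latest}/_restore",
            json={"indices": INDEX},
        )
```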
Yep, lost data on medium too. :/
Any updates on a robust solution to this issue? My Elasticsearch instance just lost all of its data, and my mobile app relies heavily on Elasticsearch functioning properly. As others have mentioned, having to reindex all of my data is quite inconvenient for both me and my users.
Until the AWS devs come up with a better solution, the only safe way to avoid full loss and reindexing is to upgrade your instances to `c4.large.elasticsearch` and increase the instance count to 2. Yes, this setup is $220+/mo for one environment, but you can at least configure Amplify to use a different instance type and count for a development environment to save on cost, accepting that the search instance can get shut down in that environment. That's expensive, but we did this months back and have had no issues since. We do feel we are overscaled with this setup, but even `t2.medium.elasticsearch` can get shut down due to too much CPU usage over a short period, resulting in data loss and a new instance being created automatically, which then needs to reindex all data.
We've observed that some event seems to happen at random times where the instance increases CPU usage for a longer period (maybe an instance upgrade or something else unrelated to normal search usage), and this event can be the trigger for the excessive CPU usage that shuts down the instance.
Over the past several months I've been losing OpenSearch data. About a month ago, all my data was lost, and I did not notice until just today.
Automated snapshots only go back 14 days for my version, so I don't have a backup. Is there any way to recover this data?
Is there any recommended workaround for this? Do newer versions of OpenSearch have the same problem?
I'm considering switching to Dynamo, but I don't understand what's causing these issues with OpenSearch, and I'm skeptical of related products.
@josefaidt - what is the recommended solution from the Amplify team? 🙏🏼
It's not really ideal to have your data dropped and not automatically re-indexed.
It would also be beneficial to update the Amplify docs on this issue.
@kylekirkby the solution is really just to upgrade your instance or use the OpenSearch/Elastic backup-and-restore feature. You don't have many other options. Upgrading your instance and setting up clusters is the ideal solution, although it will cost you more.
I have had this issue since 2020, and it's unbelievable that AWS has not been able to resolve it after 2+ years. DynamoDB is pretty bad at search, and using OpenSearch is the only way out. I'm switching entirely to MongoDB Atlas (it supports native text search) as the alternative. (It's way cheaper to use Atlas than even 2 × t2 OpenSearch instances for prod and dev.)
@alharris-at Can you provide more information about the potential fix for this issue? It's making @searchable basically useless for non-enterprise projects, which don't have enough traffic to justify multiple large instances. It's also hurting the adoption of DynamoDB, since most projects require some sort of search feature. I see you changed the label from P1 to P3; I was very surprised by the deprioritization of such a critical issue.
So I wonder if someone could develop a Lambda that did some sort of comparison to validate that all data matches... some ideas:
Essentially this would be an OpenSearch doctor Lambda... repairing things in the background when there are problems...
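A sketch of what the comparison half of such a doctor Lambda could look like (all names here are hypothetical; the repair half could then reuse the touch-update trick from earlier in the thread on the ids it finds):

```python
import boto3
import requests

# Hypothetical names throughout.
ES_ENDPOINT = "https://search-my-domain.us-east-1.es.amazonaws.com"
INDEX = "post"
TABLE_NAME = "Post-abc123xyz-prod"

table = boto3.resource("dynamodb").Table(TABLE_NAME)

def find_missing_ids():
    """Yield ids that exist in DynamoDB but not in the search index."""
    kwargs = {"ProjectionExpression": "id"}
    while True:
        page = table.scan(**kwargs)
        ids = [item["id"] for item in page["Items"]]
        if ids:
            # _mget returns a doc stub per requested id, with found=False
            # for anything the index does not contain.
            docs = requests.post(
                f"{ES_ENDPOINT}/{INDEX}/_mget",
                json={"ids": ids},
            ).json()["docs"]
            for doc in docs:
                if not doc.get("found"):
                    yield doc["_id"]
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```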
@GeorgeBellTMH I like your idea. However, there will still be availability issues during those resync processes; imagine search constantly being down for the project. People choose DynamoDB because of the availability and on-demand pricing. Now AWS is essentially telling people: hey, your search will have terrible availability, and if you want to solve that, get the largest instance and pay us $$$$$. It defeats the whole purpose of DynamoDB.
Yes, I think for a full table resync you'd have to do something where you targeted a temporary table and then, when it was done, swapped over to using it... maybe there could be a scoring system or threshold for when a full sync happens (some obviously simple checks, like row count, or ensuring the number of inserts/deletes/updates match...)
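If "temporary table" is read as a temporary index on the search side, the swap idea maps onto index aliases, which both ES and OpenSearch support; a hedged sketch with made-up names, where readers always query the alias and never a concrete index:

```python
import requests

# Placeholder endpoint; requests would need signing against a managed domain.
ES_ENDPOINT = "https://search-my-domain.us-east-1.es.amazonaws.com"

# Rebuild into a brand-new index while readers keep using the alias...
requests.post(
    f"{ES_ENDPOINT}/_reindex",
    json={"source": {"index": "post-v1"}, "dest": {"index": "post-v2"}},
)

# ...then atomically repoint the alias, so there is no read downtime.
requests.post(
    f"{ES_ENDPOINT}/_aliases",
    json={"actions": [
        {"remove": {"index": "post-v1", "alias": "post"}},
        {"add": {"index": "post-v2", "alias": "post"}},
    ]},
)
```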
I guess the other solution is figuring out some of the problems with the current sync and resolving them... I'm not sure whether the current implementation uses queues, or what causes the data loss... I assume it's something to do with the availability of all the services, and recovering well when they come back online...
Is there really a solution, though? Even if you're using a t3.small OpenSearch/Elasticsearch instance, OS/ES is fundamentally flawed at that scale; they simply cannot run on small instances. Also, for anyone who decides to run instances like that, it's normally recommended to have backup nodes in case s*** like this happens. Unfortunately, this does make adopting DynamoDB for smaller projects quite difficult. For smaller projects, it's probably worth trying something like MariaDB or that Apache database Discord uses.
At this point, OpenSearch really only makes sense for large-scale projects, something like Airbnb. Airbnb switched over to OpenSearch, but at the same time Airbnb is Airbnb, and the number of servers they have spun up for it definitely ensures they won't run into issues, although I bet it costs them a s*** ton.
Can't agree more @Rafcin! May the force be with Airbnb so that they won't lose all of their search results when demand surges!
Just started to use searchable in my project and agree with all the sentiments here.
For all these reasons I think usage of the `@searchable` directive should be minimized in production. The whole setup just feels too brittle.
@hisham, I think you hit all the big points.
This is tough. If you use DynamoDB, OpenSearch is the only simple existing solution; you don't have many options. Currently, it's recommended to use a `t3.medium` or above for prod. In my prod setup, I use a `t3.medium` with 2 nodes. The problem with `t3.small` is that, even with a few documents, you get a memory issue, the instance crashes, and then you have to reboot and repropagate the data. The problem with scaling up is cost; it's just so damn expensive. The big issue is that this isn't an Amplify issue; it's the OpenSearch team's issue, and if we want to be specific, it's Elasticsearch's fault in the first place. It runs like shit because they never intended it to be used on such small instances. It works on sites like Airbnb because they have those large instances set up, and so do most prominent businesses. Hopefully the OpenSearch devs at AWS will look into this and make it more accessible for smaller devs.
Regarding the team provider file, I thought they fixed it so you could update the settings from the file, but it's probably better to do it through the console anyway; Amplify isn't going to manage anything when it comes to OpenSearch scaling. (I think...)
I don't think you can do much about the speed of upgrades and settings changes. That's just how it is. Updates take 10-20min as well.
I noticed an announcement from Amazon about serverless OpenSearch (https://aws.amazon.com/about-aws/whats-new/2022/11/announcing-amazon-opensearch-serverless-preview/).
Is there a plan to implement a serverless option for @searchable?
@csvalavan I can't speak for the Amplify team, but I brought this up and they said they're discussing it internally. There are a lot of considerations to be made in terms of pricing.
Pricing as of now is:
- OpenSearch Compute Unit (OCU), Indexing: $0.24 per OCU per hour
- OpenSearch Compute Unit (OCU), Search and Query: $0.24 per OCU per hour
- Managed Storage: $0.024 per GB / month

And a standard t3 for comparison: t3.medium.search (2 vCPU, 4 GiB memory, EBS only) at $0.073/hr.
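To put those numbers side by side: one OCU running continuously is $0.24 × 24 × 30 ≈ $173/month, so even a hypothetical floor of a single indexing OCU plus a single search OCU lands around $346/month before storage, versus roughly $0.073 × 24 × 30 ≈ $53/month for a single t3.medium.search node.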
Before opening, please confirm:
How did you install the Amplify CLI?
No response
If applicable, what version of Node.js are you using?
No response
Amplify CLI Version
5.4.0
What operating system are you using?
Ubuntu 20
Amplify Categories
api
Amplify Commands
Not applicable
Describe the bug
Quick and simple issue. It's 10pm, and my production ES service created by Amplify decides it wants to spike to red health and crashes randomly. I hadn't touched anything all day on my project; the most I did was modify some CSS. ES takes a shit for almost 30 minutes, and finally, once it cleaned itself up and the 502 error went away, I log into Kibana, and funnily enough Kibana tells me all my streamed data from DynamoDB is gone.
I am now left with no explanation as to why ES crashed and why Kibana lost all data.
I went into CloudWatch and scoured all of AWS, and I couldn't find a problem. The CPU usage spiked a bit with high traffic, but that's it.
Also, how should I go about putting all my data back? I have from now until 6am, or however long my brain holds out.
Expected behavior
ES should never crash.
Reproduction steps
NA
GraphQL schema(s)
NA
Log output
NA
Additional information
No response