Open xpdable opened 4 years ago
:( my heart goes out to you on that! Weekend loss is bad for everyone.
I'll help with the actions if I can.
@xpdable Can you tell us whether you run a self-managed SQL server or use the shared database model in Azure?
Some operations can be triggered on an MS Azure database, such as pausing: https://docs.microsoft.com/en-us/rest/api/sql/databases/pause. Since the hosting models differ (self-managed or pay-as-you-go), a well-designed chaos action needs to be prepared here :)
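To make the pause idea a bit more concrete, here is a minimal sketch that only builds the request URL for that operation. The path layout is my reading of the linked docs page, so please verify it there. `build_pause_url` and its parameters are hypothetical names, and `api_version` is deliberately left without a default because the right value has to come from the docs. The endpoint is a parameter as well, since sovereign clouds such as Azure China use a different management endpoint.

```python
from urllib.parse import quote

# Public-cloud Azure Resource Manager endpoint; other clouds use a different host.
DEFAULT_ARM_ENDPOINT = "https://management.azure.com"


def build_pause_url(subscription_id, resource_group, server_name,
                    database_name, api_version,
                    endpoint=DEFAULT_ARM_ENDPOINT):
    """Build the URL for the 'pause database' call (hypothetical helper).

    The path layout is assumed from the linked REST docs page; it is not
    checked against a live service here.
    """
    def enc(value):
        # Percent-encode each path segment so odd characters cannot break the URL.
        return quote(value, safe="")

    path = (
        f"/subscriptions/{enc(subscription_id)}"
        f"/resourceGroups/{enc(resource_group)}"
        "/providers/Microsoft.Sql"
        f"/servers/{enc(server_name)}"
        f"/databases/{enc(database_name)}"
        "/pause"
    )
    return f"{endpoint}{path}?api-version={enc(api_version)}"
```

A chaos action would send an authenticated POST to this URL and would need a matching resume step as its rollback.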
@buderre As discussed offline, here is a brief real-world database case.
Microsoft once had to force-patch a vulnerability, and our Azure Database for MySQL instances were restarted without any notification. Later, we learned that all users in Azure China Cloud were affected by this restart.
Some services went down because they lost their database connections, and all end users of those services were affected. The service team raised the alarm first, and the Azure team then escalated the incident to Microsoft.
There was nothing to do except wait for the restart to finish. It took no more than one hour before the service was back to normal. The only good news was that we did not lose any data.
@xpdable This is more than sufficient information for me.
@botobako @mkaszub I also have a suggestion for turning the "database connection loss" scenario into a chaos action. In your case the server seems to have restarted itself. The MS Azure REST API for MySQL does not offer a "restart server" action, and I don't think we need one. Instead, let's introduce an Azure firewall rule that blocks the connection to the MySQL database for a specified time span. The advantage is that the database remains untouched and we can test the same scenario in a safe way. What do you guys think?
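A rough sketch of the "block for a specified time span" part, only to illustrate the shape of such an action. `temporary_block`, `run_connection_loss`, `block`, and `unblock` are hypothetical names. How the block is actually applied on Azure (which firewall call, which rule) is left to the two callables and is not shown here. The point is that the rollback runs even if something fails while the connection is cut.

```python
import time
from contextlib import contextmanager


@contextmanager
def temporary_block(block, unblock):
    """Apply a block and guarantee it is rolled back afterwards.

    `block` and `unblock` are caller-supplied callables (assumptions), e.g.
    functions that change a firewall rule and then restore it.
    """
    block()
    try:
        yield
    finally:
        # Always restore connectivity, even when the body raises.
        unblock()


def run_connection_loss(block, unblock, duration_seconds, sleep=time.sleep):
    """Keep the database unreachable for `duration_seconds`, then restore it."""
    with temporary_block(block, unblock):
        sleep(duration_seconds)
```

For example, `run_connection_loss(remove_rule, restore_rule, 300)` would keep the block in place for five minutes, where `remove_rule` and `restore_rule` stand for whatever firewall change the team settles on.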
Just chiming in :)
I definitely like your idea of impacting the network, and I think it's quite clever to rely on the infra to do that. I wouldn't have thought of setting a firewall rule.
Otherwise, in some other areas, you can sometimes simply route the network traffic to /dev/null via an intermediate proxy. That means adding something to the infra, which your infra team may not allow.
We'd like to have more experiments at the database level, such as restarting an instance. We currently use the services marked in the red block below. We've encountered real-world restarts many times, especially with MySQL, and this is why we have to work on weekeeeeends :-(