chaostoolkit-incubator / chaostoolkit-azure

Chaos Toolkit Extension for Azure
https://chaostoolkit.org/
Apache License 2.0

Databases actions #72

Open xpdable opened 4 years ago

xpdable commented 4 years ago

We'd like to have more experiments at the database level, such as restarting an instance. We are currently using the services marked in the red block below. We've run into real-world restarts many times, especially with MySQL, and this means we have to work on weekends :-( [image]

russmiles commented 4 years ago

:( my heart goes out to you on that! Weekend loss is bad for everyone.

Will help if I can with the actions.

buderre commented 4 years ago

@xpdable Can you tell us whether you run a self-managed SQL server or use the shared database model in Azure?

Some operations can be triggered on an MS Azure database, such as pausing: https://docs.microsoft.com/en-us/rest/api/sql/databases/pause. Since the rental models differ (self-managed or pay-as-you-go), a well-designed chaos action needs to be prepared here :)

xpdable commented 4 years ago

@buderre As discussed offline, here is a brief description of a real case. Microsoft once had to force-patch a vulnerability, and our Azure Database for MySQL instances were restarted without any notification. Later, we learned that all users in Azure China Cloud were affected by this restart. Some services went down because they lost their database connections, and all end users of those services were affected. The service team raised the alarm first, and the Azure team then escalated the incident to Microsoft. There was nothing to do except wait for the restart to finish, and it took no more than one hour before the service was back to normal. The only good news was that we did not lose any data.

buderre commented 4 years ago

@xpdable This is more than sufficient information for me.

@botobako @mkaszub I also have a suggestion for turning the "database connection loss" scenario into a chaos action. In your case the server seems to restart itself. The MS Azure REST API for MySQL does not offer a "restart server" action, and I don't think we need one. Instead, let's introduce an Azure firewall rule that blocks connections to the MySQL database for a specified time span. The advantage is that the database remains untouched and we can test the same scenario in a safe way. What do you guys think?
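
To make the "block for a specified time span" idea concrete, here is a minimal Python sketch of the timing and rollback logic only. It is not an existing chaostoolkit-azure action. The names `temporary_block` and `block_for` are hypothetical, and `apply_block` / `remove_block` are placeholders for whatever call would actually create and delete the firewall rule. The point it illustrates is that the block is always lifted, even if the wait is interrupted.

```python
import time
from contextlib import contextmanager


@contextmanager
def temporary_block(apply_block, remove_block):
    """Apply a blocking rule, then always remove it when the block ends.

    apply_block / remove_block are hypothetical callables standing in for
    the real "create firewall rule" and "delete firewall rule" operations.
    """
    apply_block()
    try:
        yield
    finally:
        # Roll back even if the code inside the `with` block raised.
        remove_block()


def block_for(duration_s, apply_block, remove_block, sleep=time.sleep):
    """Keep the block in place for duration_s seconds, then lift it."""
    with temporary_block(apply_block, remove_block):
        sleep(duration_s)
```

Passing `sleep` as a parameter is only a convenience for testing the sketch without waiting. In a real action the rollback would probably also need to be registered as an experiment rollback, so that the rule is removed if the process itself dies.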

Lawouach commented 4 years ago

Just chiming in :)

I definitely like your idea of impacting the network, and I think it's quite clever to rely on the infra to do that. I wouldn't have thought of setting a firewall rule.

Otherwise, in some other areas, you can sometimes simply route the network traffic to /dev/null via an intermediate proxy. This means being able to add something to the infra, which may not be allowed by your infra team.
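
As a rough illustration of the "route to /dev/null" idea, the sketch below is a tiny TCP listener that accepts connections and silently discards everything it receives, never forwarding or answering. It uses only Python's standard `asyncio` module. The function names, the default host and the idea of pointing a client at it instead of the real database are assumptions for the example, not something described in this thread.

```python
import asyncio


async def _discard(reader, writer):
    """Read and drop incoming bytes until the client disconnects."""
    try:
        while await reader.read(4096):
            pass  # Data is thrown away: nothing is forwarded or answered.
    finally:
        writer.close()


async def start_blackhole(host="127.0.0.1", port=0):
    """Start the discarding listener; port=0 lets the OS pick a free port."""
    return await asyncio.start_server(_discard, host, port)
```

A client connected to this listener sees an open connection that never responds, which is closer to a hung database than to a refused connection. Which of the two failure modes matches the real incident is worth deciding before choosing between this and a firewall rule.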