Save-system needs changes to prevent corruptions

johnkslg commented 4 years ago

Spent some time investigating how the save corruption could happen as it did on the playtest server in physics, and I was able to repro problems pretty reliably:

Open playtest world https://download.godsofwar.de/eco/dbexceptionbefore.zip
Fly to zoning office at (685,181)
Type /claim so you can edit it
Open the second district map draft, click submit
You'll see it go to election. Immediately close the server by clicking X in the console window
Run the server again and reconnect with the client. Usually youll find the election did not save. Sometimes youll find the election did save, but the district it refers to is missing.

I believe what's happening is that the server is being halted after 5 seconds but not completing the full save, allowing some DBs to save, but not others.

I looked into how the save could happen non-atomically. The storages are each persisted in a loop:

That persist call performs the serialization and queues the data, and triggers a worker thread to add it to the zip:

Problem is this zip-writing will start happening while the other storages are still being persisted, and when the program gets halted it may have ended up saving some but not all storages, resulting in a non-atomic save.

Some things that could combat this:

Dont trigger the worker thread to start writing to the zip until all pendingStorages's in PersistenceManager have been serialized and queued. This will give a better atomicity to the operation (including in cases of non-graceful shutdowns).

We also need to prevent the 5 second delay that is halting in the middle of save operations. I looked into that and found a couple promising things: From https://docs.microsoft.com/en-us/aspnet/core/fundamentals/host/hosted-services?view=aspnetcore-3.1&tabs=visual-studio


Any remaining background operations that the app is performing should be aborted.
Any methods called in StopAsync should return promptly.
However, tasks **aren't abandoned** after cancellation is requested—the caller awaits all tasks to complete.

This differs from what I was seeing, where tasks are indeed abandoned. Perhaps something specific we're doing with the cancellation token? I couldn't see anything that would force-quit once that cancellation token was triggered, but I dont know all the workings so its likely there is something there. If we can't find a way to make that extend past 5 seconds, we can change the server app to a different form that allows proper shutdown handling (console apps maybe are hardcoded to force it? But plenty of apps allow you to save on quit, and it can take much longer than 5 seconds).

This corruption happened when the playtest server did its normal daily restart at 3am, attempting to shutdown gracefully and restart, and coming back with a corruption. It manifests on the playtest server because the save takes a long time, allowing these race conditions to show up (it had thousands of objects in the pending saves, possibly objects using fuel or workorders that trigger lots of changes or possibly a bug).

With these two changes:

Improving atomicity of saving
Fixing graceful shutdown to allow save completion

we should all but remove save corruptions, even in the case of non-graceful shutdowns.

mirasrael commented 4 years ago

@johnkslg

If we can't find a way to make that extend past 5 seconds, we can change the server app to a different form that allows proper shutdown handling (console apps maybe are hardcoded to force it? But plenty of apps allow you to save on quit, and it can take much longer than 5 seconds).

Yes it is for Console apps. If we want to handle it as Windows App then you shouldn't have the console and handle Close button click. In this case you have much more time and can even ignore WM_CLOSE event. For official severs we have graceful shutdown with Ctrl + C which not limited to 5 seconds, but probably timeout should be extended if save takes too long (@D3nnis3n).

This differs from what I was seeing, where tasks are indeed abandoned. Perhaps something specific we're doing with the cancellation token?

No, tasks are not abandoned. They waits until completion. You can check this easily if you make graceful shutdown (with Ctrl + C) or "exit". But if you force quit app then it just terminates process in 5 or less seconds.

Dont trigger the worker thread to start writing to the zip until all pendingStorages's in PersistenceManager have been serialized and queued. This will give a better atomicity to the operation (including in cases of non-graceful shutdowns).

It doesn't really makes any difference. We already discussed this.

Case 1. Wait until all pendingStorages queued. We waited, started save, new item added in next nanosecond. Same issue.
Case 2. Wait until all pendingStorages queued with idle delay (i.e. 1 second). Two scenarios: (1) save never happens or happens very rare, because we can't get long enough idle interval if delay is too long or (2) we have next item in nanosecond after idle delay expired - no difference if delay is short.

Only viable solution with current save system is to guaranteed save order where you can't save object without dependency (i.e. how it works for work orders). So you should always save (mark dirty) a dependency first and only then save a referrer. Also in this case you may need to detect and remove orphan dependencies if they doesn't make sense without parent entity.

johnkslg commented 4 years ago

Are the game threads being halted on shutdown? That would matter for being able to get everything saved out. That way it should generate no more pending saves and allow us to iterate on the queue till empty.

I'll see if I can still repro the corruption with a ctrl-c exit.

johnkslg commented 4 years ago

I also think that yes, in theory the saves could overlap in such a way that youd never be able to get off a clean save, but in practice we will likely always be able to.

I also would like to avoid the solution of managing dirty-marking-order, because thats a game of whack a mole: no way to detect problems till a rare failure occurs. I would like to find a systemic solution.

johnkslg commented 4 years ago

Ill make a separate issue for removing console app, and leave this one assigned to me for a systemic solution investigation.

mirasrael commented 4 years ago

The only correct solution here is a true transaction system.

mirasrael commented 4 years ago

btw, issue happened with graceful shutdown. So probably it is an issue with how CivicAction saved/deserialized.

johnkslg commented 4 years ago

Got some changes coming in that will treat this, closing.

StrangeLoopGames / EcoIssues

Save-system needs changes to prevent corruptions #15304