HalcyonGrid / halcyon

Halcyon 3d virtual reality world simulator
BSD 3-Clause "New" or "Revised" License
20 stars 20 forks source link

Asynchronous OAR saves lead to incomplete OAR archives #74

Open appurist opened 5 years ago

appurist commented 5 years ago

This is a particularly nasty one, if OAR files are to be considered backups. And in the case of InWorldz OAR files captured at grid shutdown time back in July, it's worse, as these are the only copies.

The common procedure for taking a copy of a region is to issue a save oar command from the console, wait for it to complete, then copy the output OAR file elsewhere. Unfortunately, if you do not wait some amount of time after the OAR save has completed, the resulting file may be incomplete. This is due to the fact that the process returns to the console and new commands can be typed before the OAR save has completed.

We had several incomplete OAR files from InWorldz, and I assumed this was some form of network or file transfer problem long after the OARs were captured, but I no longer believe this to be the case. I have witnessed on my own machine a difference in file sizes after copying a resulting OAR file (after the console command has completed and returned to allow new commands). If I refreshed the bin folder repeatedly, I could see the size of the OAR file increasing; this is after both the console command completed and even after my manual copy of the OAR file from the bin folder to elsewhere had completed.

This means to take reliable OAR files, you must not only wait for the command to complete, but repeatedly refresh the folder contents waiting for the file size to remain constant, and even then, wait a bit longer to become confident that it has in fact completed. There is no deterministic way to know if an OAR file save is finished.

appurist commented 5 years ago

I thought at first this was due to asynchronous writes to the file (e.g. Windows asynchronous write completion), but in all cases the resulting (incomplete) OAR files contained only files in the assets/ subfolder in the OAR contents.

I'm speculating now, but I believe this is a clue. At some point in the past, Tranq added asynchronous gathering of the assets. There is a class called AssetsArchiver that is fed into a new AssetsRequest() (ArchiveWriteRequestPreparation.cs line 269 in my load-filtered branch. It passes ReceivedAllAssets as a callback to run when all assets are loaded and I think all of that code (which dumps the actual contents of the region) runs asynchonously from the method that was blocking the console. Thus, if you wait for it to return, and then immediately copy the resulting OAR file somewhere, it won't (or is unlikely to) have any of the actual objects in the region present in the OAR.

appurist commented 5 years ago

(edited): Okay it's not as bad as I thought a few minutes ago.

AssetsRequest.cs has as it's primary method Execute() which uses AssetRequestCallback (the next method below Execute()) as it's callback.

Execute() has a foreach to fetch all the requested assets. When the total number of requests completed matches the expected, it fires up a new (single) thread. But that makes the callback method passed to it run asynchronously. Except for the control file and assets, all of the oar saving code runs in that callback. So all of the objects etc are saved after the command itself has returned control back to the console.

appurist commented 5 years ago

Some good news, there is a bit of a workaround to the non-deterministic nature of the save oar command.

The message: [ARCHIVER]: Wrote out Halcyon archive forRegionName is displayed on the console, from the async thread, once the process is complete. This happens after a Flush call and a Close call on the tar stream,

So it should be safe to ignore the fact that control has returned to the console and wait for that message, then copy the resulting OAR.

appurist commented 5 years ago

There is also a OnOarFileSaved event triggered once the process is complete, which the console thread could wait for before returning. It is currently ignored and unused. This is probably the actual bug here.

sonjamichelle commented 5 years ago

Are any of the OAR commands functional in the master branch? I need to archive a region.

kf6kjg commented 5 years ago

They should all be functional and usable. Especially for the original goal of archival and recovery of a region on the same grid - assuming you give it time as noted in this issue report. Exporting a region for import on another grid and export-to-OpenSim were other issues that should now be resolved in master.

sonjamichelle commented 5 years ago

Is there a timeout to the process? I'm trying to make an oar of my Market Region. I've tried several times. Each time the file gets bigger, but it still is missing the objects, terrain and settings folders. When I untar the archive I get and EOF error. A save oar of my welcome region succeeds. It is of course considerably less full of objects.

kf6kjg commented 5 years ago

As noted above, you have to wait. As the code currently stands you'll have no notification when the export is complete: when the terminal comes back and you can enter other commands it doesn't mean the export is done, it just means the export has been started and is ongoing.

Right now the best thing you can do is start the OAR export and go get a warm drink, have a meal, or go to bed and let it run overnight. Then look at the resulting file size WITHOUT OPENING IT for a few minutes to see if it's still changing size: if it is, leave everything alone as it's still working. If there's no file size change for several minutes it's possible that it's finally done.

That's the problem this issue is reporting: there's currently no way to know for certain when an OAR is finished exporting.

sonjamichelle commented 5 years ago

Sounds Good. I do an over night run of the command. Let ya know what happens.

appurist commented 5 years ago

It typically only takes a few seconds (like 10 or 20 seconds) to complete after control returns to the console. I usually wait about 1 minute and then start refreshing file size to see it has settled down and isn't changing. Then I'm good.

Also, and this mostly repeats what @kf6kjg said above with different wording, but the changes in the OAR branch were merged into the master branch a couple of days ago, so it has everything. These haven't been released yet, but if you build from master, it should have all the fixes. Most of the changes were on the loading side, especially in the load filtered command which was new and mostly for the InWorldz export case, but there are also critical fixes in the save oar command in terms of gathering the IDs of the assets to include in the OAR, if you intend to restore the OAR file elsewhere (different asset server). If so, be sure to append the 1 at the end of the command to indicate to include assets, e.g. save oar region.oar 1.

sonjamichelle commented 5 years ago

Something is up with at region. I only gets to 850MB or so then stops. All my other regions save and load properly. I'm going to try a few things on the region to see what's killing it.

sonjamichelle commented 5 years ago

I'm currently running 0.9.41.7021 On all my regions. I see that the current master branch builds 0.9.41.7027. I loaded that to a couple of regions. Will load it to my market region and retry the save command.

appurist commented 5 years ago

Nothing in the halcyon.log file to indicate a problem? Are you using save oar region.oar 1?

The build number at the end of the version number isn't really a version number. It's major.minor.revision.datestamp so that 7021 is just the day it was built. If you do a clean and a rebuild, it will have a new datestamp at the end of the version. 7027 is just 6 days newer (if a full rebuild was done). There hasn't been a new release since 0.9.41 yet (I think we're do now with the huge number of recent major updates) so they will all indicate that but newer builds from master have much more updated code.

sonjamichelle commented 5 years ago

I'll run it again and tail the log. See what comes up.

sonjamichelle commented 5 years ago

I don't see anything negative in the logs. But then again, I'm not 100% sure what I'm looking for. Anything tagged [ARCHIVER] seems normal.

Here is the log file if you want to look. Halcyon.log

appurist commented 5 years ago

The very beginning of the log is very interesting in that it indicates an error we could cleanup with that deserialization of the objects. If it can't find the asset, it's still trying to add the part with a null key (the first error and traceback at the start of the log): [SERIALIZER]: Deserialization of xml failed and An element with the same key but a different value already exists. Key: 00000000-0000-0000-0000-000000000000 That's a missing error check at SceneObjectSerializer.cs on line 163. But other than the log messages, this is basically normal and for the purposes of your problem reported, I don't think it's relevant in any way.

[SERIALIZER]: Deserialization of xml failed with System.Xml.XmlException: Data at the root level is invalid. That one I think is normal in that it reports this when it tries to interpret an asset as a normal OpenSim asset (XML) when in fact it is a Thoosa (modern Halycon) binary asset. There is code to check if it is a Thoosa asset (that's one of the recent fixes) and if that fails to try it it as XML, and the XML test is (still) noisy (as we can see from the log), and assumes there are no alternatives. I also didn't make the full set of changes to handle Thoosa assets in all cases, I just completed all the cases I could find other than the case of coalesced inventory (Contents) items (comprised of multiple objects). So if you're encountering a coalesced object that has one or more Thoosa objects in the combined Contents item, it may also fail to deserialize it and report a bunch of XML decode errors on the Thoosa binary asset. That's a limitation of the current code, although far less limited than it was a few months ago. Basically the migration to Thoosa assets years ago was very incomplete. I described this in more detail here.

[SERIALIZER]: Deserialization of xml failed with System.Xml.XmlException: Data at the root level is invalid. That is a cascade error of again trying to interpret a Thoosa binary asset as XML (no chance of that succeeding).

All of these cases above are only relevant when moving a OAR file from one grid/installation/asset server to an other. In that case, assets must be included in the OAR, the command is issued with the trailing 1 parameter, and there is an attempt by the code to find all referenced assets and include them in the OAR file. The code to do that was incomplete, especially in the area of coalesced Contents items, although there is partial support for coalesced objects, the problem is the combination of coalesced objects and Thoosa assets. So this only matters in the OARs including assets, not "backup" OARs for regions on a grid. This is why it went undetected. It was only in my more complete tests for full-fidelity OAR exports with assets and reloading on the OpenSim side that I found many of these issues.

Many are addressed in the current master but there is at least one case that is not (it reports on the console that a coalesced object is skipped) and apparently from your log it seems there is another case (perhaps a mix of XML and Thoosa assets in a coalesced object).

After that, it's busy storing the assets in the OAR file when suddenly at 14:45:00 it reports that the central Grid service was unreachable:

2019-03-28 14:44:49,074 [STP SmartThreadPool Thread #10] INFO  OpenSim.Region.CoreModules.World.Archiver.AssetsArchiver - [ARCHIVER]: Added 17800 assets to archive
2019-03-28 14:45:00,913 [103] ERROR OpenSim.Region.Communications.OGS1.OGS1GridServices - [OGS1 GRID SERVICES]: Region lookup failed for: 1099511628031744 - Is the GridServer down?System.Net.WebException: The operation has timed out
2019-03-28 14:45:00,913 [103] WARN  OpenSim.Region.CoreModules.ServiceConnectors.Interregion.RESTInterregionComms - [REST COMMS]: Region not found 1099511628031744
2019-03-28 14:45:00,913 [50] WARN  OpenSim.Region.CoreModules.ServiceConnectors.Interregion.RESTInterregionComms - [REST COMMS]: Region not found 1098412116404736
2019-03-28 14:45:00,929 [108] WARN  OpenSim.Region.CoreModules.ServiceConnectors.Interregion.RESTInterregionComms - [REST COMMS]: Region not found 1098412116404480
2019-03-28 14:45:00,913 [75] WARN  OpenSim.Region.CoreModules.ServiceConnectors.Interregion.RESTInterregionComms - [REST COMMS]: Region not found 1098412116404224
2019-03-28 14:45:00,929 [102] WARN  OpenSim.Region.CoreModules.ServiceConnectors.Interregion.RESTInterregionComms - [REST COMMS]: Region not found 1098412116403968
2019-03-28 14:45:00,929 [79] WARN  OpenSim.Region.CoreModules.ServiceConnectors.Interregion.RESTInterregionComms - [REST COMMS]: Region not found 1099511628032768

Did you have other neighbor regions up and running, took them and the central Grid service down?

Then nothing related to the OAR save.

A [SCENE]: Bulk storing backup in the regions runs from a timer later to persist the changed objects, then at 14:51 there's a slow Phlox timeslice (probably indicating the machine itself was under CPU load in some other processes). Then another persistence backup runs, and at 15:01 you log out.

It seems all was running fine until it lost contact with the grid service, and perhaps other regions (or they were shut down) or perhaps this machine lost its network connection. Something of that nature.

I think losing the connection to the Grid server is the primary clue here. It probably lost contact with the asset server too so it couldn't write any more assets. Or if you're saving these to a network file system, it may have lost it's ability to write more to the OAR file.

appurist commented 5 years ago

Okay I thought that was the end of the OAR save but I see it then continues: 2019-03-28 15:03:34,963 [ConsoleThread] INFO OpenSim.Region.CoreModules.World.Archiver.ArchiverModule - [ARCHIVER]: Writing archive for region Lunar to Lunar.oar

I think this is a new attempt at an OAR save, but the underlying problem of the Grid server being unavailable does not look addressed.

Also I don't see the previous OAR save being completed (although it's a lot of text so I may have missed it) so it almost seems like the previous one is still in progress, which could certainly affect a new OAR save attempt.

If you have a failure to save an OAR, with it looking like it is just incomplete, I would strongly suggest that you shut down the region and restart it. There's a good chance that in this case it won't restart because it cannot communicate with the central Grid service, and then that will highlight a problem that needs to be addressed first.

If it does start, it would seem that the Grid service problem was temporary, and that restarting the server worked as a retry, or that the previous failure to communicate was somehow "fake", like if there was a firewall or something that had a timeout and closed the connection due to inactivity while the asset dump ran. Note that it can take a long time to go through the assets if there is a lot. I made sure to do my IW export OARs from an installation where both the asset server and region ran on an SSD drive so that I wasn't limited by disk throughput, especially due to the asset storage.

But regardless, I think you need to get past the Grid server connectivity problem and also restart the region if it comes back to the command prompt and you do not see the OAR complete and then the size of the OAR file stabilize to a constant size.

The end of the OAR process should look like this:

[ARCHIVER]: Received 6670 of 6800 assets requested
[ARCHIVER]: Added region settings to archive.
[ARCHIVER]: Added terrain information to archive.
[ARCHIVER]: Added scene objects to archive.
[ARCHIVER]: Wrote out Halcyon archive for RegionName

If you dont' see Wrote out Halcyon archive for... then it has not completed yet. Once you do, check the file size of the OAR and await it to settle down.

appurist commented 5 years ago

Okay interesting, for the 5 ending lines above, I don't see the first one in your log anywhere.

Edited: So if that first Received line above does not appear, it means that all that assets have been not yet been fetched from he asset server. It's still (very slowly, or stuck) fetching assets.

This could be as simple as the asset fetches taking a long time. Are you using asset storage via a WHIP asset server? Is it local or on another machine?

It's also possible from the log messages that it may be buffering these assets, which introduces the possibility that Halcyon could need a lot of memory, which introduces the possibility that the process is getting slower and slower due to virtual memory (pagefile) usage, and "thrashing" due to memory swapping.

I'd just check the Performance tab in Task Manager to see if your memory usage is high, or CPU usage. The Details tab would show the process memory usage. If you're running this one a "tight resource" machine, you may need to expand the memory available to save this region.

appurist commented 5 years ago

By the way, this would have been better raised as a new issue. It's clearly a separate problem if your OAR save does not complete at all. This is unrelated to the asynchronous notification issue I raised above.

appurist commented 5 years ago

I just realized the log has many server runs so of course there were restarts in there too. So I checked one of the OAR saves to see if a restart interrupted it, and found this (I removed the less useful info after the timestamp, to make the lines shorter):

2019-03-28 21:04:37,493 [SCENE]: Stopping Physics
2019-03-28 21:04:37,775 [SHUTDOWN]: Disposing asset cache and services
2019-03-28 21:04:37,791 [WHIP.AssetClient]: Failure fetching asset 4a1ea71a-e17b-459d-9cea-b0dece17fe73
InWorldz.Whip.Client.AssetServerError: A blocking operation was interrupted by a call to WSACancelBlockingCall ---> System.Net.Sockets.SocketException: A blocking operation was interrupted
... (several times)
2019-03-28 21:04:37,821 [STP SmartThreadPool Thread #8] DEBUG OpenSim.Region.CoreModules.World.Archiver.AssetsRequest - [ARCHIVER]: Successfully received 17845 assets and notification of 34 missing assets
2019-03-28 21:04:37,821 [SHUTDOWN]: Shutdown processing on main thread complete.  Exiting...

So it was definitely stuck fetching assets. However, it seems to have completed almost all of them. That seems unusual, given that it was run for a while and then interrupted at 99.81% complete. It would be difficult to time that.

2019-03-28 21:04:37,806 [ARCHIVER]: Successfully received 17845 assets and notification of 34 missing assets
...
2019-03-28 21:04:37,821 [ARCHIVER]: Successfully received 17845 assets and notification of 34 

Mostly for @kf6kjg ... how did this message come out twice... it's like there were two OAR backups running for this first one. It's the same time second, just a 15ms later. 2019-03-28 21:04:37,821 [SHUTDOWN]: Shutdown processing on main thread complete. Exiting... Those two completions above were at shutdown time. (Server restarted.)

...
2019-03-28 22:19:27,347 [ARCHIVER]: Successfully received 17845 assets and notification of 34 missing assets
2019-03-28 22:19:27,347 [SHUTDOWN]: Shutdown processing on main thread complete.  Exiting...

and again

2019-03-29 01:58:05,343 [SHUTDOWN]: Shutdown processing on main thread complete.  Exiting...
2019-03-29 01:58:05,343 [ARCHIVER]: Successfully received 17845 assets and notification of 34 missing assets

It's always the same count of assets. Now, the two messages completed in the other order (due to multithreading), but it seems like the only way to get the asset fetch to complete is to shut down the server causing outstanding fetches to be canceled.

The "missing" count here doesn't indicate fetch failures: those are normally successful fetches returning a null (e.g. asset not found). But in this case, some number of the 34 are probably cases where the operation was interrupted and canceled, thus causing it to return a null and be added to the count that reached 34. 2019-03-28 14:41:57,850 [ARCHIVER]: AssetsRequest executed looking for 17879 assets

17845+34 = 17879 so there are no incomplete assets here, but some of the 34 are probably canceled requests.

So before the server restarts, there are asset fetch errors due to cancellation of the requests (for shutdown). There are two that report A blocking operation was interrupted: [WHIP.AssetClient]: Failure fetching asset 4a1ea71a-e17b-459d-9cea-b0dece17fe73 [WHIP.AssetClient]: Failure fetching asset 4b00fae8-d69f-4881-947f-73f368690a53 So there are exactly two of the 34 missing ones are holding up the process, and we know their IDs. I checked the log for any other occurrence but they are not seen elsewhere other than the asset cancellation error message.

Sonja, if you are running WHIP asset server, can you check the WHIP console for errors?

sonjamichelle commented 5 years ago

I have WHIP running as a background process on an kUbuntu 18.10 machine. So I don't have access to the console.

This is how I call WHIP: With a systemd service file at /etc/systemd/system.whip.service

[Unit]
Description=WHIP Halcyon Asset Server Service
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
Restart=always
RestartSec=5
User=whip
WorkingDirectory=/etc/whip
ExecStart=/etc/whip/whip

[Install]

WantedBy=multi-user.target

I have Halcyon running across Four servers. 2 Grid Servers. The first (Win 2012R2) running User, Grid, Messaging, Aperture and the bulk of the Halcyon.exe services. The second (Win 2012R2) running Aperture and several more Halcyon.exe (This is the server running the region I am trying to back up with save oar.) The third (kUbuntu 18.10) running WHIP and MySQL. The fourth (Win 2012R2) running IIS. Halcyon.exe contacts IIS and gets the region info as documented by @Vinhold in the wiki (https://github.com/HalcyonGrid/halcyon/wiki/Configuration).

kf6kjg commented 5 years ago

You can get a copy of the WHIP log from systemd. Vin should know how to do that if you need help.