cinderblocks / libremetaverse

A fork of the libopenmetaverse library striving for performance improvements and up-to-date compatibility with SL/OS/Halcyon
BSD 3-Clause "New" or "Revised" License

Fixing Dictionary.Add(key, value) throwing a 'duplicate key exists' error and exiting #73

Closed GwynethLlewelyn closed 2 years ago

GwynethLlewelyn commented 2 years ago

Hi,

Please be patient as I submit my first contribution :) Also, I'm not a C# programmer myself, so many of my assumptions may stem from simple ignorance and lack of experience, with the language itself, with the surrounding environment (CLI-based builds on Linux and macOS), and with the LibreMetaverse library itself, of which I have been a very occasional user, over a decade ago :)

Taking all the above into account, here's my issue: I've been trying to fix an obscure memory leak in one of my projects which uses LibreMetaverse. It's very likely that the issue is on 'my' side, but, as I dug deep into the logs to figure out where exactly my problem is, I found a few bugs in LibreMetaverse itself, some of which might have been around for ages, and which have only surfaced now because of unusual conditions (I'm just guessing!).

The first one I wanted to fix was relatively easy to spot: I was getting thread dumps from Dictionary.Add(key, value) throwing the (uncaught) error that the key already existed (in ObservableDictionary.cs and InternalDictionary.cs). This happened when the code decided to re-insert a session key that was still active, during an attempted retry to log in again.

The quick-and-dirty fix, of course, is to use Dictionary[key] = value; instead, which handles 'duplicate keys' by simply overwriting the old value with the new one, so it will always succeed.
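The difference between the two calls can be shown with a plain Dictionary<TKey, TValue>; the session-key strings here are made up for illustration and are not actual LibreMetaverse keys:

```csharp
using System;
using System.Collections.Generic;

class DuplicateKeyDemo
{
    static void Main()
    {
        var sessions = new Dictionary<string, string>();

        sessions.Add("session-key", "first login");

        // Add() throws ArgumentException when the key is already present:
        // sessions.Add("session-key", "retry login");  // would throw here

        // The indexer overwrites the existing value instead of throwing:
        sessions["session-key"] = "retry login";

        Console.WriteLine(sessions["session-key"]); // prints "retry login"
    }
}
```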

That certainly fixed things upstream, but I have no idea if this is the intended behaviour (e.g. throwing an explicit error and exiting when a key is inserted again, instead of simply replacing its value), or whether it was simply left that way in the code since the late 2000s and forgotten :) (I could check; I didn't.)

As I was doing that, by sheer chance, I came across Microsoft's announcement of the end of life for .NET 5. Since I've been exclusively using .NET 6 for a while now (and am eager to test out .NET 7!), I thought I could try to add net6.0 support to LibreMetaverse and hope it didn't break anything major. It didn't; however, it showed that the code is crowded with all sorts of dangerous, obsolete libraries which, while convenient, can be security issues. I did not change those. My biggest concern at the moment is the binary format of files saved via the BinaryFormatter.Serialize functionality; Microsoft's documentation gives some hints on how to replace it with a safe solution (serialising objects to XML as opposed to binary would be one alternative, but it would make files so much bigger). None of them, however, are drop-in replacements; all require some extra work. Switching would also mean that 'old' backup files would stop working, which is never a good idea.
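For illustration only, one possible shape of a System.Text.Json replacement might look like this. CacheEntry and SafeSerializer are hypothetical names, not LibreMetaverse types, and, as noted above, this would not be a drop-in replacement for existing binary files:

```csharp
using System.IO;
using System.Text.Json;

// Hypothetical stand-in for whatever type LibreMetaverse actually serializes.
public record CacheEntry(string Name, int Version);

public static class SafeSerializer
{
    // JSON output is larger than BinaryFormatter's binary blobs, but it
    // avoids the deserialization vulnerabilities Microsoft warns about.
    public static void Save(string path, CacheEntry entry) =>
        File.WriteAllText(path, JsonSerializer.Serialize(entry));

    public static CacheEntry Load(string path) =>
        JsonSerializer.Deserialize<CacheEntry>(File.ReadAllText(path))!;
}
```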

I was also getting bombarded with a ton of Unrecognized internal caps exception errors. These were mostly uncaught because they came from the grid with a 500 Internal Server Error message. Now, AFAIK, the current convention is that 'anything which is not OK and not explicitly recognised should be treated as a request by the client to drop the caps connection' (I'm pretty sure I read this somewhere, but I've spent the past two hours searching wildly for that reference and couldn't find it again 😢). This is perhaps an awkward way of terminating a connection, but, alas, it seems that it's the way it's done. As such, 500 error codes should not be merely ignored while processing continues to execute: what happens in the current version is that the error code, being 'unknown', simply forces the client to request the same capability again, which promptly gets another 500 from the grid, and so on and on. As a result, the client ends up holding several open caps requests that will never be fulfilled, each taking precious memory, CPU cycles, and some bandwidth. That's not a huge issue with a single agent (it's still worrisome, though), but once you start launching several agents all across the grid, you run into serious trouble.

My simple fix is to terminate the connection when a 500 is received. This supposedly disposes of any overhead required to keep that connection alive. I cannot see any difference in perceived functionality when watching what the 'bots are doing in-world. What certainly happens is that the client will ask for new capabilities if those it got before (usually three, and usually immediately after logging in) have expired or are invalid. This, I believe, is the expected behaviour. I've certainly seen the client much happier (very, very few warning messages), and the memory consumption, while high, at least seems to be bounded between a 'high' and a 'low' limit (around 5-10 MBytes per agent), instead of growing indefinitely.
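A minimal sketch of the idea, with hypothetical disconnect/retry callbacks standing in for the real LibreMetaverse caps client internals (this is not the actual patch, just the shape of it):

```csharp
using System;
using System.Net;

static class CapsResponseHandling
{
    // Sketch only: the real code lives inside the caps event queue client.
    public static void HandleCapsResponse(
        HttpStatusCode status, Action disconnect, Action retry)
    {
        if (status == HttpStatusCode.InternalServerError)
        {
            // Treat 500 as the server asking us to drop the caps connection,
            // instead of endlessly re-requesting the same capability.
            disconnect();
            return;
        }

        if (status != HttpStatusCode.OK)
        {
            // Other, transient errors may still warrant a retry.
            retry();
        }
    }
}
```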

While doing all the above, I found it a nightmare to recompile (and, later on, package) the whole project under a non-GUI environment, such as what I have on my two test platforms, Linux and macOS. Granted, only three subprojects require the GUI; the rest is perfectly platform-agnostic (I haven't tried to compile it on my brand new Raspberry Pi Zero W, but I have every reason to suspect it will work flawlessly :) ). After several (failed) experiments at writing a unified solution file that could handle both the GUI and non-GUI environments, and after skimming through endless discussions on social media about how silly it is for Microsoft not to provide a simple alternative to a 'hard fail' when compiling a GUI project in a non-GUI environment, I gave up and simply created, from scratch, a new solution file for the non-GUI crowd (an approach which, apparently, is what the majority of C# developers use, not without much grumbling and complaining, of course). The trouble now will be keeping both solution files in sync; that's naturally much easier if you're doing development under Windows.
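With a separate non-GUI solution file in place, a CLI build would look something like this (the solution filename below is a placeholder, not necessarily what the new file is actually called):

```shell
# Restore and build only the cross-platform projects via the non-GUI solution.
dotnet restore LibreMetaverse.NoGUI.sln
dotnet build --configuration Release LibreMetaverse.NoGUI.sln
```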

Since the two GUI-based subprojects are merely 'examples' (they're neither required to exist, nor are they supposed to be some sort of 'flagship demo' of the capabilities of LibreMetaverse), I'd guess it would be far easier to simply replace the Windows GUI with a multi-platform alternative, such as those listed and suggested in this StackOverflow thread. There is also a thorough side-by-side comparison on GitHub of the three leading cross-platform solutions. For the purpose of the demo examples found in LibreMetaverse, I'd probably go with Avalonia, but that's a discussion for another day...

I've also added, at the end of the README.md, some minimalistic information on how to use LibreMetaverse for utter .NET newbies (such as myself!).

Note: I haven't checked all the CI pipelines, but I'm aware that CodeQL, at least, does not support net6.0 yet, which means that if you use it, it will throw an error for the net6.0 compilation (net5.0, of course, will succeed). I hope they do an upgrade to their code soon — since net5.0 is going away, net7.0 is at the door, and that means that CodeQL is falling behind...

GwynethLlewelyn commented 2 years ago

Ah! It seems that you're using Appveyor! Well, sadly, Appveyor also hasn't updated its SDKs. It's using 6.0.201 for Windows builds and 6.0.200 for Linux. The current release is 6.0.202. Is there anything that can be done (config file, Appveyor settings) to overcome that?
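One possible workaround, if AppVeyor allows it, would be to install the desired SDK explicitly with Microsoft's official dotnet-install script instead of relying on the image's pre-installed SDKs. This appveyor.yml fragment is a hypothetical sketch, not tested against this repository:

```yaml
# Hypothetical appveyor.yml fragment: pin the .NET SDK version explicitly.
install:
  - ps: Invoke-WebRequest https://dot.net/v1/dotnet-install.ps1 -OutFile dotnet-install.ps1
  - ps: ./dotnet-install.ps1 -Version 6.0.202 -InstallDir "$env:ProgramFiles\dotnet"
```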

AlexMcArdle commented 2 years ago

Great improvements!

I was able to get CodeQL to pass with cinderblocks/libremetaverse/pull/74. This configures CodeQL to build the individual projects (all of the non-GUI related ones) rather than the entire solution.

GwynethLlewelyn commented 2 years ago

Awesome! Thanks so much, @AlexMcArdle — as far as I can see, your CodeQL-fu worked flawlessly, all tests have now passed, and I can also confirm on my side that locally compiling (under macOS and Ubuntu 20.04) still works (heh!).

All we need now is a workflow approval :) and I guess this is good to go!

cinderblocks commented 2 years ago

Thank you for the contributions. I had not seen these as I have been dealing with a lot lately, and GitHub e-mails were being sorted to my junk folder.

GwynethLlewelyn commented 2 years ago

No worries, I was also trying to tweak things so that they gave no errors on the (many) CI workflows, so I missed your PR commit... lol

Thanks for all your patience!