StirlingLabs / BigBuffers

Memory Efficient Serialization Library able to deal with big datasets. Based on FlatBuffers with minimal changes.
Apache License 2.0

Use native dictionary for key on C#? #3

Open Rjvs opened 3 years ago

Rjvs commented 3 years ago

As Ben says:

It (currently) tries to emulate the behaviour of a hash map with binary search and a sorted array, I think it's better we just use native tools where they're available for quick insertion and update.
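For contrast, the two approaches look roughly like this. This is illustrative only — the `Demo` class and sample keys are made up, not BigBuffers' generated code:

```csharp
using System;
using System.Collections.Generic;

class Demo
{
    static void Main()
    {
        // Emulated hash map: sort once, then binary-search each lookup (O(log n)).
        var keys = new[] { "delta", "alpha", "charlie", "bravo" };
        Array.Sort(keys, StringComparer.Ordinal);
        var found = Array.BinarySearch(keys, "charlie", StringComparer.Ordinal) >= 0;
        Console.WriteLine(found); // True

        // Native tool: Dictionary gives O(1) expected lookup and cheap insert/update.
        var map = new Dictionary<string, int>();
        foreach (var k in keys) map[k] = k.Length;
        Console.WriteLine(map["charlie"]); // 7
    }
}
```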

TYoungSL commented 3 years ago

Yeah, the generated code uses Array.Sort when building collections and a binary search when resolving keys; it should probably be replaced with a Dictionary<TKey, TValue>. It also calls System.Text.Encoding.UTF8.GetBytes on every string lookup. It's not great.

The generated implementation appears to be an order of magnitude slower than any other dictionary/map implementation. It would be better to just use Dictionary<TKey, TValue> with a generated Comparer<T> that properly uses the Offsets and ArraySegment<byte>s or something.
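A hypothetical sketch of that direction: keys stay as ArraySegment<byte> views over the backing buffer, and a hand-written equality comparer works on the raw bytes, so comparisons never re-encode strings. `Utf8SegmentComparer` and the FNV-1a hash are invented for illustration, not the generated comparer:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical: compare dictionary keys as raw UTF-8 segments in the
// backing buffer instead of materializing strings per comparison.
sealed class Utf8SegmentComparer : IEqualityComparer<ArraySegment<byte>>
{
    public bool Equals(ArraySegment<byte> x, ArraySegment<byte> y)
        => x.AsSpan().SequenceEqual(y.AsSpan());

    public int GetHashCode(ArraySegment<byte> seg)
    {
        // Simple FNV-1a over the bytes; a stand-in for a real hash.
        unchecked
        {
            var hash = 2166136261u;
            foreach (var b in seg.AsSpan()) hash = (hash ^ b) * 16777619u;
            return (int)hash;
        }
    }
}

class Demo
{
    static void Main()
    {
        var buffer = Encoding.UTF8.GetBytes("alphabeta");
        var map = new Dictionary<ArraySegment<byte>, int>(new Utf8SegmentComparer())
        {
            [new ArraySegment<byte>(buffer, 0, 5)] = 1, // "alpha"
            [new ArraySegment<byte>(buffer, 5, 4)] = 2, // "beta"
        };
        var probe = new ArraySegment<byte>(Encoding.UTF8.GetBytes("beta"));
        Console.WriteLine(map[probe]); // 2
    }
}
```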

The existing generated implementation is CreateSortedVectorOf; you have to build the tooling around the keyed types yourself anyway, so using a dictionary would actually be less code.

I think this is cruft translated over from the C/C++ side and mirrored across all the language generators. It's not the best solution for C#, Go, or any other managed language. Big oof.

Even map/unordered_map for C++ would do better here.

I think we should drop it from code generation and create comparers for keyed types that can be used from C#, with analogs for C++ and Go. Maybe keep it for C? Not sure.

If we retool it to generate comparers, a keyed type wouldn't need to compare the entire structure as it would under other models; it would only compare the keys, and the language libraries can handle that properly when instances are thrown into a collection. For non-keyed types, we'd generate full structural comparers.
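The key-only comparer idea might look like this. `KeyedRow` and `KeyedRowComparer` are invented stand-ins for a generated keyed type and its generated comparer:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical keyed model: only Key participates in ordering and equality,
// mirroring the idea that generated comparers compare keys, not whole structs.
readonly struct KeyedRow
{
    public KeyedRow(int key, string payload) { Key = key; Payload = payload; }
    public int Key { get; }
    public string Payload { get; }
}

sealed class KeyedRowComparer : IComparer<KeyedRow>, IEqualityComparer<KeyedRow>
{
    public int Compare(KeyedRow x, KeyedRow y) => x.Key.CompareTo(y.Key);
    public bool Equals(KeyedRow x, KeyedRow y) => x.Key == y.Key;
    public int GetHashCode(KeyedRow row) => row.Key.GetHashCode();
}

class Demo
{
    static void Main()
    {
        var cmp = new KeyedRowComparer();
        // Rows with equal keys compare equal even when payloads differ.
        Console.WriteLine(cmp.Equals(new KeyedRow(7, "a"), new KeyedRow(7, "b")));

        // Any native collection can consume the comparer directly.
        var set = new SortedSet<KeyedRow>(cmp) { new KeyedRow(3, "x"), new KeyedRow(1, "y") };
        Console.WriteLine(string.Join(" ", set.Select(r => r.Key)));
    }
}
```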

The reason there's no single shared collection for keyed models is pretty obvious to me: you could have multiple contexts, so you'd need a way to namespace them. Even namespaced collections would be more overhead than letting the consumer manage the collections themselves; sometimes they'll namespace, sometimes they'll use a singleton. A shared collection is also more code to implement, (slightly) less performant, and carries data isolation hazards, among other problems.

EF and similar data model frameworks have you create models of the entire schema and bind to that, that becomes your isolation context; the collections on the schema model are the tables.

We could do that for BigBuffers, but it's probably better to just do it manually for now.

Ben should definitely follow the C# convention of creating Comparer<T>s for the models, especially for types we have hash keys for.

We should probably use the backing store for comparing the UTF-8 bytes: Span<byte> with SequenceEqual, implement GetHashCode, and support comparing against native string types too. .NET uses XxHash32 for strings and other types now, I think. We could read the UTF-8 bytes as Runes (à la StringRuneEnumerator) to produce hash codes that match between a native string and a UTF-8 byte array. That work could be pushed into BigBuffers or into a separate shared library, or the Comparer could live in Utilities.Net.

TYoungSL commented 3 years ago

IEquatable<> and IComparable<> are implemented for structs and tables with keys. 😄