Open HertzDevil opened 1 year ago
It feels like LiteralPool
could be much more generalized. But I'm wondering why it needs to be different fom StringPool
? Isn't it the same behaviour except that it does not register the new string if not found? Couldn't this be solved with a separate method for this use case on StringPool
?
Thinking bigger, maybe the compiler should put all literals into a pool for easy access to re-use them? Strings are designed to be immutable, so there should be no mutability issues. Literals are even write protected so if you try to mess with a pooled string, it will explode. For mutating string data, you'd need to explicitly dup it.
This issue has been mentioned on Crystal Forum. There might be relevant details there:
https://forum.crystal-lang.org/t/stringpool-make-the-hash-key-part-of-the-public-api/6766/5
Each
JSON::Lexer
owns itsStringPool
for key strings, and callingfrom_json
on aJSON::Serializable
type creates a freshJSON::Lexer
. Those string pools are good at reusing JSON keys in the same string, but not when deserializing multiple objects from different strings. Consider this:This will create 10,000
StringPool
s, the stringsx
andy
will each be looked up exactly once perStringPool
, and all of them will miss. Recently I had profiled a Crystal application only to discover that those key strings are among the largest living objects allocated throughout program execution. IMO the standard library should do better than this.A naive approach would be sharing one
StringPool
among all deserializations, say via a class variable. This means those pools won't ever be garbage-collected, and if the lexers look them up on all JSON keys, the memory would blow up whenever one encounters aHash(String, _)
field or aJSON::Serializable::Unmapped
type, where JSONs with many different keys are expected to parse successfully.Instead, what we really want is a pool of string literals for all the field names. Conceptually, it translates a
Bytes
to aString
, with the constraint that the target strings must reside in read-only memory and not allocate:The
LiteralPool
implementation above would be slower than aStringPool
, but alternative data structures exist (e.g. prefix tree) that would have comparable time complexities, and some can even themselves be defined in read-only memory. (Note thatHash
isn't one of them asObject#hash
is never a constant expression.)Now the
StringPool
would only be useful forHash(String, _)
andJSON::Serializable::Unmapped
in JSONs that nonetheless have many duplicate keys. If we believe these are rare, we could makeJSON::Lexer#@string_pool
a lazy getter or even drop it entirely.