crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.25k stars 1.61k forks source link

Sharing `String`s between deserializations of a `JSON::Serializable` type #13638

Open HertzDevil opened 1 year ago

HertzDevil commented 1 year ago

Each JSON::Lexer owns its StringPool for key strings, and calling from_json on a JSON::Serializable type creates a fresh JSON::Lexer. Those string pools are good at reusing JSON keys in the same string, but not when deserializing multiple objects from different strings. Consider this:

require "json"

record Point, x : Int32, y : Int32 do
  include JSON::Serializable
end

10000.times do |i|
  Point.from_json %({"x":#{i},"y":#{i}})
end

This will create 10,000 StringPools, the strings x and y will each be looked up exactly once per StringPool, and all of them will miss. Recently I had profiled a Crystal application only to discover that those key strings are among the largest living objects allocated throughout program execution. IMO the standard library should do better than this.

A naive approach would be sharing one StringPool among all deserializations, say via a class variable. This means those pools won't ever be garbage-collected, and if the lexers look them up on all JSON keys, the memory would blow up whenever one encounters a Hash(String, _) field or a JSON::Serializable::Unmapped type, where JSONs with many different keys are expected to parse successfully.

Instead, what we really want is a pool of string literals for all the field names. Conceptually, it translates a Bytes to a String, with the constraint that the target strings must reside in read-only memory and not allocate:

record LiteralPool, literals : Array(String) do
  def get(bytes : Bytes)
    @literals.find &.to_slice.== bytes
  end
end

class JSON::Lexer
  # needs suitable forwarding in `StringBased` and `IOBased`
  def initialize(@literal_pool : LiteralPool)
    # ...
  end

  private def consume_string_with_buffer(&)
    # ...
    if @expects_object_key
      @token.string_value = @literal_pool.get(@buffer.to_slice) || @string_pool.get(@buffer)
    else
      @token.string_value = @buffer.to_s
    end
  end
end

class JSON::PullParser
  def initialize(input, literal_pool)
    @lexer = Lexer.new input, literal_pool
  end
end

struct Point
  # for exposition only; the `%w` would be generated by a macro
  class_getter __json_literals : LiteralPool { LiteralPool.new %w(x y) }
end

def Point.from_json(string_or_io)
  parser = JSON::PullParser.new(string_or_io, Point.__json_literals)
  new parser
end

The LiteralPool implementation above would be slower than a StringPool, but alternative data structures exist (e.g. prefix tree) that would have comparable time complexities, and some can even themselves be defined in read-only memory. (Note that Hash isn't one of them as Object#hash is never a constant expression.)

Now the StringPool would only be useful for Hash(String, _) and JSON::Serializable::Unmapped in JSONs that nonetheless have many duplicate keys. If we believe these are rare, we could make JSON::Lexer#@string_pool a lazy getter or even drop it entirely.

straight-shoota commented 1 year ago

It feels like LiteralPool could be much more generalized. But I'm wondering why it needs to be different fom StringPool? Isn't it the same behaviour except that it does not register the new string if not found? Couldn't this be solved with a separate method for this use case on StringPool?

Thinking bigger, maybe the compiler should put all literals into a pool for easy access to re-use them? Strings are designed to be immutable, so there should be no mutability issues. Literals are even write protected so if you try to mess with a pooled string, it will explode. For mutating string data, you'd need to explicitly dup it.

crysbot commented 3 months ago

This issue has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/stringpool-make-the-hash-key-part-of-the-public-api/6766/5