What led me here

1) I've been using aeson at work for the past four years and I thoroughly dislike it;
2) I wrote a small pretty-printer a month ago, and it suddenly turned out aeson can't stream requests. The expected way of implementing streaming is, I guess, json-stream, a library that has a big ol' "actually we cheat a lot" disclaimer at the bottom of it;
3) There's probably lots of cool things that can be done here;
4) There's a library called json that has been dead for three years (I started on July 6th).
What this rewrite does not have
No unifying typeclass (ToJSON/FromJSON) and hence no GHC.Generics.
I enjoy handrolling instances, and the idea of "the one true instance" gets in the way constantly. Generics also have an incredibly lame issue of inheriting pair names from record fields, which are rather stringent in what they allow. Both of these are things a higher-level library can provide, hopefully in a more generalized way.
No unifying datatype (Value).
Adding something like this is completely optional and highly ambiguous depending on how much information about the JSON we wish to convey (do we remove duplicates from container values? Is whitespace critical? Should numbers be stored denormalized if they were delivered that way?). Also raw JSON already conveys all of this info by itself.
Corners explored while rewriting
attoparsec's API is... weird:
fail prepends extra text to the error message, and satisfy alters the parser context when failing.
There's peekWord8, but there's no advance. I have to anyWord8 the byte to consume it, and hope the compiler is diligent enough to compile peekWord8 >> anyWord8 into peekWord8 >> advance 1.
take and match return a StrictByteString, meaning that if the parser ever needs to backtrack ten thousand bytes, those ten thousand bytes have to be reallocated into one contiguous block of memory.
The solutions taken to address each of these are:
err is defined to replace fail. This pushes attoparsec's lower boundary up to 0.13, since that's the version in which the library exposed Data.Attoparsec.Internal.Types;
Every single parser in the rewrite consumes one byte at a time and never backtracks. The only functions used are peekWord8, peekWord8' and anyWord8;
Since matching is off the table, copying is performed by accumulating a chain of poke operations that are forced on chunk overflow.
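The shape that makes a hand-rolled err possible can be sketched without attoparsec at all. The following is not attoparsec's real internals (those live in Data.Attoparsec.Internal.Types and additionally thread buffers, positions and incremental input); it is a minimal model of the same CPS design, with hypothetical names:

```haskell
{-# LANGUAGE RankNTypes #-}

-- A minimal CPS parser of the same shape as attoparsec's internal one,
-- simplified to show why a hand-rolled 'err' is worth having.
type Failure   r = String -> Either String r
type Success a r = String -> a -> Either String r

newtype P a = P
  { runP :: forall r. String -> Failure r -> Success a r -> Either String r }

-- Mimics attoparsec's 'fail': the message gets decorated.
failP :: String -> P a
failP msg = P $ \_ lose _ -> lose ("Failed reading: " ++ msg)

-- Reaches the failure continuation directly; the message survives verbatim.
err :: String -> P a
err msg = P $ \_ lose _ -> lose msg

parse :: P a -> String -> Either String a
parse p inp = runP p inp Left (\_ a -> Right a)
```

The whole point is the last two definitions: err hands the message to the failure continuation untouched, while a fail-style combinator decorates it first, and only access to the internal types lets a library bypass that decoration.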
attoparsec-iso8601 uses Text parsers exclusively. I rewrote the functions from scratch to fit the "one byte at a time" rule described above, so it's not a big issue;
empty (from Alternative) does not apply to any decoders. I use the semigroupoids package for the Alt typeclass instead.
What this rewrite provides

Basic JSON decoding and encoding. Decoding consumes input lazily, encoding produces output lazily.

Streaming JSON decoding out of the box. For the overwhelming majority of users this will just be arrays, however streaming can be nested (see test/conformance/Test/JSON/Decode/Stream.hs for two-dimensional arrays and objects);
Four distinct ways to slice up an object:
Read the entire object, shove every key into a radix tree, then resolve the object parser. This object decoder is referred to as Plain, it's the most powerful and also the most wasteful;
Analyze the object parser for keys, read the entire object, shove every key we know we need into a radix tree, then resolve the object parser. This object decoder is referred to as Bounded and it's not a Monad. Branching is supported through Selective;
If the object parser is known to never branch: analyze the object parser for keys and, so long as they're all unique, all the decoding can be run inline. This object decoder is referred to as Linear, it's strictly Applicative.
Fold left-to-right (both a basic and a streaming variant);
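To make the "analyze the object parser for keys" step concrete, here is a hedged sketch (hypothetical names, not this library's API) of how an Applicative object decoder can be inspected for its key set before any input is consumed. A real implementation would work over bytes and a radix tree; plain Strings stand in for both here:

```haskell
{-# LANGUAGE GADTs #-}

-- A free-Applicative-style object decoder over "read this key" steps.
data Field a = Field String (String -> Maybe a)  -- key plus value decoder

data ObjDecoder a where
  Done :: a -> ObjDecoder a
  Step :: Field a -> ObjDecoder (a -> b) -> ObjDecoder b

instance Functor ObjDecoder where
  fmap f (Done a)   = Done (f a)
  fmap f (Step s k) = Step s (fmap (f .) k)

instance Applicative ObjDecoder where
  pure = Done
  Done f   <*> x = fmap f x
  Step s k <*> x = Step s (flip <$> k <*> x)

field :: String -> (String -> Maybe a) -> ObjDecoder a
field k dec = Step (Field k dec) (Done id)

-- Static analysis: every key the decoder will ever ask for,
-- computed without touching any input.
keys :: ObjDecoder a -> [String]
keys (Done _)                = []
keys (Step (Field k _) rest) = k : keys rest

-- Run against an association list standing in for a decoded object.
runObj :: [(String, String)] -> ObjDecoder a -> Maybe a
runObj _   (Done a) = Just a
runObj kvs (Step (Field k dec) rest) =
  (\v f -> f v) <$> (dec =<< lookup k kvs) <*> runObj kvs rest
```

The structure stays inspectable only as long as no step depends on a previously decoded value, which is also why a decoder like Linear is strictly Applicative and branching has to go through Selective instead of Monad.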
Two distinct ways to slice up an array:
Consume elements left to right. This array decoder is referred to as Elements and it mostly exists for weird encodings à la aeson's handling of tuples;
Fold left-to-right (both a basic and a streaming variant);
String parsing with inlined UTF-8 decoding. This caused quite a lot of code duplication, but it's also relatively easy to test rigorously, which I did;
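For illustration, here is what byte-at-a-time UTF-8 decoding of a single code point looks like in a self-contained sketch. This is not the library's code, and it is deliberately simplified: overlong forms, surrogates and the U+10FFFF ceiling are not rejected here, all of which a real decoder must handle:

```haskell
import Data.Bits ((.&.), (.|.), shiftL)
import Data.Char (chr)
import Data.Word (Word8)

-- Decode one UTF-8 code point from a byte stream, one byte at a time.
utf8One :: [Word8] -> Maybe (Char, [Word8])
utf8One [] = Nothing
utf8One (b : rest)
  | b < 0x80  = Just (chr (fromIntegral b), rest)     -- ASCII fast path
  | b < 0xC0  = Nothing                               -- stray continuation byte
  | b < 0xE0  = go (fromIntegral b .&. 0x1F) 1 rest   -- 2-byte sequence
  | b < 0xF0  = go (fromIntegral b .&. 0x0F) 2 rest   -- 3-byte sequence
  | b < 0xF8  = go (fromIntegral b .&. 0x07) 3 rest   -- 4-byte sequence
  | otherwise = Nothing
  where
    go :: Int -> Int -> [Word8] -> Maybe (Char, [Word8])
    go acc 0 bs = Just (chr acc, bs)
    go acc n (c : bs)
      | c .&. 0xC0 == 0x80 = go (shiftL acc 6 .|. fromIntegral (c .&. 0x3F)) (n - 1) bs
    go _ _ _ = Nothing
```

Inlining this into string parsing duplicates the continuation-byte logic across every place a string can appear, which is where the code duplication mentioned above comes from.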
O(n) complexity for sized number parsing. Since we know the size beforehand, integers can fail early and floats can skip extra precision;
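The early-failure idea for sized integers can be sketched like this (a hypothetical helper, not this library's API): because maxBound of the target type is known, each digit step checks for overflow before performing it, instead of building an unbounded Integer and range-checking at the end:

```haskell
import Data.Int  (Int8)   -- only used in the examples below
import Data.Word (Word8)

-- Early-failure decimal parsing for a sized, non-negative integer.
decDigits :: (Integral a, Bounded a) => [Word8] -> Maybe a
decDigits []    = Nothing
decDigits bytes = go 0 bytes
  where
    go acc []       = Just acc
    go acc (w : ws)
      | w < 0x30 || w > 0x39          = Nothing   -- not an ASCII digit
      | acc > (maxBound - d) `div` 10 = Nothing   -- next step would overflow
      | otherwise                     = go (acc * 10 + d) ws
      where d = fromIntegral (w - 0x30)
```

For example, decDigits over the bytes of "127" at type Maybe Int8 yields Just 127, while "128" fails on the final digit rather than after the fact.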
Time and UUID decoding and encoding. Both are straightforward to implement and show up often enough to be included;
Every primitive can be decoded with a skip (discarding the result) and decoded raw (input copied verbatim). For the sake of testing sanity, all whitespace is preserved in verbatim copies.
Extra things I did (or didn't)
Every declaration is documented.
None of this has been properly optimized or benchmarked. I assume it's relatively fast, simply because there's not much to do in a byte-by-byte decoder, but some algorithms are probably suboptimal;
Tests, quite a few of them;
Dependency boundaries correspond to GHC 8.8 (GHC 8.6 was the last version before the MonadFail proposal was fully implemented, and it also didn't have NumericUnderscores), with a hard requirement of text being at version 2.0 or higher (that's when it switched to UTF-8). With the text bound this high, the effective lower bound will most probably be GHC 9.4+ in practice;
Travis was replaced with CircleCI. This project compiles on GHC 8.8.4 and every latest major version since. I don't think there are any platform-dependent issues with this rewrite, as both JSON and UTF-8 are byte-order-independent.
What could probably be added
Custom parser that stores all backtrackable chunks lazily as a difference list, perhaps even using the fresh new delimited continuations. This would have to be a separate library for both sanity and performance, and thankfully attoparsec is good enough for the job as is.
Non-UTF-8 decoding/encoding. This shouldn't be hard to implement with GHC.IO.Encoding, but it's a rare side case, and a generic solution needs recursive copying, which is verbose and needs non-trivial testing;
The radix tree probably belongs in its own library, however I don't know of a proper "radix tree containers" library;
time decoding probably belongs in attoparsec-iso8601. I implement a few more functions than that library, and our two implementations only align on dependencies, so I assume it will take more than just a PR with my changes;
Number parsing ideally should use the carry flag instead of maxBound shenanigans; however, GHC.Exts only exposes that functionality for Int# and Word#, not for any sized types. While doing this with C FFI would be trivial, I don't want to pollute the library with it. Oh, and apparently even C doesn't provide direct access to the carry flag; the correct way to handle this is through lame comparisons. Color me surprised.
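For unsigned addition, the "lame comparison" in question looks like this (a sketch, not the library's code; wrap-around is the defined behavior for fixed-width words in Haskell, just as for unsigned types in C):

```haskell
import Data.Word (Word16)

-- Unsigned addition overflowed exactly when the wrapped result is
-- smaller than either operand; this is the portable stand-in for
-- reading the carry flag.
addCarry :: Word16 -> Word16 -> (Word16, Bool)
addCarry a b = (s, s < a)
  where s = a + b
```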