Closed dtolnay closed 5 years ago
Although I appreciate the proposed separation of concerns, you basically have to parse the JSON file twice, right? For example, to implement StripComments
, you can't just look at each token in isolation because you might be inside of a string literal where //
should be kept rather than discarded.
If StripComments
has to do all the work of a JSON parser, then why not just make it a lenient JSON parser in its own right rather than a preprocessor of serde_json
?
you basically have to parse the JSON file twice, right?
I don't think this is the case. The only relevant piece of state for the wrapper is whether it is inside of a string literal or not. There doesn't need to be any parsing of {
}
[
]
, or numbers, or not even escape sequences beyond \\
and \"
.
I think the amount of parsing dedicated to "
and \
will be less than the amount required to correctly handle //
and \n
and /*
*/
.
The only relevant piece of state for the wrapper is whether it is inside of a string literal or not.
I guess that's true if the JSON is well-formed. If it isn't, I suspect yielding a proper parse error becomes trickier.
But also: what about the next logical feature, which is trailing commas? I consider it strongly tied to comments because if you don't support trailing commas and you have this:
[
"foo",
"bar",
"baz"
]
and you comment a single line:
// This doesn't parse anymore!
[
"foo",
"bar",
// "baz"
]
For JSON files that may be edited by hand where comments are supported, it's pretty common to see people prefer trailing commas (even though they're optional) to make this sort of thing easier.
I would leave the error messages to serde_json rather than diagnosing syntax errors in the wrapper.
Implementation note: we should sub out the comments with spaces, rather than removing them entirely, so that serde_json's error reporting of the line and column position will accurately apply to the input data.
For trailing commas, it still only needs the "am I in a string" amount of state. If not inside of a string, then ,
+ whitespace + ]
is emitted as ]
, ,
+ whitespace + }
is emitted as }
. This amount of parsing is inherent to handling trailing commas so you are not reproducing any work from serde_json.
Sort of. For example: I would say that [,]
should not be considered valid "JSON with comments." I think the heuristics you are proposing would rewrite that as [ ]
, which would parse.
Ah, good call. If it's simpler to fork serde_json and integrate these changes into the existing parsing code, I would go ahead with that.
@dtolnay That sounds fair, thanks!
I discovered that there is also https://crates.io/crates/json5, but it depends on https://crates.io/crates/pest so it can use a PEG. I was trying to avoid additional dependencies, but perhaps I should give that a shot.
Also, https://crates.io/crates/json5 offers from_str
, but not from_reader
, which is one thing I really like about serde_json
! Nice work with your IoRead
to wrap io::Read
to facilitate that!
I put together a simple crate for this: https://crates.io/crates/json_comments
Thanks @tmccombs, this is terrific!
I filed several issues to follow up on the API and documentation.
Oh, perhaps I should have mentioned that I did end up making that fork:
https://github.com/bolinfest/serde-jsonrc
As explained in the README, it supports //
and /*
style comments as well as trailing commas.
Occasionally people ask for JSON deserialization with comments, without wanting to go all the way to JSON5 or Hjson.
This doesn't belong in serde_json which is intended strictly for JSON as described by https://json.org/.
I would like to be able to write:
Roughly the API would be: