Wrapper to strip pseudo-JSON comments from an io::Read input stream

dtolnay commented 5 years ago

Occasionally people ask for JSON deserialization with comments, without wanting to go all the way to JSON5 or Hjson.

This doesn't belong in serde_json which is intended strictly for JSON as described by https://json.org/.

I would like to be able to write:

use std::error::Error;
use std::fs::File;
use std::io::BufReader;

use serde::Deserialize;
use strip_json::StripComments;

fn main() -> Result<(), Box<Error>> {
    let input = File::open("path/to/input.json")?;
    let reader = BufReader::new(input);
    let clean = StripComments::new(reader);
    let mut de = serde_json::Deserializer::from_reader(clean);

    println!("{:#?}", serde_json::Value::deserialize(&mut de));

    Ok(())
}

Roughly the API would be:

pub struct StripComments<R>;

impl<R> StripComments<R> {
    pub fn new(stream: R) -> Self;
}

impl<R: Read> Read for StripComments<R>;

bolinfest commented 5 years ago

Although I appreciate the proposed separation of concerns, you basically have to parse the JSON file twice, right? For example, to implement StripComments, you can't just look at each token in isolation because you might be inside of a string literal where // should be kept rather than discarded.

If StripComments has to do all the work of a JSON parser, then why not just make it a lenient JSON parser in its own right rather than a preprocessor of serde_json?

dtolnay commented 5 years ago

you basically have to parse the JSON file twice, right?

I don't think this is the case. The only relevant piece of state for the wrapper is whether it is inside of a string literal or not. There doesn't need to be any parsing of { } [ ], or numbers, or not even escape sequences beyond \\ and \".

I think the amount of parsing dedicated to " and \ will be less than the amount required to correctly handle // and \n and /* */.

bolinfest commented 5 years ago

The only relevant piece of state for the wrapper is whether it is inside of a string literal or not.

I guess that's true if the JSON is well-formed. If it isn't, I suspect yielding a proper parse error becomes trickier.

But also: what about the next logical feature, which is trailing commas? I consider it strongly tied to comments because if you don't support trailing commas and you have this:

[
  "foo",
  "bar",
  "baz"
]

and you comment a single line:

// This doesn't parse anymore!
[
  "foo",
  "bar",
//  "baz"
]

For JSON files that may be edited by hand where comments are supported, it's pretty common to see people prefer trailing commas (even though they're optional) to make this sort of thing easier.

dtolnay commented 5 years ago

I would leave the error messages to serde_json rather than diagnosing syntax errors in the wrapper.

Implementation note: we should sub out the comments with spaces, rather than removing them entirely, so that serde_json's error reporting of the line and column position will accurately apply to the input data.

For trailing commas, it still only needs the "am I in a string" amount of state. If not inside of a string, then , + whitespace + ] is emitted as ], , + whitespace + } is emitted as }. This amount of parsing is inherent to handling trailing commas so you are not reproducing any work from serde_json.

bolinfest commented 5 years ago

Sort of. For example: I would say that [,] should not be considered valid "JSON with comments." I think the heuristics you are proposing would rewrite that as [ ], which would parse.

dtolnay commented 5 years ago

Ah, good call. If it's simpler to fork serde_json and integrate these changes into the existing parsing code, I would go ahead with that.

bolinfest commented 5 years ago

@dtolnay That sounds fair, thanks!

I discovered that there is also https://crates.io/crates/json5, but it depends on https://crates.io/crates/pest so it can use a PEG. I was trying to avoid additional dependencies, but perhaps I should give that a shot.

Also, https://crates.io/crates/json5 offers from_str, but not from_reader, which is one thing I really like about serde_json! Nice work with your IoRead to wrap io::Read to facilitate that!

tmccombs commented 5 years ago

I put together a simple crate for this: https://crates.io/crates/json_comments

dtolnay commented 5 years ago

Thanks @tmccombs, this is terrific!

I filed several issues to follow up on the API and documentation.

bolinfest commented 5 years ago

Oh, perhaps I should have mentioned that I did end up making that fork:

https://github.com/bolinfest/serde-jsonrc

As explained in the README, it supports // and /* style comments as well as trailing commas.

dtolnay / request-for-implementation

Wrapper to strip pseudo-JSON comments from an io::Read input stream #24