FasterXML / jackson-core

Core part of Jackson that defines Streaming API as well as basic shared abstractions
Apache License 2.0
2.25k stars 786 forks source link

Support Unquoted Values #723

Open ryber opened 2 years ago

ryber commented 2 years ago

In both Gson and org.json a unquoted value will be interpreted as a string. So given this array (note that apple is not quoted):

[33.5, 42, "foo", true, apple]

The resulting ArrayNode equivalent in those frameworks (JsonArray and JSONArray respectfully) will interpret it as

[33.5, 42, "foo", true, "apple"]

I have a OSS utility (Unirest-Java), that will allow users to bring their own favorite Json parsing engine. So I have an abstraction for them, and trying to make them operate with the same rules. This one I can't deal with unless Jackson were to handle it. I noticed you have a JsonReadFeature for ALLOW_UNQUOTED_FIELD_NAMES, it would be super cool if there was a ALLOW_UNQUOTED_VALUES

Thanks for all the hard work. Jackson is a fantastic library and a real backbone of the world of Java you should be really proud of it!

cowtowncoder commented 2 years ago

Yes, this would require a new option like JsonReadFeature.ALLOW_UNQUOTED_VALUES. I probably won't have time to tackle this, but maybe someone else might find time. This feature would need to be implemented for all 4 backends (Reader, InputStream, DataInput, non-blocking); some of code used for unquoted field names might prove valuable.

One thing to note is that the definition of where an unquoted value ends will probably require "interesting" heuristics -- given that such values are not legal JSON (and shame on org.json / Gson accepting them by default...), there is no definition of how such values should be decoded; choice of end markers is rather large.

micw commented 1 year ago

Hello @cowtowncoder, would you probably find the time to investigate in this feature request?

Kind regards, Michael.

cowtowncoder commented 1 year ago

@micw No I don't think I will have time to work on this anytime soon. Person who has worked on similar things lately is @pjfanning (not saying he'd have time or interest, but that he is probably quite familiar with code by now).

pjfanning commented 1 year ago

Honestly, I simply don't understand the need to support cases that are disallowed by the JSON spec - this change would be low priority for me.

cowtowncoder commented 1 year ago

@pjfanning Fair enough. I think majority of these requests are for various JSON-like variants like something called "JSON5" (see #734), and may make slightly more sense. But I agree that it's sort of slippery slope.

GedMarc commented 1 year ago

unquoted keys are one thing, unquoted values seem to border on the ridiculous - You would need to restrict all values to not have the "," character, or set the end of key-value pair value to anything.... which many, MANY, will - Then on investigation you still have to determine if its numeric, boolean, or a string value - overhead would be crazy -

Not a viable request in my opinion @micw @ryber

pjfanning commented 1 year ago

I'm not sure this is allowed by json5 - see https://json5.org/ - Strings can be single or double quoted but need some sort of quotes - are you said above, without quotes, it is hard to spot the end of the string.

In the example from the OP, you would need to trim the string - so you would probably need 2 deserialization features - one to support having no quotes and the other to support trimming the results.

micw commented 1 year ago

My use case is to process lot of json from sources not under my controll. In particular, I read statistics from "freifunk" nodes (free, community driven wifi mesh nodes) to combine it to a map and load it into grafana.

It used to happen that the script generated JSON has some missing quotes. Especially when a script that is expected to return a number returns an error message instead ;-)

I agree that this is a rare use case and the better option would be to fix the script. But on the other hand, there is an implementation with a different JSON parser (need to figure out which one) which does not fail on this issue ;-)

cowtowncoder commented 1 year ago

@GedMarc Yeah. YAML is pretty problematic when "no" means false etc.

@pjfanning Ok. I wasn't sure if this was specifically related to JSON5; good to know it is not.

cowtowncoder commented 1 year ago

@micw Thank you for sharing more details of your use case. Yes, such "JSON" should be fixed if and when it is a flow in generation and not following some JSON-like alternative format.

But just to be clear: any JSON parser that accepts such values is technically non-compliant. Jackson strictly enforces JSON specification unless instructed to allow certain deviations. So even if GSON did alllow this (does it really? :-o ), or org.json I would not consider that a strong argument for inclusion.

ryber commented 1 year ago

Yes, GSON and org.json really do this. Here are a couple of tests (BTW, I personally don't really care, I think it's silly that they actually do this, I only opened the issue originally to kind of document the discrepancy:

   @Test
    void orgJsonUnquotedvalues() {
        JSONObject o = new JSONObject("{\"foo\": bar, \"baz\": qux}");
        assertEquals("bar", o.getString("foo"));
    }

   @Test
    void gsonUnquotedValues() {
        Gson gson = new Gson();
        JsonObject e = gson.fromJson("{\"foo\": bar, \"baz\": qux}", JsonObject.class);
        assertEquals("bar", e.get("foo").getAsString());
    }
cowtowncoder commented 1 year ago

Ok thank you for confirming @ryber. Interesting. Filing an issue (be it for documentation or having option for strict handling) makes sense.

ryber commented 1 year ago

Interestingly, org.json will let you have spaces in the value, but GSON will not.

micw commented 1 year ago

Yes. Once you leave the spec path, results might be surprising ;-)

niquola commented 1 year ago

Hi, guys. This is an essential feature for us! We need identifiers in json (like symbols in cloure/EDN or variables in js):

{ 
  type: namespace.Identifier,
  subtype: Enum/Type,
  value: "...",

}

Can I somehow extend the existing Jackson (something like a hook)? Or can it only be implemented in the core?

niquola commented 1 year ago

Hi, guys. This is a very important feature for us! Can I somehow extend the existing Jackson (something like a hook)? Or can it only be implemented in the core?

Oh, I see, it is hardcoded here https://github.com/FasterXML/jackson-core/blob/8093f4348b1828b7f228d28a4027b766dc88395f/src/main/java/com/fasterxml/jackson/core/json/ReaderBasedJsonParser.java#L929

cowtowncoder commented 1 year ago

Yes, there are no hooks as this really is a low-level tokenization aspect and for performance reason that is not modular. Nor is it clear whether supporting things that are strictly against JSON spec is beneficial per se.

Having said that, there are a few other opt-in settings so if someone was to tackle this, it would be considered for inclusion.

micw commented 1 year ago

A pluggable tokenizer with implementation for "strict" and "lax" could do it and should not affect performance.

pjfanning commented 1 year ago

It might be faster to use https://github.com/marhali/json5-java and if you need to integrate with other Jackson modules, you could write a class that implements the Jackson JsonParser API and that delegates to the json5-java parser.

This could be done in a 'dataformat' module - a bit like how https://github.com/FasterXML/jackson-dataformat-xml delegates the parser/generator work to woodstox.

cowtowncoder commented 1 year ago

A pluggable tokenizer with implementation for "strict" and "lax" could do it and should not affect performance.

I don't think so: adding true pluggability itself would almost certainly add measurable overhead if done within jackson-core. I have no interest in spending time to create unnecessary abstractions (which I consider this to be).

Implementing alternate handling with switches (as is done for other aspect) is different story and does not need to add significant overhead.

As per @pjfanning's suggestion, yes, implementing alternate format backend would make sense for anyone wanting to tackle this: this is kind of pluggability Jackson already supports quite well.

ash211 commented 5 months ago

Relevant for https://github.com/predibase/lorax/issues/392, I would be interested in a Jackson option that allowed parsing this invalid JSON:

{"key": 2022-02-02}

as if it were this valid JSON:

{"key": "2022-02-02"}

by treating the first character after {"key": that did not match regex [0-9T:.+-] as the end of the string. The regex represents characters in valid ISO 8601 datetimes with time zone (might need tweaks to be 100% correct).

My first preference is to fix the code creating this invalid JSON to create valid JSON instead of using a Jackson option, but if a Jackson option did exist I would be using it in the interim until the JSON generation is fixed. Sharing as a data point.