Closed eiriktsarpalis closed 1 year ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
cc @YohDeadfall
Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis See info in area-owners.md if you want to be subscribed.
Author: | eiriktsarpalis |
---|---|
Assignees: | krwq |
Labels: | `area-System.Text.Json` |
Milestone: | 8.0.0 |
Is matching JSON.Net really a concern anymore? It'll be 5 .net versions by the time this goes out, and it seems that STJ stands on its own at this point.
Given that it is new behavior, it stands to reason that we do some degree of due diligence, so at the bare minimum any divergences are introduced intentionally rather than accidentally.
When I worked on the policies I looked at other canonical implementations in various languages and libraries, so it's not an accidental implementation. Tests are taken from externals sources too to verify compatibility. So, I guess it would be better to verify the current behavior with different popular implementations like it's in TypeScript, Rust which can be used for frontend development too, and so on.
From my Npgsql time I remember that I tweaked the code taken by @roji or someone else before just to make more expectable results than Json.NET makes. I wouldn't align the new policies with that library.
I think we should
_
, some weird F# naming, whatever we can think of where we might observe differences)then given all of the above we should decide next steps - we should really do what's most intuitive to users and that may be influenced by existing behavior or not. If it means breaking change in some obscure scenario I'm ok with that. We should have perf in mind when looking at any fixes to the current default policy - we don't want to have 5% regression in E2E scenario
I did a fuzzing run on the new naming policies, comparing their outputs with those of the equivalent Json.NET policies. It found a number of divergences primarily related to the handling of non-letter characters. More specifically (assuming JsonNamingPolicy.SnakeCaseLower
):
_foo
will be rendered foo
, a%
will be rendered a
and _?#__
will be rendered as the empty string. The Json.NET naming strategy preserves the original names in all cases and doesn't trim characters.
foo%bar
will get normalized to foo_bar
.$type
will be rendered $type
, matching the Json.NET output.__foo__bar__
will be rendered foo_bar
by STJ (with Json.NET preserving the original __foo__bar__
value).These divergences are concerning for a number of reasons:
_foo
and Foo
both normalize to foo
.While I don't have reason to believe that the Json.NET implementation is more correct than what we current have, it's probably for the best that we try to emulate it as much as possible. I'll submit a PR changing the implementation and updating/extending tests shortly.
@YohDeadfall you mentioned earlier that the current implementation was following best practices from naming policy components in other platforms. Would it be possible to share a few of these and why you think we should be emulating them?
I did a fuzzing run on the new naming policies, comparing their outputs with those of the equivalent Json.NET policies.
The world doesn't end on Microsoft and .NET with its ecosystem. While I understand that System.Text.Json
is aimed to replace JSON.NET
which is quite popular it's not the best thing ever. It's just what people are used to, but again, in case of cross platform communications things can be pretty screwed.
At the same time, non-punctuation symbols are not being trimmed, for instance the name $type will be rendered $type, matching the Json.NET output.
That might be unintentional since the original code was using Unicode text segmentation for word extraction and then applying a casing policy with a follow up concatenation.
Would it be possible to share a few of these and why you think we should be emulating them?
The research was done a while ago during my previous attempt on bringing naming policies to .NET: https://github.com/dotnet/corefx/pull/41354#issuecomment-554958476. Scroll up and down to see other relevant comments. What I can recall at the moment that my work was also inspired by hecks crate in Rust because it was one of popular naming converters non-utilizing regular expressions in contradiction to what I saw for Python and JavaScript.
I did a fuzzing run on the new naming policies, comparing their outputs with those of the equivalent Json.NET policies.
The world doesn't end on Microsoft and .NET with its ecosystem. While I understand that System.Text.Json is aimed to replace JSON.NET which is quite popular it's not the best thing ever. It's just what people are used to, but again, in case of cross platform communications things can be pretty screwed.
I would agree with that, and if there's a way we can improve interoperability with other platforms we should certainly be pursuing it. What I'm failing to understand though is how this could be achieved with a naming policy: it's a feature catering to code-first as opposed to schema-first approaches where JSON is typically either being roundtripped by the same serializer, or is called by clients explicitly using a schema that has been generated by the server's naming policy. I don't ever recall seeing a setup where, say, a Python client is calling a Java server where each peer uses its own naming policy with the expectation that the resultant JSON property names match -- that sounds like an extremely brittle configuration.
So if indeed this is a feature meant to cater to code-first .NET applications, my impression is that parity with Json.NET is much more valuable than the semantics offered by a similar feature in another platform.
Would it be possible to share a few of these and why you think we should be emulating them?
The research was done a while ago during my previous attempt on bringing naming policies to .NET: dotnet/corefx#41354 (comment).
Thanks for the link, it seems you've spent a long time studying this problem space. My takeaway from reading it is that the semantics are effectively identical to the Json.NET implementation, at least when it comes to run-of-the mill .NET member names that would get used in the 99% case (which is consistent with the findings of my fuzzing runs).
The only divergence really comes from the trimming/collapsing semantics for certain char categories. If I'm honest I can't see how this might be considered desirable behavior since (as you're pointing out in that older post) it increases the chances of collision. If its only purpose is to prettify the resultant property names then I think we can leave it out.
What Json.NET does isn't perfect. There have been a number of PRs that have contributed improvements to what Json.NET does and I haven't merged them because it's a breaking change to change property names.
Are there any specific examples you have in mind? I skimmed through the Json.NET backlog and I mostly found variations of this problem. While it would be nice to fix, I think being able to read old data persisted by Json.NET using the same model types is a valuable goal in its own right.
cc @roji who has also worked on this problem in the past.
As far as I remember @roji took the code from JSON.NET for name translations in Npgsql and then I improved it there. Then as the result of my work opened a pull request in JSON.NET which you can see linked to the issue you mentioned.
@khellang was on the same area with me previosuly, so I would call him actually.
Here's an alternative implementation that addresses https://github.com/JamesNK/Newtonsoft.Json/issues/1956 but still doesn't trim non-alphanumeric characters. It deviates from Json.NET implementation but still provides 1-1 mapping guarantees up to case insensitivity. The diff in the test file should highlight the semantic differences from Json.NET.
While searching for answer on a different question in the first pull request, I accidentally stopped scrolling on https://github.com/dotnet/corefx/pull/41354#issuecomment-554077827 which highlights that one of popular JavaScript name converters ignores punctuation while a package from Python keeps it.
That makes me think why don't support both behaviors which differ only by a switch in the implementation? That can be really valuable for all consumers and stop bike shedding around since there's no standard.
In my personal opinion I would go with trimming just because frontend development is done in JavaScript or TypeScript and interoperability without a headache is crucial here.
We can consider a offering a switch in the future, but there's no time to squeeze in new APIs for .NET 8. For now we need to make sure that we ship good defaults in the APIs already approved, and I think not skipping characters is a good default.
We should probably add a few tests validating that the output matches that of the Json.NET implementation, particularly when it comes to nontrivial inputs. We have time until .NET 8 ships, so we can always introduce breaking changes if needed.
_Originally posted by @eiriktsarpalis in https://github.com/dotnet/runtime/pull/69613#discussion_r1001603922_