Validate that the new snake/kebab case naming policies match the Json.NET implementation.

eiriktsarpalis commented 1 year ago

We should probably add a few tests validating that the output matches that of the Json.NET implementation, particularly when it comes to nontrivial inputs. We have time until .NET 8 ships, so we can always introduce breaking changes if needed.

_Originally posted by @eiriktsarpalis in https://github.com/dotnet/runtime/pull/69613#discussion_r1001603922_

dotnet-issue-labeler[bot] commented 1 year ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

eiriktsarpalis commented 1 year ago

cc @YohDeadfall

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis See info in area-owners.md if you want to be subscribed.

Issue Details

We should probably add a few tests validating that the output matches that of the Json.NET implementation, particularly when it comes to nontrivial inputs. We have time until .NET 8 ships, so we can always introduce breaking changes if needed. _Originally posted by @eiriktsarpalis in https://github.com/dotnet/runtime/pull/69613#discussion_r1001603922_

Author:	eiriktsarpalis
Assignees:	krwq
Labels:	`area-System.Text.Json`
Milestone:	8.0.0

gregsdennis commented 1 year ago

Is matching JSON.Net really a concern anymore? It'll be 5 .net versions by the time this goes out, and it seems that STJ stands on its own at this point.

eiriktsarpalis commented 1 year ago

Given that it is new behavior, it stands to reason that we do some degree of due diligence, so at the bare minimum any divergences are introduced intentionally rather than accidentally.

YohDeadfall commented 1 year ago

When I worked on the policies I looked at other canonical implementations in various languages and libraries, so it's not an accidental implementation. Tests are taken from externals sources too to verify compatibility. So, I guess it would be better to verify the current behavior with different popular implementations like it's in TypeScript, Rust which can be used for frontend development too, and so on.

From my Npgsql time I remember that I tweaked the code taken by @roji or someone else before just to make more expectable results than Json.NET makes. I wouldn't align the new policies with that library.

krwq commented 1 year ago

I think we should

write down some interesting cases (common usage scenarios, cases with _, some weird F# naming, whatever we can think of where we might observe differences)
show how our implementations behave - are word splitting rules consistent, what are the differences (for the recently added we can just show 1 of them since they have exact same rules)
show how NSJ behaves, is it consistent between its own implementation? Is it behaving same as ours? When does it behave differently

then given all of the above we should decide next steps - we should really do what's most intuitive to users and that may be influenced by existing behavior or not. If it means breaking change in some obscure scenario I'm ok with that. We should have perf in mind when looking at any fixes to the current default policy - we don't want to have 5% regression in E2E scenario

eiriktsarpalis commented 1 year ago

I did a fuzzing run on the new naming policies, comparing their outputs with those of the equivalent Json.NET policies. It found a number of divergences primarily related to the handling of non-letter characters. More specifically (assuming JsonNamingPolicy.SnakeCaseLower):

Certain non-letter characters are being trimmed, for instance the name _foo will be rendered foo, a% will be rendered a and _?#__ will be rendered as the empty string. The Json.NET naming strategy preserves the original names in all cases and doesn't trim characters.
- This doesn't seem to happen if these characters lie between words, for example foo%bar will get normalized to foo_bar.
- At the same time, non-punctuation symbols are not being trimmed, for instance the name $type will be rendered $type, matching the Json.NET output.
Multiple occurrences of the separator character between words will be normalized to a single occurrence. For example, the name __foo__bar__ will be rendered foo_bar by STJ (with Json.NET preserving the original __foo__bar__ value).

These divergences are concerning for a number of reasons:

Some of these examples are valid .NET member identifiers and could be expected to be part of the JSON contract of a POCO.
Naming policies can also be applied to dictionary key deserialization, where all non-null strings are valid keys.
The trimming behavior could result in key collisions that fail deserialization or overwrite data in unexpected ways. For instance the names _foo and Foo both normalize to foo.
Divergences in the policy would make STJ deserialization of data persisted by other Json.NET applications more difficult to achieve.

While I don't have reason to believe that the Json.NET implementation is more correct than what we current have, it's probably for the best that we try to emulate it as much as possible. I'll submit a PR changing the implementation and updating/extending tests shortly.

@YohDeadfall you mentioned earlier that the current implementation was following best practices from naming policy components in other platforms. Would it be possible to share a few of these and why you think we should be emulating them?

YohDeadfall commented 1 year ago

I did a fuzzing run on the new naming policies, comparing their outputs with those of the equivalent Json.NET policies.

The world doesn't end on Microsoft and .NET with its ecosystem. While I understand that System.Text.Json is aimed to replace JSON.NET which is quite popular it's not the best thing ever. It's just what people are used to, but again, in case of cross platform communications things can be pretty screwed.

At the same time, non-punctuation symbols are not being trimmed, for instance the name $type will be rendered $type, matching the Json.NET output.

That might be unintentional since the original code was using Unicode text segmentation for word extraction and then applying a casing policy with a follow up concatenation.

Would it be possible to share a few of these and why you think we should be emulating them?

The research was done a while ago during my previous attempt on bringing naming policies to .NET: https://github.com/dotnet/corefx/pull/41354#issuecomment-554958476. Scroll up and down to see other relevant comments. What I can recall at the moment that my work was also inspired by hecks crate in Rust because it was one of popular naming converters non-utilizing regular expressions in contradiction to what I saw for Python and JavaScript.

eiriktsarpalis commented 1 year ago

I did a fuzzing run on the new naming policies, comparing their outputs with those of the equivalent Json.NET policies.

The world doesn't end on Microsoft and .NET with its ecosystem. While I understand that System.Text.Json is aimed to replace JSON.NET which is quite popular it's not the best thing ever. It's just what people are used to, but again, in case of cross platform communications things can be pretty screwed.

I would agree with that, and if there's a way we can improve interoperability with other platforms we should certainly be pursuing it. What I'm failing to understand though is how this could be achieved with a naming policy: it's a feature catering to code-first as opposed to schema-first approaches where JSON is typically either being roundtripped by the same serializer, or is called by clients explicitly using a schema that has been generated by the server's naming policy. I don't ever recall seeing a setup where, say, a Python client is calling a Java server where each peer uses its own naming policy with the expectation that the resultant JSON property names match -- that sounds like an extremely brittle configuration.

So if indeed this is a feature meant to cater to code-first .NET applications, my impression is that parity with Json.NET is much more valuable than the semantics offered by a similar feature in another platform.

Would it be possible to share a few of these and why you think we should be emulating them?

The research was done a while ago during my previous attempt on bringing naming policies to .NET: dotnet/corefx#41354 (comment).

Thanks for the link, it seems you've spent a long time studying this problem space. My takeaway from reading it is that the semantics are effectively identical to the Json.NET implementation, at least when it comes to run-of-the mill .NET member names that would get used in the 99% case (which is consistent with the findings of my fuzzing runs).

The only divergence really comes from the trimming/collapsing semantics for certain char categories. If I'm honest I can't see how this might be considered desirable behavior since (as you're pointing out in that older post) it increases the chances of collision. If its only purpose is to prettify the resultant property names then I think we can leave it out.

JamesNK commented 1 year ago

What Json.NET does isn't perfect. There have been a number of PRs that have contributed improvements to what Json.NET does and I haven't merged them because it's a breaking change to change property names.

eiriktsarpalis commented 1 year ago

Are there any specific examples you have in mind? I skimmed through the Json.NET backlog and I mostly found variations of this problem. While it would be nice to fix, I think being able to read old data persisted by Json.NET using the same model types is a valuable goal in its own right.

eiriktsarpalis commented 1 year ago

cc @roji who has also worked on this problem in the past.

YohDeadfall commented 1 year ago

As far as I remember @roji took the code from JSON.NET for name translations in Npgsql and then I improved it there. Then as the result of my work opened a pull request in JSON.NET which you can see linked to the issue you mentioned.

YohDeadfall commented 1 year ago

@khellang was on the same area with me previosuly, so I would call him actually.

eiriktsarpalis commented 1 year ago

Here's an alternative implementation that addresses https://github.com/JamesNK/Newtonsoft.Json/issues/1956 but still doesn't trim non-alphanumeric characters. It deviates from Json.NET implementation but still provides 1-1 mapping guarantees up to case insensitivity. The diff in the test file should highlight the semantic differences from Json.NET.

YohDeadfall commented 1 year ago

While searching for answer on a different question in the first pull request, I accidentally stopped scrolling on https://github.com/dotnet/corefx/pull/41354#issuecomment-554077827 which highlights that one of popular JavaScript name converters ignores punctuation while a package from Python keeps it.

That makes me think why don't support both behaviors which differ only by a switch in the implementation? That can be really valuable for all consumers and stop bike shedding around since there's no standard.

In my personal opinion I would go with trimming just because frontend development is done in JavaScript or TypeScript and interoperability without a headache is crucial here.

eiriktsarpalis commented 1 year ago

We can consider a offering a switch in the future, but there's no time to squeeze in new APIs for .NET 8. For now we need to make sure that we ship good defaults in the APIs already approved, and I think not skipping characters is a good default.

dotnet / runtime

Validate that the new snake/kebab case naming policies match the Json.NET implementation. #77309