fsprojects / FSharp.Data

F# Data: Library for Data Access
https://fsprojects.github.io/FSharp.Data
Other
806 stars 288 forks source link

A way forward with the JsonProvider #1478

Open mlaily opened 1 year ago

mlaily commented 1 year ago

Hello!

The more I use the JsonProvider and the more I work inside its code base, the more I feel it needs a big update.

I think some kind of "strict" mode with less (surprising/bad) magic is much needed.

(For instance, inferring empty strings as null made sense with other providers that don't have types other than strings, but I don't think it makes much sense with json. In the same vein, multiplicity inference for arrays doesn't feel natural at all with json...)

I also think it might be a good idea to use System.Text.Json types as the erased implementation instead of the custom parser it currently uses.

This is easier said than done of course, but first this raises a few questions.

What about backward compatibility?!
What would be the preferred approach for such a venture?

I'm thinking it might be best to create a new provider instead of trying to fit these changes into the existing one.
What's your opinion about that?

And generally speaking, do you have any recommendations regarding this?

Thanks.

Numpsy commented 1 year ago

I'm thinking it might be best to create a new provider instead of trying to fit these changes into the existing one. What's your opinion about that?

fwiw, I've seen comments elsewhere about breaking the type providers for different data types into different packages so maybe that's a place to start for that, otherwise just creating something new might help to avoid any legacy stuff altogether?

mlaily commented 1 year ago

I went ahead and started a JsonProvider2 implementation.
https://github.com/fsprojects/FSharp.Data/compare/main...mlaily:FSharp.Data:jsonprovider2?expand=1

Here are some progress screenshots and side-by-side comparisons with the original JsonProvider...

An array is an array. It does not depend on the number of items it contains: image

An empty string is still a string, not an optional value: image

0, 1, or "yes" are not booleans: image

goswinr commented 1 year ago

@mlaily I love this work! Is your JsonProvider2 working ? I guess these changes are not merging here soon. Could you publish your own nuget?

mlaily commented 1 year ago

@goswinr thanks for your interest.

This new json provider is still a prototype: there are a lot of things that are still changing or are not tested properly and can crash.

My branch is available but it's way too early for a release.

Handling null or missing values properly is proving challenging, but I'm learning a lot and making progress.

Ideally, I would also like to use System.Text.Json, but for now all my changes are based on the existing JsonValue parser.

I'll worry about how to release it when it works well enough for production use :)

dlidstrom commented 9 months ago

I just learnt about empty strings being inferred as null. Using this file as a sample:

{
  "Lines": [
    "             **",
    "              *",
    "               ",
    "               ",
    "               ",
    "               ",
    "              *",
    "              *",
    "    *****     *",
    "   *******     ",
    "    *****     *",
    "              *",
    "               ",
    "               ",
    "               ",
    "               ",
    "               "
  ]
}

Now reading this file using the provider results in an array for Lines being:

[|"             **"; "              *"; "              *"; "              *";
  "    *****     *"; "   *******     "; "    *****     *"; "              *"|]

Very surprising and interesting to me that I now find this newly created issue. Are the nuances of the JsonProvider documented somewhere? For example, I don't find any mention of null in the reference documentation.

goswinr commented 9 months ago

@dlidstrom does inferenceMode = InferenceMode.NoInference help? in type MyJson = JsonProvider<myFilePath, InferenceMode = InferenceMode.NoInference>

dlidstrom commented 9 months ago

@goswinr It doesn't seem to help. Here is my test program and files:

#r "nuget: FSharp.Data"

open FSharp.Data

type Provider = JsonProvider<"sample.json", InferenceMode = InferenceMode.NoInference>

let sample = Provider.Load "test.json"

// show the Lines from json value and from generated type
printfn
  "%A\n%A"
  (sample.JsonValue.Properties() |> Array.find (fst >> ((=) "Lines")) |> snd)
  (sample.Lines)

sample.json:

{
  "Lines": [
    ""
  ]
}

test.json:

{
  "Lines": [
    "             **",
    "              *",
    "               ",
    "               ",
    "               ",
    "               ",
    "              *",
    "              *",
    "    *****     *",
    "   *******     ",
    "    *****     *",
    "              *",
    "               ",
    "               ",
    "               ",
    "               ",
    "               "
  ]
}

It looks like the provider reads all lines, as can be seen when inspecting the JsonValue, but "provides" only a subset. Output:

PS C:\> dotnet fsi .\Scratch.fsx
[
  "             **",
  "              *",
  "               ",
  "               ",
  "               ",
  "               ",
  "              *",
  "              *",
  "    *****     *",
  "   *******     ",
  "    *****     *",
  "              *",
  "               ",
  "               ",
  "               ",
  "               ",
  "               "
]
[|"             **"; "              *"; "              *"; "              *";
  "    *****     *"; "   *******     "; "    *****     *"; "              *"|]
Thorium commented 7 months ago

Yes it seems MS puts quite an effort for System.Text.Json (also Newtonsoft would probably be faster than custom implementation).

System.Text.Json source code is open (https://github.com/dotnet/runtime/tree/main/src/libraries/System.Text.Json/src/System/Text/Json) and one of the maintainers @eiriktsarpalis has been active in F# community. I don't know if there would be any quick wins without rewriting a lot, and do we even want dependencies like System.Memory to FSharp.Data (to use Spans).

Thorium commented 7 months ago

His advice was:

just use Utf8JsonReader under wraps. That way make sure it's compliant and also take advantage of all the vectorization code it's using under the hood.

This makes sense: The current FSharp.Data parser JsonValue (see: https://github.com/fsprojects/FSharp.Data/blob/main/src/FSharp.Data.Json.Core/JsonValue.fs#L39) would actually map the System.Text.Json Utf8JsonReader values (just testing here: https://gist.github.com/Thorium/09a67ce08adee4fcc02ec7f0048e6962). So the parsing could be replaced without re-building the whole domain model.

And now that FSharp.Data is split to multiple NuGet packages, maybe a dependency of System.Text.Json to FSharp.Data.Json.Core would be acceptable?

mlaily commented 7 months ago

I haven't worked on it for a while but I actually have started to use System.Text.Json on my branch.

I opted to use the JsonNode api instead of the JsonDocument one. https://learn.microsoft.com/en-us/dotnet/standard/serialization/system-text-json/use-dom

It seemed the most fitting for a balance between (user)-convenience and performance for a general use library.
(Keep in mind we want to be able to serialize json, not only deserialize it, and IMO an F# immutable wrapper over a mutable DOM is perfect. It's also very close to what is done with the XmlProvider and I think the two of them having a similar feature set is a good idea.)


The parsing somewhat works (not for all cases yet I think), but serialization is currently non-existent.

One big problem I had that I decided to ignore for now is that I didn't manage to make the dependency on System.Text.Json work everywhere when using the compiled provider.
In my tests it works with Visual Studio Code without issue, but I can't figure out how to make it work with Visual Studio! (IIRC a project using the compiled provider referencing STJ fails to compile, complaining that it doesn't find STJ or something similar)

Another thing is my JsonWrapper object backing every provided type after erasure is currently a struct. My reasoning is that this JsonWrapper is only there because there are members on JsonNode I don't want to expose, and if I could I would use JsonNode directly as the backing type for provided types, so I want to make it have the smallest footprint possible. I'm not 100% sure it's the right call...

Anyway if you are curious I pushed my current version to my branch (still in a somewhat PoC-state):

(side note: the JsonNode design makes it hard to use properly, e.g. because it uses null to represent json null tokens, and plenty of other strange decisions, but it still seems to me like the most fitting api to use for our case)

Thorium commented 7 months ago

I'm trying to get this working. So far:

Now I'm this far:

Error   FS3021  Unexpected exception from provided type 'FSharp.Data.JsonProvider2,Sample="{\"x\": 1}"' 
member 'GetMethods': The type provider 'ProviderImplementation.JsonProvider2' reported an error: 
The design-time type 'System.Text.Json.Nodes.JsonNode' utilized by a type provider was not found in the target reference assembly 
set '[tgt assembly FSharp.Core, Version=8.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a;
 ---<cut list of assemblies...but not System.Text.Json here>---
]'. You may be referencing a profile which contains fewer types than those needed by the type provider you are using.
mlaily commented 7 months ago

I forgot to mention another interesting thing about this: if you dotnet build a test project referencing FSharp.Data containing my version based on STJ, it works, and then VS is able to build it too! (at least until you clean your bin output folder...)

Thorium commented 7 months ago

As far as I know:

System.Text.Json is released on multiple frameworks: net462 / netstandard2.0 / net6.0 / net7.0 / net8.0

TypeProvider uses the runtime-code on compile-time. Different IDEs may use different versions (e.g. VS2022 may be running on net462, meanwhile VSCode can be running on net6 or whatever).

After compilation is the easy part, because if the customer referenced the System.Text.Json to their project (or other dependencies did it), then it probably picks the correct runtime-version anyway.

Thorium commented 7 months ago

I tried to play with this version but I think it goes very complex to try to expose the System.Text.Json types on design-time.

I think what I would rather prefer, is just to use the current "obsolete" JSON-model mapping and add 2 methods:

And so, not to expose System.Text.Json types to outside at all. Design-time would work as it is, no System.Text.Json, and runtime user would have option to use faster serialization if needed. Keep backward compatibility. Just map current Json types to Utf8JsonReader types with simple match clause. And, actually, if even more independency wanted, the System.Text.Json.dll could be loaded with reflection, so that if it's missing, the methods would just throw an error, that way there would not be dependency needed to Nuget-packet level at all to the actual library.

mlaily commented 7 months ago

I think exposing the underlying JsonNode is a desirable feature. The providers are not perfect, and it's very valuable as a user to be able to get full control and bypass the provider if you need it, and if the provider is based on STJ I would expect to get control of the STJ primitives, and not just some opaque wrapper. (Same as the XmlProvider)

I don't like the idea of choosing an implementation based on a method name but without actually seeing anything else change.
Also, throwing at runtime if STJ is missing seems worse than failing at build time...

I agree adding STJ as a dependency of FSharp.Data isn't that great, but I don't see a good path around that...

Thorium commented 7 months ago

hmm, maybe the ideas are both doable..

Maybe the "faster serializer" I was thinking (not as good as raw STJ as you pointed out) could be a separate nuget package depending both fsharp.data and system.text.json. That way current old provider can stay as it is, and old projects could benefit with minimal migration.

mlaily commented 7 months ago

That would probably be ideal if we could make the providers independent yeah.

Even though I'm aware some work has already been done to go in that direction, I'm not sure how easy it will be though.