bhftbootcamp / Serde.jl

Serde is a Julia library for (de)serializing data to/from various formats. The library offers a simple and concise API for defining custom (de)serialization behavior for user-defined types
Apache License 2.0
31 stars 7 forks source link

Make dependencies optional #18

Open har7an opened 4 months ago

har7an commented 4 months ago

Hello,

thanks for creating this amazing package! I'll use it in a new project of mine to work with JSON and, being a Rust programmer, I was missing a replacement for serde until I found this. Amazing work!

I tend to be very picky about the dependencies I pull into a project and try to keep them minimal. Since I only need JSON personally, I was wondering if the rest could be left out of the package.

This is my first attempt at writing an extension in Julia. What I did is make JSON a weak dependency and create a new module in the ext/ subfolder for this purpose. Thanks to how the package is written, I pulled out the JSON-specific SerJson.jl, ParJson.jl and DeJson.jl into that new extension as Ser.jl, Par.jl and De.jl respectively and had to change barely a thing in the process.

The Serde package now has a new file JSON.jl that checks at runtime whether the JSON extension is present and exposes the functions from there to the user. Users that don't import the JSON package will not be able to serialize/deserialize objects, but neither will JSON be pulled in unconditionally by Serde. The code isn't pretty yet, I was mostly trying to see if it works and wanted to discuss the idea with you before investing more time.

If you're interested, I'd be happy to hear your feedback and convert the other "languages" into extensions as well!

Pull request checklist

har7an commented 4 months ago

Oh and here's a REPL session to prove it basically works:

julia> import Pkg;
julia> Pkg.activate(;temp=true);
julia> Pkg.develop(;path=".");
julia> Pkg.add("JSON");
julia> import Serde, JSON;
julia> x = Serde.to_json(5)
"5"

julia> @assert 5 == Serde.deser_json(Number, x)
dmitrii-doronin commented 4 months ago

Hello, @har7an!

We are big fans of Rust here, too.

Thank you for your contribution. I really like the idea of giving options for providing your own parser. Having JSON as a weak dependency is definitely a boon, too. We should include something like that in a future release.

I have only one concern though. Wouldn't it mean that we would have to keep track of compatibility with every parser option we add to ext? Maybe an overloaded function for a parser package would be a better course of action in this case?

har7an commented 4 months ago

Hi @dmitrii-doronin,

I have only one concern though. Wouldn't it mean that we would have to keep track of compatibility with every parser option we add to ext?

I'm not sure I understand what you're referring to. Serde has to keep track of the version of the parsers it relies on at the moment in any case. At the moment that's managed with the compat table in Project.toml and I see you're not checking in a Manifest.toml either.

With my changes in place, the compat table in Project.toml still exists and is considered by Pkg when performing package resolution. The only difference w.r.t. dependency management that I see is that the dependencies declared as weakdeps aren't installed unconditionally. Everything else should basically stay the same as before.

But I'm not a seasoned Julia developer, so I may have a wrong picture there. That's how I understand the situation.

Maybe an overloaded function for a parser package would be a better course of action in this case?

How exactly would that change the "keeping track of parser compatibility" bit?

har7an commented 4 months ago

Ah, and I forgot:

I really like the idea of giving options for providing your own parser.

I'm not saying we let users provide their own parsers, that is not this PRs scope (Although I guess that technically it might be possible). I'm merely allowing users to pull in the dependencies they want, but only from the set we support. So if you want to use Serde with Json, you must pull in the JSON package we depend on in the Project.toml (with the same UUID and everything). If a user wants to use Serde with some custom MyCoolJSON package, they'll still have to patch Serde to allow that, same as before.

dmitrii-doronin commented 4 months ago

Ah, I think i get you now. So in the grand scheme of things you want to make CSV, YAML, etc. optional for a user to install?

har7an commented 4 months ago

Ah, I think i get you now. So in the grand scheme of things you want to make CSV, YAML, etc. optional for a user to install?

Exactly, IMO users should "pay" only for what they really want to have in their application.

dmitrii-doronin commented 4 months ago

That's a really good idea. I'd have to look deeper into Julia extensions, since, to be fair, I have never used this feature before. But we definitely want to merge this at some point.

@gryumov cc

gryumov commented 4 months ago

Hello @har7an! Thank you for the suggestion – I really like this idea. @dmitrii-doronin, let's assess the user-friendliness of this feature and aim to propose a solution by next week.

dmitrii-doronin commented 4 months ago

propose a solution by next week.

Great, I'll look into it. @har7an does this sit well with you?

har7an commented 4 months ago

@dmitrii-doronin I think that'll work, or else I'll let you know. I should have a solution by tuesday, but probably earlier.

I'll leave the question of whether you like my "design decisions" for later then and make a usable "prototype" first. Changing how we integrate the extensions into the main thing is comparatively little effort.

har7an commented 3 months ago

Okay I added some more changes now, mostly to restore the tests back to working order. All the tests from the "JSON" group seem to pass now, but I get failures for test case 30 from deser.jl where it checks for nothing and missing. As far as I can tell it doesn't construct the expected error type (WrongType), but raises an error somewhere else instead.

My knowledge about Julia macros and the codebase isn't really big enough to determine the cause yet. Pretty much the only change I can think of that may have caused this is in Utl/Macros.jl, where I had to patch a JSON-specific macro. I left the macro there, although one may consider moving it into the JSON extension. If we do that, however, I'm not sure how we could keep the documentation intact since afaik Documenter cannot document items from foreign packages and I don't know how extensions are handled in that regard.

Next I'll work on making all the other "markup languages" extensions as well.

har7an commented 3 months ago

Moved CSV into an extension now as well.

It just occured to me: I think it would be a good idea to define the prototypes for all the to_* and parse_* functions inside the Serde package. That way we could point to them inside the Serde documentation and I assume that LSPs would pick them up as well. We could make the default impl raise a descriptive error in case the extension is missing and otherwise, transparently forward the arguments to the respective functions inside the extensions. What do you think? And should I keep going to implement the other extensions as well right now, or do JSON and CSV suffice for the user-friendliness evaluation you had in mind?

dmitrii-doronin commented 3 months ago

As far as I can tell it doesn't construct the expected error type (WrongType), but raises an error somewhere else instead.

I've seen this issue, too. It's version dependent, unfortunately. Conversion to an output type of functions seems to throw ErrorException in 1.10 instead of MethodError. Check the tests with 1.9.4.

dmitrii-doronin commented 3 months ago

It just occured to me: I think it would be a good idea to define the prototypes for all the to_* and parse_* functions inside the Serde package. That way we could point to them inside the Serde documentation and I assume that LSPs would pick them up as well.

I like the idea. We could move all the interface functions into a singular module and then have extensions expand on them. it has recently occurred to me that we're missing 'ignore_field' for queries. Extending a singular interface to modules should alleviate this.

What do you think? And should I keep going to implement the other extensions as well right now, or do JSON and CSV suffice for the user-friendliness evaluation you had in mind?

The best course I think would be to go in small incremental steps. We can workout this PR and use it for some time to see, whether it's convenient or not. Does it mean that usage of JSON extension would force user to explicitly import JSON in every use-case?

dmitrii-doronin commented 3 months ago

Btw, you can ping me in Julia official slack.

har7an commented 3 months ago

@dmitrii-doronin

Check the tests with 1.9.4.

Jup, that works, thanks. Should I investigate that as part of this PR, or do we accept the situation for now?

We could move all the interface functions into a singular module and then have extensions expand on them.

I'm not sure we mean the same thing (yet). You can have a look at src/{CSV,JSON,YAML,...}.jl to see what I have settled on for the moment. Essentially I'm collecting all args and kwargs (as varargs) and passing them through to the extension explicitly. I tried defining the raw interface (as function foo end) and then just extending that in the extensions, but this would throw errors about unknown functions when calling them. Maybe there was some "glue" missing, I don't know. My current approach of using the Ext module I introduced has the benefit that it allows us to throw a descriptive error message when the extension isn't available, giving the user instructions how to fix that. It's not ideal yet, but we can smooth that out further, of course.

Does it mean that usage of JSON extension would force user to explicitly import JSON in every use-case?

That is correct. Afaik that's also how e.g. savefig in Plots works. At least the last time I checked it required me to import FreeType, FileIO, but apart from that I don't have to do anything with these packages.

For the tests, I "solve" this by just importing all the packages at the top of runtests.jl.

Btw, you can ping me in Julia official slack.

Thanks for the offer, but I don't have Slack and don't feel like changing that. For the development here I'd prefer communication in this PR so "external" users can follow what was done and why. :)

I have now moved all the bits that need dependencies to extensions (JSON, YAML, TOML, XML, CSV). Query doesn't really qualify as an extension since it doesn't have an external dependency, so maybe we can compact the Query bits into a single file, wdyt? On the same topic: The src/{De,Ser,Par} folders are now pretty empty. We could compact/flatten their contents into files perhaps? I'll leave that decision up to you.

The package in its current state, when imported, pulls in a total of 4 dependencies:

Afaik all of these are part of the Stdlib, so they don't require external code at all. All the rest is loaded "on-demand", when the dependencies for the extensions are imported.

henrik-wolf commented 3 months ago

Hey there! I was playing around with this package a bit and came here to see if there is already some work going into making the parsers optional. Glad to see this already tackled!

From my (little) experience / what I read and heard about package extensions I would like to offer a bit of information about how they are supposed to work. I believe this could be helpful to remove the fairly complicated (and possibly hacky...) infrastructure about exposing the extension functions from the main package.

package extensions are supposed to extend on already existing functions rather than exporting new symbols. The docs have a nice example for this: https://pkgdocs.julialang.org/v1/creating-packages/#Conditional-loading-of-code-in-packages-(Extensions)

The general workflow for the extension should thus be that all functions you want to extend are defined in the main package, either as simple as

module Serde
function to_json end
function from_json end
...
end

or with a default method which throws an informative error:

module Serde
to_json(args...; kwargs...) = error("please import the JSON module...")
from_json(args...; kwargs...) = error("please import the JSON module...")
...
end

and then in the extension you add more specific methods to the respective functions:

module SerdeJSONExt

using Serde  # get access to barebones functions
using JSON

function Serde.to_json(...)
# do things
end

function Serde.from_json(...)
# do things
end

When you want add methods to a function which is owned by another package, you either have to explicity import it as import Serde.from_joson or put the owning package in front of the definition. Otherwise, you just define a new function in this module which happens to have the same name... (I ran into this multiple times in the past...)

har7an commented 3 months ago

Hi @SuperGrobi,

thanks for sharing your insights! I already noticed that the current approach doesn't really work, unfortunately. It fails for example when trying to override the serialization behavior for a custom type. So I'll try to rebase the code and then pull out more potentially interesting functions into the base package, so we only override existing things in the extension. I'll keep you posted here.

henrik-wolf commented 3 months ago

Ah. I think I see the problem now. The relevant functions are currently under Serde.SerJSON.ser_value, so you would have to define the methods in Serde.SerJSON and then overwrite them in the extension?

I just had another idea, which might or might not work with the whole package architecture... in theory it should be possible to dispatch on the modules used for serialisation:

ser_value(m::Module, ::Type{T}, ::Val{x}, v::V) where {T,x,V}) = ser_value(Val(m), T, x, v)

and then have users extend for JSON

ser_value(::Val{JSON}, ::Type{MyType}, Val{MyFieldName}, v::MyFieldType) = ...

(just an idea. this breaks if there is no external dependecy for serialisation.)

har7an commented 3 months ago

@SuperGrobi I've had the same thought already, but I'm indecisive which is better. My thought was that, since we have the "module" names included in all the public functions already (i.e. to_json, parse_json, ...), we may as well stick with how it works right now. This way it will be less surprising for users upgrading to this version (although this change probably justifies a major version increase in any case).

If I were to write the code from scratch I'd probably go for your solution, though.

henrik-wolf commented 3 months ago

Fair. That would come pretty close to a full rewrite.

har7an commented 3 months ago

JSON and YAML should be back in a working state now, the rest I'll take care of another day. I've moved the SyntaxError-variants into the public API of the Serde base package now, because I consider them to be part of the public API. This has the direct consequence, that the exception field of YamlSyntaxError is now of type Any, since YAML.ParseError would require us to pull in the YAML dependency (which is exactly what I'm trying not to do) and isn't a subtype of Exception either, unfortunately. I hope that's okay!

If you disagree with my decision please let me know, but then we'll have to introduce a different, opaque error mechanism.

har7an commented 3 months ago

So I've just tried to integrate the current state of the package with an application of mine, but it doesn't work as I expect. I'm at a loss with the dispatch mechanism and how methods for functions are actually chosen. I think that with the current design, my approach doesn't work. :(

I'll try a few more things with dispatching on modules to see whether what I had initially intended is possible at all. I'll keep you updated.

dmitrii-doronin commented 2 months ago

Hi, @har7an and @SuperGrobi. You're both doing amazing work here. I am preoccupied with other tasks right now, so I can't try tackling this issue currently. However, I know this package pretty well. So, if you need any insight into the inner workings of the package, just give me a ping here and I'll do my best to help you out with it.

dmitrii-doronin commented 2 months ago

@har7an @SuperGrobi I gave it a thought and the best course of action, IMO, is to have own parsers inside Serde. We a) are not breaking the current interface and b) allow for the extensions without major rewrites. What do you think?

har7an commented 2 months ago

Hey @dmitrii-doronin,

sorry for the delay, I got a little sidetracked the last weeks. I'm not sure about that. Doesn't that entail effectively rewriting everything that's already done in JSON.jl, YAML.jl, ... ?

Today I did a little exercise with dispatching on Val{...} and it looks like a viable approach. My idea is that we implement a "generic" to_string and from_str(ing) (and possible also to_writer and from_reader) which take a Module type as first argument in the public API. Then inside that function we convert the Module to a Val (thanks for the tip, @henrik-wolf), which we can in turn implement in the package extensions.

If an extension isn't loaded (due to e.g. the dependency not being loaded), it will throw a MethodError (dispatching fails due to unknown signature). So I'd like to catch that MethodError and then display an error message explaining the situation. I'm honestly no fan of exceptions because I don't see how you can tell where the MethodError happened, but I'll see what I can do to determine whether it failed on dispatch. In my little toy example this already (sort of) works.

Then we can decide what to do: We can either keep the current interface and redirect each of these calls. For example: to_yaml then becomes to_string(YAML, ... internally (if we say we want to stick with this interface for the time being). Or we deprecate these "old" functions via the @deprecate macro if we want to nudge users to using the new API.

I'll go ahead and implement this today to see whether it can actually work. My plan is to rewrite the YAML submodule at least (because I already have code that leverages this particular functionality). I'll let you know when it's done.

I see that quite a bit of code has been added since I last rebased. I'll first try to make this version of the code work (somehow) and then think about incorporating all the new code, I think. This is gonna be fun...

har7an commented 2 months ago

TOML and YAML are now reimplemented, the tests still work. I'll implement the rest later today and then give you a quick rundown of how it works and what it looks like.

har7an commented 2 months ago

Okay, the rebase is done. The tests all pass for me.

I added the version from this branch as dev-dependency to an existing project of mine (which also overwrites serialization for a type of mine) and I had to change a few things. The changes mostly relate to the fact that where previously we had e.g. the submodule SerYaml, we now only have the module YAML which defines all the function prototypes.

So essentially, for migration, afaict, we have to do: (Ser|De|Par)Yaml -> YAML and users will have to import the respective modules in their code (also in test code). They can then either use the previous functions (to_yaml) or the new API (to_string(YAML, ...) as they wish.

What's missing still (next to the other TODOs mentioned below) is a way to show pretty error messages like "hey there, you probably meant X, but to do that you must import Y in your code". I'm not familiar enough with exception handling to tell apart a MethodError raised from a failed dispatch immediately in the calling function from another MethodError raised further down the call stack. So if anyone can give me hints in that regard, I'd be very grateful.

TODOs

artememelin commented 1 month ago

Hi @har7an , I see that some really great work has been done on the package! I went through the changes and I like what I see!

Here are some thoughts on what we have now:

Regarding error printing, I can suggest the following:

Determine the new error type SerdeExtensionError or something like that

struct SerdeExtensionError <: Exception
    extension::Symbol
end

function Base.show(io::IO, e::SerdeExtensionError)
    return print(io, "SerdeExtensionError: to use this method, first you need to import the $(e.extension) module.")
end

Then add to the methods to_string, from_string, parse a check that the required module is imported into Main

function to_string(module_name::Symbol, args...; kwargs...)
    return if module_name in names(Main, imported = true)
        to_string(Val(module_name), args...; kwargs...)
    else
        throw(SerdeExtensionError(module_name))
    end
end

function from_string(module_name::Symbol, args...; kwargs...)
    return if module_name in names(Main, imported = true)
        from_string(Val(module_name), args...; kwargs...)
    else
        throw(SerdeExtensionError(module_name))
    end
end

function parse(module_name::Symbol, args...; kwargs...)
    return if module_name in names(Main, imported = true)
        parse(Val(module_name), args...; kwargs...)
    else
        throw(SerdeExtensionError(module_name))
    end
end

We had to make small changes to the function signature, namely, change the Module type to Symbol.

Now we have the main functions through which all other methods from other modules must pass.

Also, for this we will have to change the passed values from Val{:JSON} to :JSON inside the main methods from modules JSON, CSV, ... , like

function to_json(args...; kwargs...)
    Serde.to_string(:JSON, args...; kwargs...)
end

Regarding the documentation of new API functions:

I'm not sure if this is necessary, since it seems to me that these methods serve only as a layer between the (de)serialization/parsing methods and their implementation in the extension

henrik-wolf commented 1 month ago

Hey! @har7an thank you for the great work!

@artememelin Do we want to keep the original api with the ser_json, ser_xml... calls? If we remove them, we would get around most of the problems, as you simply will not be able to accidentally call methods for which you need the respective package... (a similar problem exists if we do the conversion of module to symbol. Nothing is stopping me to just call something with Val(:json), whereas I can not call something with Val(JSON) if I do not have the right module available.)

Similarly, adding custom serialisation and deserialisation behaviour would then simply require the user to dispatch on ser_value(Val{JSON}, ::Type{T}, ::Val{x}, v::V) where {T,x,V} rather than SerJson.ser_value(::Type{T}, ::Val{x}, v::V) where {T,x,V} = v

(I think this would allow us to remove any notion of the specific parser/writer package from the main package. This way, it would be pretty easy for a user to bring their own parser/writer package to integrate with Serde.jl)

har7an commented 1 month ago

@artememelin Thanks for the feedback!

I think that @henrik-wolf has a good point, actually. The way things currently are, we cannot call e.g. to_string for an extension that doesn't have it's required package loaded, simply because Julia will turn to_string(JSON, ...) into an "UndefVarError" exception. Turning the argument from Module to Symbol allows users to invoke e.g. to_string(:Json, ...), which looks almost right except for casing. I think that having this "freedom" in the function argument will require us to explain this at least to users new to Julia as language.

I like the proposal for the error handling though. I went ahead and modified it slightly: Since to_string/from_string/parse are pretty much fool-proof (and, as mentioned by @henrik-wolf, offer a neat interface for user extensions), I moved the error handling into each markup-specific submodule. There's now a function if_module that works as proposed by @artememelin which still uses the same symbols as before internally, but doesn't expose any of this to the user.

So now the state of affairs is as follows:

When a user calls the new to_string/... without the required module:

julia> Serde.to_string(JSON, Dict("a" => 1))
ERROR: UndefVarError: `JSON` not defined
Stacktrace:
 [1] top-level scope
   @ REPL[3]:1

When a user calls the previous JSON.to_json without the required module:

julia> Serde.JSON.to_json(Dict("a" => 1))
ERROR: SerdeExtensionError: to use this method, first you need to import the 'JSON' module.
Stacktrace:
 [1] if_module(f::Main.Serde.JSON.var"#2#3"{@Kwargs{}, Tuple{Dict{String, Int64}}}, mod::Symbol)
   @ Main.Serde /var/home/hartan/repos/github.com/bhftbootcamp/Serde.jl/src/Serde.jl:70
 [2] to_json(args::Dict{String, Int64}; kwargs::@Kwargs{})
   @ Main.Serde.JSON /var/home/hartan/repos/github.com/bhftbootcamp/Serde.jl/src/JSON.jl:109
 [3] to_json(args::Dict{String, Int64})
   @ Main.Serde.JSON /var/home/hartan/repos/github.com/bhftbootcamp/Serde.jl/src/JSON.jl:108
 [4] top-level scope
   @ REPL[4]:1

What do you think about that everyone?

Given the feedback so far I assume that the to_string/from_string/parse functions are acceptable to you. I'll go ahead (soon-ish) and document these since I think that with the current proposal, there's a lot of value to end-users w.r.t.

  1. Being able to provide your own methods/"extensions" in your codebase, having a uniform API for all things that can (de-)serialize (although this might need some more exploration first)
  2. Having a neat programmatic API to (de-)serialize to any format the users wants and that works, rather than limiting this on the code authors side or writing things like
    if target === :json
       Serde.JSON.to_json(val)
    elseif target === :yaml
       Serde.YAML.to_yaml(val)
    else
       ...
    end

Regarding the API for Query I'm not sure what to do about that. I guess it pretty-much falls into the category of "bring your own module/extension" as mentioned above. I'm sure I'll figure something out, though...

har7an commented 1 month ago

Okay I think this PR is now pretty much done from my side. I sat down and had a thought about how we could enable "arbitrary" extensions from e.g. users. You can find a demo of that in test/extensions.jl. This testfile also makes sure that the public APIs (both the previous and the one introduced here) yield the same results with a little toy example.

The Query is now also implemented using this "extension" mechanism. Since it doesn't have an external dependency, I simply implemented it on the Serde.Query module. So you call it like this:

Serde.to_string(Serde.Query, Dict("a" => 1))

The new functions are now documented and live in a file API.jl. I'm not good with names, and that one at least seemed to fit what I had in mind...

I have moved around all the other files, too:

If you want any of this undone of renamed, please let me know. Also I'd like to hear your feedback on the new code!

har7an commented 1 month ago

I tried integrating this into some code of mine and I noticed that, since we're using extensions now to provide YAML/JSON/..., users can no longer override the Ser${LANG}/${LANG}_value function at will.

I guess technically these functions have never been part of the public API (they aren't exported, nor documented, nor mentioned in the docs) so we actually achieved encapsulation in Julia. I could imagine that more users than only me sort of use this in their own code, though...

I think in the long run it would be neat to have something like a serialize_with, as Rusts serde crate has. Then users can bring their own transformation functions to modify values as they see fit before we dump them into some IO buffer. But given the size of my PR as it currently is, that's probably out of scope for here. :(

Or do we already have such a thing and I'm unaware of it?