OptimumCode / json-schema-validator

The JSON schema validation library that works with https://github.com/Kotlin/kotlinx.serialization
MIT License
33 stars 3 forks source link

Enhancement: Support validating kaml's `YamlNode` #190

Open Vampire opened 1 month ago

Vampire commented 1 month ago

Is there an existing issue for this?

Enhancement description

Like the built-in JSON format, there is https://github.com/charleskorn/kaml for parsing YAML using KxS. Like the JsonElement of the JSON format, there is YamlNode there. Would be very nice if you could support validating those too.

Sub-tasks

OptimumCode commented 1 month ago

Hi, @Vampire. I appreciate your interest in the library.

I had a similar use case where I had YAML files and JSON schema and needed to validate those YAML files (and it was a multiplatform project). I ended up converting YamlNode to JsonElement (will leave the code here just for reference)

YamlNode to JsonElement ```kotlin private fun YamlNode.toJsonNode(): JsonElement = when (this) { is YamlList -> buildJsonArray { items.forEach { add(it.toJsonNode()) } } is YamlMap -> buildJsonObject { entries.forEach { (key, value) -> put(key.content, value.toJsonNode()) } } is YamlNull -> JsonNull is YamlScalar -> content .toBooleanStrictOrNull() ?.let(::JsonPrimitive) ?: content .toDoubleOrNull() ?.let(::JsonPrimitive) ?: JsonPrimitive(content) is YamlTaggedNode -> innerNode.toJsonNode() } ```

I need to think about how to make this possible without breaking the API much and without ending up with two different paths for JsonElement and YamlNode. And probably, it also should be possible to load JSON schema defined in YAML format - I saw some cases like that

Vampire commented 1 month ago

Actually I tried to do something similar but hit the problem that YamlScalar lost type information, so you for example don't know whether you had 1 or '1' in the YAML. :-(

Maybe one viable option could be to slightly broaden the scope of this lib and support validating Kotlin object tree of Map, List, and so on.

What I do right now is using https://github.com/krzema12/snakeyaml-engine-kmp (which kaml uses under the hood too) to parse/load the file, then transform the result to JsonElement and give it to your lib.

But would be awesome if I could give the load result directly even if it is not coming from KxS.

You could either transform to JsonElement and use a common code path, or transform JsonElement to the stdlib classes and use those in the common code path. That would also allow to make the JSON dependency optional / a feature variant.

OptimumCode commented 1 month ago

the problem that YamlScalar lost type information, so you for example don't know whether you had 1 or '1' in the YAML

I think the problem might be partially solved if the kaml (and probably SnakeKMP - I think it already does) would expose the information about ScalarStyle and block style. In this case, there would be a way to identify if the scalar is 100% a string when it has the ScalarStyle other than Plain or has a block style (am I missing something here?).

But I have some doubts about block style. For example, is it a valid number?

key: >
  10

However, it will also require providing some API to ask the element: can you be a number? can you be a boolean?

This part should be implemented here, inside the validation library, - currently, it relies on isString property for the JsonPrimitive class to check whether it can be a number/boolean and only after that tries to parse the value as the corresponding type.

If my understanding is correct, for a YAML scalar it would be like:

  1. Am I quoted? Then I am a string
  2. Do I have a block style? Then I am a string
  3. Can I be converted to number/boolean? Then I am a number/boolean
  4. Otherwise, I am a string

@Vampire what do you think of the above? Am I missing something? @krzema12 I would much appreciate you could also take a look at the above - you should have more experience in YAML than I do

krzema12 commented 1 month ago

I'm not an expert in YAML (although I tend to work on projects touching YAML 😅). From my experience, when I was trying to fix various YAML-related issues, I played with various modes of emitting the values in kaml/snakeyaml-engine. That's why I'm afraid certain assumptions may be valid for most cases, but invalid for some edge cases.

Unfortunately I don't have enough capacity to dive deeper into this problem now, sorry! Maybe @aSemy has some thoughts as he ported a much bigger portion of snakeyaml-engine to snakeyaml-engine-kmp.

Vampire commented 1 month ago

kaml would probably need isString exposed on YamlScalar, which it could get from snakeyaml-engine-kmp already. kaml's maintainer would be open to a PR adding that according to https://github.com/charleskorn/kaml/issues/303#issuecomment-1176094242. Then the YamlScalar could probably be transformed correctly to JsonPrimitive or the stdlib types.

For now I switched to using snakeyaml-engine-kmp directly to parse the YAML and then transforming the result to JsonElement, though it would of course be nice to just give in the parse result as mentioned above. :-) The parsing is done using Load().loadOne(types).toJsonElement() with toJsonElement being

private fun Any?.toJsonElement(): JsonElement {
    return when (this) {
        is Map<*, *> -> JsonObject(entries.associate { (key, value) -> "$key" to value.toJsonElement() })
        is List<*> -> JsonArray(map { it.toJsonElement() })
        is Set<*> -> JsonArray(map { it.toJsonElement() })
        is Boolean -> JsonPrimitive(this)
        is Number -> JsonPrimitive(this)
        is String -> JsonPrimitive(this)
        is ByteArray -> JsonPrimitive(encodeBase64())
        null -> JsonNull
        else -> error("Unexpected type: ${this::class.qualifiedName}")
    }
}

I've sent the whole yaml-test-suite through this and all YAML files that did not produce a parse error were transformed to JsonElement without "Unexpected type" error.

Coming from snakeyaml-engine-kmp, the Number instances are only either Int or Double.

The only thing in YAML it seems that has no match in JSON is the byte array which in JSON then just is Base64 encoded String according to the yaml-test-suite.

So if you would indeed go the "slightly broaden scope" approach mentioned above, the types from my when should be the ones you would check instead of the Json... types and what you would transform the Json... types and Yaml... types to for the common codepath. :-)

OptimumCode commented 1 month ago

UPD: corrected the links

Hi, @Vampire. I would much appreciate it if you had a chance to look at a few files in the PR from an end-user perspective - I am not the one who uses the library and might be missing something:

I think that should be enough to support YAML (once kaml exposes everything needed) and also to validate the result of SnakeKMP parser (from your example above).

OptimumCode commented 1 month ago

To be honest, I don't see any actual way to support the Kaml at the moment - an attempt to add isString property to scalar breaks backward compatibility (I did not figure out an option that will not break any code...)

What I can do is create a new module in this repo with its own YamlNode hierarchy - does not look like a lot of code if I use SnakeKMP as a parser. In this case, all required information will be provided and validation will be possible.

Alternative way (the only one I can see right now) - rework the YamlNode in Kaml project:

This will definitely break someone else's code but after that, it should be much simpler to make changes to the YamlNodes

OptimumCode commented 1 month ago

The abstraction interface will be available in 0.3.0-SNAPSHOT version

Vampire commented 1 month ago

Hi, @Vampire. I would much appreciate it if you had a chance to look at a few files in the PR from an end-user perspective - I am not the one who uses the library and might be missing something:

I've had a cursory look. I'm not a Kotlin expert, so maybe I just don't have the right perspective how you build good and idiomatic API in Kotlin. Actually, to me it seems strange that null is represented by PrimitiveElement. Also, that you have val number: Number? but then semantically restricting it to only return a Long or Double or null. Also why at all is there content and isNull and contentOrNull? If you want to represent null with PrimitiveElement, why not simply making content nullable and removing the other two? I also wonder why ObjectElement does not simply extend Map<String, AbstractElement>, but you might have your reasons. Or why ArrayElement is not extending Array<AbstractElement> or Iterable<AbstractElement> or whatever is appropriate.

But again, I'm not a Kotlin expert and probably just don't know how you idiomatically design an API in Kotlin. Maybe @krzema12 has better feedback. 🤷‍♂️

To be honest, I don't see any actual way to support the Kaml at the moment - an attempt to add isString property to scalar breaks backward compatibility (I did not figure out an option that will not break any code...)

Well, yeah, that's the price you have to pay when you put data classes into your API. But as the maintainer was open to have isString added, maybe he is ok with it being a breaking change? Actually, you usually can do it in a non-breaking way, it is just not really convenient. You need to add it as last constructor argument so that the componentX methods do not break behaviourally or binary, add a secondary constructor with the signature of the old constructor, and add a copy method with the old signature. All this voids the benefits you get from a data class and thus it is a bad idea to have it in your API.

rework the YamlNode in Kaml project

Well, you have to talk to kaml maintainer what he would prefer. :-) Maybe he also simply is fine with a breaking change, who knows.

OptimumCode commented 1 month ago

Thank you for taking the time to check the changes @Vampire.

Also, that you have val number: Number? but then semantically restricting it to only return a Long or Double or null

The more I look at this property the more I agree with you. Looks a bit strange. The original idea was to have longOrNull and doubleOrNull properties and use them internally to achieve the same behaviour as provided by number property. Probably, I should return to that original idea (while the interface has experimental annotation) - at least it is not as confusing as the current option.

Also why at all is there content and isNull and contentOrNull? If you want to represent null with PrimitiveElement, why not simply making content nullable and removing the other two?

Did not think about it. Maybe you are right and it will look better (1 property instead of 2)

Actually, to me it seems strange that null is represented by PrimitiveElement.

The idea was to make it similar to JsonElement hierarchy - it consists of 3 main classes JsonObject, JsonArray and JsonPrimiteve. JsonPrimitive consists of JsonNull and JsonLiteral but this additional hierarchy seemed to be a bit redundant here - having the additional class requires additional checks in most of the JSON schema assertions. So, I decided to hide null and primitive values behind a single type

I also wonder why ObjectElement does not simply extend Map<String, AbstractElement>, but you might have your reasons.

The only reason for that is Map interface has 8 methods/properties that should be implemented (3 of them are not needed at all).

Or why ArrayElement is not extending Array or Iterable or whatever is appropriate.

Unfortunately, the Array class is final. Iterable interface is exactly the same as Sequence interface (at least at the moment). I chose the newer one but ArrayElement can probably extend both interfaces - it shouldn't be a problem

With regards to the Kaml - the main problem is the functional compatibility. By adding a new property to the data class you also change the equals and hashCode methods. It means that if two YamlNodes were equal before they might not be equal after that change (in Kaml itself 27 tests fail if I use a default value true or 34 tests fail if I use false - https://github.com/charleskorn/kaml/issues/303#issuecomment-2322885579)

Vampire commented 1 month ago

Ah, I see. Well yeah, before the scalar node did not know whether something originally was string or not, so naturally things were equal that are not equal anymore as the type does no longer loose that information. I guess that is and has to be accepted when adding isString property which the maintainer was open to. 🤷‍♂️ We'll see what the owner answers.

OptimumCode commented 1 month ago

Please, be aware that the number property was split into two properties: longOrNull and doubleOrNull

Vampire commented 2 weeks ago

Oh, I just found another potential sub-task / integration. :-) SnakeYaml-KMP can not only give List, Set, and so on. It also has the class Compose with which you can instead parse into a node hierarchy similar to KAMLs nodes, but with proper information I think, at least this suggests it: https://groovyconsole.dev/?g=groovy_4_0&codez=eNqVjLsOwjAMRfd8RdTFjdRaPDbEgMTAwsYIDAEsFJo4VZqC4OtJ1YLKiBff5Pjc1SboUw4mYhVe5Aw3lVk0rCt6amdL4qthKitXl7e7W8xxgjNQwrjahyh_LPxa2FuYLNS1Qesflu5kce2T19Cf-tbry45iNHxthGB6yKEnHxM8tcZeKOSqT7lSeB7uAEDINKWcDjubZn0a3txa-0FdzkRyFEa_iyG1p9JAtdXn1LWHQsL-wN01jP6LZQeKHizH5NiBAx9BvQFWXXOk :-)

OptimumCode commented 2 weeks ago

Thank you for sharing this with me @Vampire. I think this should be a part of this sub-task: https://github.com/OptimumCode/json-schema-validator/issues/195