RumbleDB / rumble

⛈️ RumbleDB 1.22.0 "Pyrenean oak" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
http://rumbledb.org/

JSON Schema support #927

Open jsommr opened 3 years ago

jsommr commented 3 years ago

Having a way to create custom types would enable automatic generation of, e.g., OpenAPI 3 specifications based on static code analysis.

Say we have a query like this:

declare variable $optional as string? external := ();
$optional

We know that the input is an optional string, and that string? is returned. But what if we have

declare variable $obj as object? external := ();
{ "obj": $obj }

Then we lose the rest of the type: if, say, { "test": 123 } is passed into the query, we can't analyze it and know whether it is valid.

I see there is a branch about static types, which seems to make some pretty cool improvements to the builtin types, and annotate was introduced as part of RumbleML. But are there any plans to support something like how TypeScript annotates objects and arrays, or even better, JSON Schema?

Edit: The example in listing 1.23 on this guided tour for XQuery seems like a pretty good candidate, just with JSON Schema instead:

import schema "urn:examples:xmp:bib" at "c:/dev/schemas/eg/bib.xsd"
default element namespace = "urn:examples:xmp:bib"
define function books-by-author($a as element(b:author))
as element(b:title)*
{
  for $b in doc("books.xml")/bib/book
  where some $ba in $b/author satisfies
  ($ba/last=$a/last and $ba/first=$a/first)
  order by $b/title
  return $b/title
}

And ideally, one could create the schema like this:

declare schema my-type as {
  "$schema": "http://json-schema.org/draft/2019-09/schema#",
  "type": "object",
  "properties": { ... },
  "required": [....]
}

Examples

https://example.com/address.json

{
  "$schema": "http://json-schema.org/draft/2019-09/schema#",
  "type": "object",
  "properties": {
    "street_address": {"type": "string"},
    "city": {"type": "string"},
    "state": {"type": "string"}
  },
  "required": ["street_address", "city", "state"]
}

https://example.com/query1.jq

import schema address = "https://example.com/address.json";

declare function local:lookup-addresses($query as string) as address[]
{
  fetch("https://example.com/address-lookup?query=" || uri-encode($query))
};

local:lookup-addresses("acme street")
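To make concrete what a return type such as address[] would promise, here is a minimal Python sketch that checks a value against the address schema above. It is only an illustration under stated assumptions: the helper name conforms is hypothetical, it handles just the "type", "properties" and "required" keywords used in this example, and it says nothing about how RumbleDB would implement schema import.

```python
# The address schema from https://example.com/address.json above
# ("$schema" omitted because this sketch does not inspect it).
ADDRESS_SCHEMA = {
    "type": "object",
    "properties": {
        "street_address": {"type": "string"},
        "city": {"type": "string"},
        "state": {"type": "string"},
    },
    "required": ["street_address", "city", "state"],
}


def conforms(value, schema):
    """Hypothetical helper: True if value satisfies the small subset of
    JSON Schema handled here ("type": "string"/"object", "properties",
    "required"). Real JSON Schema has many more keywords."""
    if schema.get("type") == "string":
        return isinstance(value, str)
    if schema.get("type") == "object":
        if not isinstance(value, dict):
            return False
        # Every required key must be present.
        if any(key not in value for key in schema.get("required", [])):
            return False
        # Every declared property that is present must match its sub-schema.
        return all(
            conforms(value[key], sub)
            for key, sub in schema.get("properties", {}).items()
            if key in value
        )
    # Anything outside this subset is not checked.
    return True


good = {"street_address": "1 Acme Street", "city": "Springfield", "state": "IL"}
print(conforms(good, ADDRESS_SCHEMA))
print(conforms({"test": 123}, ADDRESS_SCHEMA))
```

With a plain object? type, both values would be accepted; with the schema, the second one is rejected because the required keys are missing.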

https://example.com/person.json

{
  "$schema": "http://json-schema.org/draft/2019-09/schema#",
  "type": "object",
  "properties": {
    "first_name": {"type": "string"},
    "last_name": {"type": "string"},
    "preferred_name": {"type": "string"}
  },
  "required": ["preferred_name"]
}

https://example.com/query2.jq

import schema address = "https://example.com/address.json";
import schema person = "https://example.com/person.json";

declare function local:lookup-persons($address as address) as person[]
{
  fetch(
    "https://example.com/lookup-persons" ||
    "?street_address=" || uri-encode($address.street_address) ||
    "&city=" || uri-encode($address.city) ||
    "&state=" || uri-encode($address.state))
};

declare function local:lookup-addresses($query as string) as address[]
{
  fetch("https://example.com/address-lookup?query=" || uri-encode($query))
};

for $address in local:lookup-addresses("acme street")
return
{
  address: $address,
  persons: local:lookup-persons($address)
}

Example with definitions

https://example.com/order.json

{
  "$schema": "http://json-schema.org/draft/2019-09/schema#",

  "definitions": {
    "address": {
      "$id": "#address",
      "type": "object",
      "properties": {
        "street_address": { "type": "string" },
        "city":           { "type": "string" },
        "state":          { "type": "string" }
      },
      "required": ["street_address", "city", "state"]
    }
  },

  "type": "object",

  "properties": {
    "order_id": { "type": "string" },
    "order_lines": { .... },
    "billing_address": { "$ref": "#/definitions/address" },
    "shipping_address": { "$ref": "#/definitions/address" }
  }
}
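The "$ref" values in this schema are same-document references written as JSON Pointers. As a rough Python sketch of how an importer could resolve them (the function name resolve_local_ref is hypothetical, the order_lines property is left out because it is elided above, and array indices and remote references are not handled):

```python
# A trimmed copy of order.json: order_lines is omitted because it is
# elided in the schema above.
ORDER_SCHEMA = {
    "definitions": {
        "address": {
            "$id": "#address",
            "type": "object",
            "properties": {
                "street_address": {"type": "string"},
                "city": {"type": "string"},
                "state": {"type": "string"},
            },
            "required": ["street_address", "city", "state"],
        }
    },
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "billing_address": {"$ref": "#/definitions/address"},
        "shipping_address": {"$ref": "#/definitions/address"},
    },
}


def resolve_local_ref(document, ref):
    """Hypothetical helper: follow a same-document reference such as
    "#/definitions/address" through nested objects."""
    if not ref.startswith("#/"):
        raise ValueError("only same-document JSON Pointer references are handled")
    node = document
    for token in ref[2:].split("/"):
        # JSON Pointer escapes: "~1" stands for "/", "~0" stands for "~".
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[token]
    return node


billing_ref = ORDER_SCHEMA["properties"]["billing_address"]["$ref"]
print(resolve_local_ref(ORDER_SCHEMA, billing_ref)["required"])
```

Both billing_address and shipping_address resolve to the same address definition, which is what a type name like order.definitions.address would refer to.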

https://example.com/query3.jq

import schema order = "https://example.com/order.json";

declare function local:lookup-billing-address($order-id as string)
as order.definitions.address (: or order#address :)
{
  let $order as order := fetch-json("https://example.com/orders?order_id=" || uri-encode($order-id))
  return  $order.billing_address
};

local:lookup-billing-address("ORD-1234")

JSON Schema to JSONiq types

RumbleDB should turn types in schemas into JSONiq types, such as:

| JSONiq | JSON Schema | OpenAPI |
| --- | --- | --- |
| string | { type: "string" } | Same |
| base64Binary | { type: "string", contentEncoding: "base64" } | { type: "string", format: "byte" } |

Other constraints such as maxProperties should be ignored, and are only used by linters.
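A small Python sketch of this mapping, covering only the two rows listed above. The function name jsoniq_type is hypothetical, and any schema outside those two rows returns None rather than a guess:

```python
def jsoniq_type(schema):
    """Hypothetical helper: map a JSON Schema or OpenAPI fragment to a
    JSONiq type name, or None if the fragment is not one of the two
    cases listed in this issue."""
    if schema.get("type") != "string":
        return None
    # JSON Schema marks base64 data with contentEncoding;
    # OpenAPI uses format: "byte" for the same thing.
    if schema.get("contentEncoding") == "base64" or schema.get("format") == "byte":
        return "base64Binary"
    # Any other keyword is a constraint this mapping ignores.
    return "string"


print(jsoniq_type({"type": "string"}))
print(jsoniq_type({"type": "string", "contentEncoding": "base64"}))
print(jsoniq_type({"type": "string", "format": "byte"}))
```

Keywords that only constrain values, such as maxLength on a string, do not change the result, in line with the suggestion that such constraints be left to linters.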

OpenAPI 3 example

import schema pet-store = "https://github.com/swagger-api/swagger-petstore/blob/master/src/main/resources/openapi.yaml";

declare function local:lookup-pet($pet-id as integer)
as pet-store.components.schemas.Pet
{
  fetch-json("https://petstore3.swagger.io/api/v3/pets/" || $pet-id)
};

JSound support

If JSound schemas have a way of identifying themselves, like JSON Schema ($schema: "...") or OpenAPI (openapi: "3.0.2"), one should be able to import JSound schemas as well.
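As an illustration of that kind of self-identification, an importer could look at the top-level keys of the document. The sketch below is in Python and the function name schema_dialect is hypothetical. No JSound marker is assumed, so anything without a $schema or openapi key is reported as unknown:

```python
def schema_dialect(document):
    """Hypothetical helper: guess the schema language from the
    self-identifying top-level key, if there is one."""
    if "$schema" in document:
        return "json-schema"
    if "openapi" in document:
        return "openapi"
    # No marker found; a JSound document would need its own marker.
    return "unknown"


print(schema_dialect({"$schema": "http://json-schema.org/draft/2019-09/schema#"}))
print(schema_dialect({"openapi": "3.0.2"}))
print(schema_dialect({"id": "integer", "sentence": "string"}))
```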

ghislainfourny commented 3 years ago

Many thanks for your comments.

Indeed annotate() is only a temporary function. We are actively working on static typing as well as supporting user-defined types right now, first with JSound, and we may consider other schema languages at a later point.

So your request confirms that this is a useful feature and that the timing is good. Thank you for this!

jsommr commented 3 years ago

Just looked at some of your recently closed pull requests on this topic and I gotta say, I'm really, really excited about this!

ghislainfourny commented 3 years ago

Thank you, Jan!

The next release will include a limited user-defined-type system with the JSound compact syntax. The type names can be reused as variable/parameter types and instance of/treat as expressions.

Example:

declare type local:id-and-sentence as { "id": "integer", "sentence": "string" };
let $local-data := (
  {"id": 1, "sentence": "Hi I heard about Spark"},
  {"id": 2, "sentence": "I wish Java could use case classes"},
  {"id": 3, "sentence": "Logistic regression models are neat"}
)
let $validated-data := validate type local:id-and-sentence* { $local-data }
return $validated-data instance of local:id-and-sentence+

The functionality will then continue to be expanded in subsequent releases.