Lancaster

Installation

Using Leiningen / Clojars:

Clojars Project

About

Lancaster is an Apache Avro library for Clojure and ClojureScript. It aims to be fully compliant with the Avro Specification. It is assumed that the reader of this documentation is familiar with Avro and Avro terminology. If this is your first exposure to Avro, please read the Avro Overview and the Avro Specification before proceeding.

Lancaster provides for:

Lancaster does not support:

Project Name

The Avro Lancaster was an airplane manufactured by Avro Aircraft.

Examples

Here is an introductory example of using Lancaster to define a schema, serialize data, and then deserialize it.

(require '[deercreeklabs.lancaster :as l])

(l/def-record-schema person-schema
  [:name l/string-schema]
  [:age l/int-schema])

(def alice
  {:name "Alice"
   :age 40})

(def encoded (l/serialize person-schema alice))

(l/deserialize person-schema person-schema encoded)
;; {:name "Alice" :age 40}

Here is a more complex example using nested schemas:

(require '[deercreeklabs.lancaster :as l])

(l/def-enum-schema hand-schema
  :left :right)

(l/def-record-schema person-schema
  [:name l/string-schema]
  [:age l/int-schema]
  [:dominant-hand hand-schema]
  [:favorite-integers (l/array-schema l/int-schema)]
  [:favorite-color l/string-schema])

(def alice
  {:name "Alice"
   :age 40
   :favorite-integers [12 59]
   :dominant-hand :left})
;; :favorite-color is missing. Record fields are optional by default.

(def encoded (l/serialize person-schema alice))

(l/deserialize person-schema person-schema encoded)
;; {:name "Alice", :age 40, :dominant-hand :left, :favorite-integers [12 59], :favorite-color nil}

Creating Schema Objects

Lancaster schema objects are required for serialization and deserialization. These can be created in two ways:

  1. Using an existing Avro schema in JSON format. To do this, use the json->schema function. This is best if you are working with externally defined schemas from another system or language.
  2. Using Lancaster schema functions and/or macros. This is best if you want to define Avro schemas using Clojure/ClojureScript. Lancaster lets you concisely create and combine schemas in arbitrarily complex ways, as explained below.
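
As a rough illustration of the two approaches, the sketch below builds a similar record schema both ways. The Point record and its fields are made up for this example; only json->schema, def-record-schema, and schema? come from the Lancaster API described in this document.

```clojure
(require '[deercreeklabs.lancaster :as l])

;; 1. From an existing Avro schema in JSON format
;;    (the Point record is a hypothetical example).
(def point-from-json-schema
  (l/json->schema
   (str "{\"name\":\"Point\",\"type\":\"record\",\"fields\":"
        "[{\"name\":\"x\",\"type\":\"int\",\"default\":-1},"
        "{\"name\":\"y\",\"type\":\"int\",\"default\":-1}]}")))

;; 2. Using a Lancaster schema creation macro.
(l/def-record-schema point-schema
  [:x l/int-schema]
  [:y l/int-schema])

;; Both results are Lancaster schema objects.
(l/schema? point-from-json-schema)
(l/schema? point-schema)
```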

Primitive Schemas

Lancaster provides predefined schema objects for all the Avro primitives. The following vars are defined in the deercreeklabs.lancaster namespace:

These schema objects can be used directly or combined into complex schemas.
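
A minimal sketch of both uses, assuming only the l/string-schema and l/int-schema vars that appear elsewhere in this document. The var names encoded-greeting, decoded-greeting, and int-list-schema are illustrative.

```clojure
(require '[deercreeklabs.lancaster :as l])

;; Direct use: serialize a bare string with the predefined string schema.
(def encoded-greeting (l/serialize l/string-schema "hello"))

;; The same schema serves as both reader and writer schema here.
(def decoded-greeting
  (l/deserialize l/string-schema l/string-schema encoded-greeting))

;; Combined use: a primitive schema as the items schema of an array.
(def int-list-schema (l/array-schema l/int-schema))
```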

Complex Schemas

Most non-trivial Lancaster use cases will involve complex Avro schemas. The easiest and most concise way to create complex schemas is by using the Schema Creation Macros. For situations where macros do not work well, the Schema Creation Functions are also available.
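
One situation where the functions may fit better than the macros is building schemas at runtime. The sketch below assumes a hypothetical helper, page-schema, that wraps any item schema in a record; the helper and its field names are not part of Lancaster.

```clojure
(require '[deercreeklabs.lancaster :as l])

;; Hypothetical helper: builds a "page of results" record schema for any
;; item schema. A def-*-schema macro would not fit here, because the name
;; and the item schema are only known when the function is called.
(defn page-schema [name-kw item-schema]
  (l/record-schema name-kw
                   [[:items (l/array-schema item-schema)]
                    [:total-count l/int-schema]]))

(def int-page-schema (page-schema :int-page l/int-schema))
(def string-page-schema (page-schema :string-page l/string-schema))
```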

Schema Creation Macros

Schema Creation Functions

Operations on Schema Objects

All of these functions take a Lancaster schema object as the first argument:

Data Types

Serialization

When serializing data, Lancaster accepts the following Clojure(Script) types for the given Avro type:

| Avro Type | Acceptable Clojure / ClojureScript Types |
| --- | --- |
| null | nil |
| boolean | boolean |
| int | int, java.lang.Integer, long (if in integer range), java.lang.Long (if in integer range), js/Number (if in integer range) |
| long | long, java.lang.Long |
| float | float, java.lang.Float, double (if in float range), java.lang.Double (if in float range), js/Number (if in float range) |
| double | double, java.lang.Double, js/Number |
| bytes | byte-array, java.lang.String, js/Int8Array, js/String |
| string | byte-array, java.lang.String, js/Int8Array, js/String |
| fixed | byte-array, js/Int8Array. Byte array length must equal the size declared in the creation of the Lancaster fixed schema. |
| enum | Simple (non-namespaced) keyword |
| array | Any data that passes (sequential? data) |
| map | Any data that passes (map? data), if all keys are strings. Clojure(Script) records DO NOT qualify, since their keys are keywords, not strings. |
| map (w/ null values schema) | If the values schema of the map schema is null, the schema is interpreted as representing a Clojure(Script) set, and the data must be a set of strings. Only strings can be elements of this set. |
| record | Any data that passes (map? data), if all keys are Clojure(Script) simple (non-namespaced) keywords. Clojure(Script) records DO qualify, since their keys are keywords. |
| union | Any data that matches one of the member schemas declared in the creation of the Lancaster union schema. Note that there are some restrictions on what schemas may be in a union schema, as explained in Notes About Union Data Types below. |
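
The sketch below exercises a few of the rows above: a plain map and a Clojure record as record data, a string-keyed map as Avro map data, and a list as array data. The Pet record type and the var names are hypothetical.

```clojure
(require '[deercreeklabs.lancaster :as l])

(l/def-record-schema pet-schema
  [:name l/string-schema]
  [:age l/int-schema])

;; Hypothetical Clojure record type; its keys are simple keywords.
(defrecord Pet [name age])

;; Both a plain map and a Clojure record are acceptable record data.
(def pet-from-map (l/serialize pet-schema {:name "Rex" :age 3}))
(def pet-from-record (l/serialize pet-schema (->Pet "Rex" 3)))

;; Avro maps need string keys; keyword keys are for records only.
(def name-to-age-schema (l/map-schema l/int-schema))
(def ages (l/serialize name-to-age-schema {"Rex" 3 "Fido" 5}))

;; Any sequential collection works for an array, for example a list.
(def numbers (l/serialize (l/array-schema l/int-schema) '(1 2 3)))
```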

Deserialization

When deserializing data, Lancaster returns the following Clojure or ClojureScript types for the given Avro type:

| Avro Type | Clojure Type | ClojureScript Type |
| --- | --- | --- |
| null | nil | nil |
| boolean | boolean | boolean |
| int | java.lang.Integer | js/Number |
| long | java.lang.Long | goog.math.Long |
| float | java.lang.Float | js/Number |
| double | java.lang.Double | js/Number |
| bytes | byte-array | js/Int8Array |
| string | java.lang.String | js/String |
| fixed | byte-array | js/Int8Array |
| enum | keyword | keyword |
| array | vector | vector |
| map | hash-map | hash-map |
| map (w/ null values schema) | set (w/ string elements) | set (w/ string elements) |
| record | hash-map | hash-map |
| union | Data that matches one of the member schemas declared in the creation of the Lancaster union schema. | Same as for Clojure. |
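
A short round-trip sketch of two of these rows, assuming the behavior documented in the tables in this section: a map schema with a null values schema yields a set, and an array yields a vector even when a list was serialized. The var names are illustrative.

```clojure
(require '[deercreeklabs.lancaster :as l])

;; A map schema whose values schema is null models a set of strings.
(def tags-schema (l/map-schema l/null-schema))

(def encoded-tags (l/serialize tags-schema #{"red" "green"}))
(def decoded-tags (l/deserialize tags-schema tags-schema encoded-tags))

;; Arrays are documented to come back as vectors, whatever sequential
;; type went in.
(def ints-schema (l/array-schema l/int-schema))
(def decoded-ints
  (l/deserialize ints-schema ints-schema
                 (l/serialize ints-schema '(1 2 3))))

;; Type checks against the documented return types.
(set? decoded-tags)
(vector? decoded-ints)
```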

Notes About Union Data Types

To quote the Avro spec:

Unions may not contain more than one schema with the same type, except for the named types record, fixed and enum. For example, unions containing two array types or two map types are not permitted, but two types with different names are permitted.

In addition to the above, Lancaster disallows unions with:

At union schema creation time, Lancaster will throw an exception if the schema is disallowed.
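
Because the check happens when the schema is created, a caller can guard the creation call itself. The following is a minimal JVM-only sketch; union-or-error is a hypothetical helper, and the exact exception type and message Lancaster produces are not specified in this document.

```clojure
(require '[deercreeklabs.lancaster :as l])

;; Hypothetical helper: returns the union schema, or the exception
;; message if Lancaster rejects the member schemas. JVM Clojure only,
;; since it catches java.lang.Exception; ex-message requires
;; Clojure 1.10 or later.
(defn union-or-error [member-schemas]
  (try
    (l/union-schema member-schemas)
    (catch Exception e
      (ex-message e))))

;; One null schema plus one string schema: allowed by the Avro spec.
(def ok-union (union-or-error [l/null-schema l/string-schema]))

;; Two array schemas: not permitted by the Avro spec, so this is
;; expected to yield an error message rather than a schema.
(def bad-union
  (union-or-error [(l/array-schema l/int-schema)
                   (l/array-schema l/string-schema)]))
```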

The :lancaster/record-name Attribute

In Lancaster, Avro records are modeled as Clojure maps. At serialization time, Lancaster will attempt to determine which record schema in a union applies to the given map. It does this by checking the keys of the given map against the keys of the records in the union schema. If there is no overlap in the keys of the records in the union schema, this can be done unambiguously. However, if there is any overlap in the keys of the records, the caller to serialize must indicate which record schema is represented by the given map. This is done by adding a :lancaster/record-name key to the given map. The value of this key must be the name of the Avro record that the map represents. If this key is required but missing, Lancaster will throw an exception at serialization time.
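
The overlap rule can be sketched in plain Clojure. This is only an illustration of the rule as described above, not Lancaster's implementation; overlapping-keys and needs-record-name? are hypothetical helpers that take a map of record name to the set of that record's field keys.

```clojure
;; Hypothetical helper, not part of Lancaster: given a map of
;; record name -> set of field keys, returns the set of keys that
;; appear in more than one record.
(defn overlapping-keys [record-name->keys]
  (let [counts (frequencies (mapcat seq (vals record-name->keys)))]
    (set (for [[k n] counts :when (> n 1)] k))))

;; True when at least one key is shared, which is the case where the
;; caller would need to add :lancaster/record-name to the data.
(defn needs-record-name? [record-name->keys]
  (boolean (seq (overlapping-keys record-name->keys))))

(needs-record-name? {:person #{:name :age}
                     :dog    #{:name :owner}})
;; => true, because :name appears in both records

(needs-record-name? {:person #{:name :age}
                     :car    #{:make :model}})
;; => false, because no key is shared
```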

At deserialization time, if a union has overlapping records, the deserialized record will include a :lancaster/record-name key in the map. Whether or not this key is added may be controlled by passing the :add-record-name option to deserialize or deserialize-same. The values of this option may be:

Names and Namespaces

Named Avro schemas (records, enums, fixeds) contain a name part and, optionally, a namespace part. The Names section of the Avro spec describes this in detail. Lancaster fully supports the spec, allowing both names and namespaces. These are combined into a single fullname, including both the namespace (if any) and the name.

Lancaster schema names and namespaces must start with a letter and subsequently only contain letters, numbers, or hyphens.

When using the Schema Creation Macros, the name used in the schema is derived from the name of the symbol passed to the def-*-schema macro. If the symbol ends with -schema (as is common), the -schema portion is dropped from the name. The namespace is taken from the Clojure(Script) namespace where the schema is defined.

When using the Schema Creation Functions, the name and namespace are taken from the name-kw parameter passed to the *-schema function. If the keyword is namespaced, the keyword's namespace is used as the schema's namespace. If the keyword does not have a namespace, the schema will not have a namespace. Only the functions for creating named schemas (enum-schema, fixed-schema, and record-schema) have a name-kw parameter.

In the EDN representation of a named schema, the :name attribute contains the name of the schema, including the namespace, if any. In the JSON and PCF representations, the name portion is converted from kebab-case to PascalCase, and any namespace is converted from kebab-case to snake_case. This matches the Avro spec (which does not allow hyphenated names) and provides for easy interop with other languages (Java, JS, C++, etc.).

For example, using the def-enum-schema macro:

(l/def-enum-schema suite-schema
  :clubs :diamonds :hearts :spades)

(l/edn suite-schema)
;; {:name :user/suite, :type :enum, :symbols [:clubs :diamonds :hearts :spades]}
;; Note that the :name includes the namespace (:user in this case)
;; and that the name is 'suite', not 'suite-schema'

(l/json suite-schema)
;; "{\"name\":\"user.Suite\",\"type\":\"enum\",\"symbols\":[\"CLUBS\",\"DIAMONDS\",\"HEARTS\",\"SPADES\"]}"
;; Note that the name has been converted to user.Suite

Or using the enum-schema function:

(def suite-schema
  (l/enum-schema :a-random-ns/suite [:clubs :diamonds :hearts :spades]))

(l/edn suite-schema)
;; {:name :a-random-ns/suite, :type :enum, :symbols [:clubs :diamonds :hearts :spades]}
;; Note that the namespace is not :user, but is :a-random-ns

(l/json suite-schema)
;; "{\"name\":\"a_random_ns.Suite\",\"type\":\"enum\",\"symbols\":[\"CLUBS\",\"DIAMONDS\",\"HEARTS\",\"SPADES\"]}"
;; Note that the name has been converted to a_random_ns.Suite
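
The naming rule behind these outputs can be sketched in plain Clojure. The helpers below (kebab->pascal, kebab->snake, avro-fullname) are hypothetical and only mirror the conversion described in this section for lowercase kebab-case names; they are not Lancaster's own code.

```clojure
(require '[clojure.string :as str])

;; kebab-case -> PascalCase, e.g. "my-record" -> "MyRecord".
(defn kebab->pascal [s]
  (->> (str/split s #"-")
       (map str/capitalize)
       (str/join)))

;; kebab-case -> snake_case, e.g. "a-random-ns" -> "a_random_ns".
(defn kebab->snake [s]
  (str/replace s "-" "_"))

;; Builds the JSON-style fullname from a (possibly namespaced) keyword.
(defn avro-fullname [name-kw]
  (let [ns-part (namespace name-kw)
        name-part (kebab->pascal (name name-kw))]
    (if ns-part
      (str (kebab->snake ns-part) "." name-part)
      name-part)))

(avro-fullname :a-random-ns/suite)
;; => "a_random_ns.Suite"

(avro-fullname :suite)
;; => "Suite"
```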

API Documentation

All public vars, functions, and macros are in the deercreeklabs.lancaster namespace. All other namespaces should be considered private implementation details that may change.


def-record-schema

(def-record-schema name-symbol & fields)

Defines a var whose value is a Lancaster schema object representing an Avro record. For cases where a macro is not appropriate, use the record-schema function instead.

Parameters

Return Value

The defined var

Example

(l/def-record-schema person-schema
  [:name :required l/string-schema "no name"]
  [:age "The person's age" l/int-schema])

See Also


def-enum-schema

(def-enum-schema name-symbol & symbol-keywords)

Defines a var whose value is a Lancaster schema object representing an Avro enum. For cases where a macro is not appropriate, use the enum-schema function instead.

Parameters

Return Value

The defined var

Example

(l/def-enum-schema suite-schema
  :clubs :diamonds :hearts :spades)

See Also


def-fixed-schema

(def-fixed-schema name-symbol size)

Defines a var whose value is a Lancaster schema object representing an Avro fixed. For cases where a macro is not appropriate, use the fixed-schema function instead.

Parameters

Return Value

The defined var

Example

(l/def-fixed-schema md5-schema
  16)

See Also


def-array-schema

(def-array-schema name-symbol items-schema)

Defines a var whose value is a Lancaster schema object representing an Avro array. For cases where a macro is not appropriate, use the array-schema function instead.

Parameters

Return Value

The defined var

Example

(l/def-array-schema numbers-schema
  l/int-schema)

See Also


def-map-schema

(def-map-schema name-symbol values-schema)

Defines a var whose value is a Lancaster schema object representing an Avro map. For cases where a macro is not appropriate, use the map-schema function instead.

Parameters

Return Value

The defined var

Example

(l/def-map-schema name-to-age-schema
  l/int-schema)

See Also


def-union-schema

(def-union-schema name-symbol & member-schemas)

Defines a var whose value is a Lancaster schema object representing an Avro union. For cases where a macro is not appropriate, use the union-schema function instead.

Parameters

Return Value

The defined var

Example

(l/def-union-schema maybe-name-schema
  l/null-schema l/string-schema)

See Also


def-maybe-schema

(def-maybe-schema name-symbol schema)

Defines a var whose value is a Lancaster schema object representing an Avro union. The members of the union are null-schema and the given schema. Makes a schema nillable. For cases where a macro is not appropriate, use the maybe function instead.

Parameters

Return Value

The defined var

Example

(l/def-maybe-schema maybe-name-schema
  l/string-schema)

See Also


record-schema

(record-schema name-kw fields)

Creates a Lancaster schema object representing an Avro record, with the given name keyword and field definitions. For a more concise way to declare a record schema, see def-record-schema.

Parameters

Return Value

The new Lancaster record schema

Example

(def person-schema
  (l/record-schema :person
                   "A schema representing a person."
                   [[:name :required l/string-schema "no name"]
                    [:age l/int-schema]]))

See Also


enum-schema

(enum-schema name-kw symbol-keywords)

Creates a Lancaster schema object representing an Avro enum, with the given name and keyword symbols. For a more concise way to declare an enum schema, see def-enum-schema.

Parameters

Return Value

The new Lancaster enum schema

Example

(def suite-schema
  (l/enum-schema :suite [:clubs :diamonds :hearts :spades]))

See Also


fixed-schema

(fixed-schema name-kw size)

Creates a Lancaster schema object representing an Avro fixed, with the given name and size. For a more concise way to declare a fixed schema, see def-fixed-schema.

Parameters

Return Value

The new Lancaster fixed schema

Example

(def md5-schema
  (l/fixed-schema :md5 16))

See Also


array-schema

(array-schema items-schema)

Creates a Lancaster schema object representing an Avro array with the given items schema.

Parameters

Return Value

The new Lancaster array schema

Example

(def numbers-schema (l/array-schema l/int-schema))

See Also


map-schema

(map-schema values-schema)

Creates a Lancaster schema object representing an Avro map with the given values schema.

Parameters

Return Value

The new Lancaster map schema

Example

(def name-to-age-schema (l/map-schema l/int-schema))

See Also


union-schema

(union-schema member-schemas)

Creates a Lancaster schema object representing an Avro union with the given member schemas.

Parameters

Return Value

The new Lancaster union schema

Example

(def maybe-name-schema
  (l/union-schema [l/null-schema l/string-schema]))

See Also


maybe

(maybe schema)

Creates a Lancaster union schema whose members are l/null-schema and the given schema. Makes a schema nillable.

Parameters

Return Value

The new Lancaster union schema

Example

(def int-or-nil-schema (l/maybe l/int-schema))

See Also


serialize

(serialize writer-schema data)

Serializes data to a byte array, using the given Lancaster schema.

Parameters

Return Value

A byte array containing the Avro-encoded data

Example

(l/def-record-schema person-schema
  [:name l/string-schema]
  [:age l/int-schema])

(def encoded (l/serialize person-schema {:name "Arnold"
                                         :age 22}))

See Also


deserialize

(deserialize reader-schema writer-schema ba)
(deserialize reader-schema writer-schema ba opts)

Deserializes Avro-encoded data from a byte array, using the given reader and writer schemas. The writer schema must be resolvable to the reader schema. See Avro Schema Resolution. If the reader schema contains record fields that are not in the writer's schema, the fields' default values will be used. If no default was explicitly specified in the schema, Lancaster uses the following default values, depending on the field type:

Parameters

Return Value

The deserialized data

Example

(def person-schema
  (l/record-schema :person
                   [[:name l/string-schema "no name"]
                    [:age l/int-schema]]))

(def person-w-nick-schema
  (l/record-schema :person
                   [[:name l/string-schema "no name"]
                    [:age l/int-schema]
                    [:nickname l/string-schema "no nick"]
                    [:favorite-number l/int-schema]]))

(def encoded (l/serialize person-schema {:name "Alice"
                                         :age 20}))

(l/deserialize person-w-nick-schema person-schema encoded)
;; {:name "Alice", :age 20, :nickname "no nick", :favorite-number -1}

See Also


deserialize-same

(deserialize-same schema ba)
(deserialize-same schema ba opts)

Deserializes Avro-encoded data from a byte array, using the given schema as both the reader and writer schema.

Note that this is not recommended, since it does not allow for schema resolution / evolution. The original writer's schema should always be used to deserialize. The writer's schema (in Parsing Canonical Form) should always be stored or transmitted with encoded data. If the schema specified in this function does not match the schema with which the data was encoded, the function will fail, possibly in strange ways. You should generally use the deserialize function instead.

Parameters

Return Value

The deserialized data

Example

(l/def-record-schema dog-schema
  [:name l/string-schema]
  [:owner l/string-schema])

(def encoded (l/serialize dog-schema {:name "Fido"
                                      :owner "Roger"}))

(l/deserialize-same dog-schema encoded)
;; {:name "Fido" :owner "Roger"}
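
The recommendation above, to keep the writer's schema with the encoded data, might look like the following sketch. The message map with its :writer-pcf and :payload keys and the read-message helper are hypothetical; the sketch assumes that json->schema accepts the string returned by pcf, since Parsing Canonical Form is itself an Avro JSON schema.

```clojure
(require '[deercreeklabs.lancaster :as l])

(l/def-record-schema dog-schema
  [:name l/string-schema]
  [:owner l/string-schema])

;; Hypothetical envelope: the payload travels with the writer's schema
;; in Parsing Canonical Form.
(def message
  {:writer-pcf (l/pcf dog-schema)
   :payload (l/serialize dog-schema {:name "Fido" :owner "Roger"})})

;; The reader rebuilds the writer's schema from the PCF string and
;; passes it to deserialize along with its own reader schema.
(defn read-message [reader-schema {:keys [writer-pcf payload]}]
  (l/deserialize reader-schema (l/json->schema writer-pcf) payload))

(def decoded-dog (read-message dog-schema message))
```

With this shape, the reader's schema can evolve independently of the writer's, which deserialize-same does not allow.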

See Also


json->schema

(json->schema json)

Creates a Lancaster schema object from an Avro schema in JSON format.

Parameters

Return Value

The new Lancaster schema

Example

(def person-schema
  (l/json->schema
   (str "{\"name\":\"Person\",\"type\":\"record\",\"fields\":"
        "[{\"name\":\"name\",\"type\":\"string\",\"default\":\"no name\"},"
        "{\"name\":\"age\",\"type\":\"int\",\"default\":-1}]}")))

edn

(edn schema)

Returns the EDN representation of the given Lancaster schema.

Parameters

Return Value

EDN representation of the given Lancaster schema

Example

(l/def-enum-schema suite-schema
  :clubs :diamonds :hearts :spades)

(l/edn suite-schema)
;; {:name :suite, :type :enum, :symbols [:clubs :diamonds :hearts :spades]}

See Also


json

(json schema)

Returns an Avro-compliant JSON representation of the given Lancaster schema.

Parameters

Return Value

JSON representation of the given Lancaster schema

Example

(l/def-enum-schema suite-schema
  :clubs :diamonds :hearts :spades)

(l/json suite-schema)
;; "{\"name\":\"Suite\",\"type\":\"enum\",\"symbols\":[\"CLUBS\",\"DIAMONDS\",\"HEARTS\",\"SPADES\"]}"

See Also


pcf

(pcf schema)

Returns a JSON string containing the Parsing Canonical Form for the given Lancaster schema.

Parameters

Return Value

A JSON string

Example

(l/def-enum-schema suite-schema
  :clubs :diamonds :hearts :spades)

(l/pcf suite-schema)
;; "{\"name\":\"Suite\",\"type\":\"enum\",\"symbols\":[\"CLUBS\",\"DIAMONDS\",\"HEARTS\",\"SPADES\"]}"
;; Note that this happens to be the same as (l/json suite-schema) for this
;; particular schema. That is not generally the case.

See Also


fingerprint64

(fingerprint64 schema)

Returns the 64-bit Rabin fingerprint of the Parsing Canonical Form for the given Lancaster schema.

Parameters

Return Value

A 64-bit Long representing the fingerprint. For JVM Clojure, this is a java.lang.Long. For ClojureScript, it is a goog.math.Long.

Example

(l/def-enum-schema suite-schema
  :clubs :diamonds :hearts :spades)

(l/fingerprint64 suite-schema)
;; 5882396032713186004
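
One possible use of a fingerprint is as a compact key for looking up schemas. The sketch below is a hypothetical in-memory registry for JVM Clojure, where the fingerprint is a java.lang.Long and works directly as a map key; register-schema! and lookup-schema are not part of Lancaster.

```clojure
(require '[deercreeklabs.lancaster :as l])

;; Hypothetical in-memory registry: fingerprint -> schema.
(def schema-registry (atom {}))

;; Stores the schema under its 64-bit fingerprint and returns the key.
(defn register-schema! [schema]
  (let [fp (l/fingerprint64 schema)]
    (swap! schema-registry assoc fp schema)
    fp))

;; Returns the registered schema for a fingerprint, or nil.
(defn lookup-schema [fp]
  (get @schema-registry fp))

(l/def-enum-schema suite-schema
  :clubs :diamonds :hearts :spades)

(def suite-fp (register-schema! suite-schema))
(lookup-schema suite-fp) ; returns suite-schema
```

In ClojureScript the fingerprint is a goog.math.Long object, so a registry like this would probably need to key on a derived value such as its string form.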

See Also


fingerprint128

(fingerprint128 schema)

Returns the 128-bit MD5 Digest of the Parsing Canonical Form for the given Lancaster schema.

Parameters

Return Value

A byte array of 16 bytes (128 bits) representing the fingerprint.

Example

(l/def-enum-schema suite-schema
  :clubs :diamonds :hearts :spades)

(l/fingerprint128 suite-schema)
;; [92, 31, 14, -85, -40, 26, 121, -60, -38, 4, -81, -125, 100, 71, 101, 94]

See Also


fingerprint256

(fingerprint256 schema)

Returns the 256-bit SHA-256 hash of the Parsing Canonical Form for the given Lancaster schema.

Parameters

Return Value

A byte array of 32 bytes (256 bits) representing the fingerprint.

Example

(l/def-enum-schema suite-schema
  :clubs :diamonds :hearts :spades)

(l/fingerprint256 suite-schema)
;; [-119, 91, -127, 37, -2, 96, 35, -95, 79, -123, -108, -27, -49, 39,
;;  118, 95, -106, -34, -72, 63, -118, -33, -123, -10, -19, 96, 33, -40,
;;  73, -34, 25, -109]

See Also


schema?

(schema? arg)

Returns a boolean indicating whether or not the argument is a Lancaster schema object.

Parameters

Return Value

A boolean indicating whether or not the argument is a Lancaster schema object

Example

(l/def-enum-schema suite-schema
  :clubs :diamonds :hearts :spades)

(l/schema? suite-schema)
;; true

(l/schema? :clubs)
;; false

default-data

(default-data schema)

Creates default data that conforms to the given Lancaster schema. The following values are used for the primitive data types:

Default data for complex schemas are built up from the primitives.

Parameters

Return Value

Data that matches the given schema

Example

(l/def-enum-schema suite-schema
  :clubs :diamonds :hearts :spades)

(l/default-data suite-schema)
;; :clubs
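
Since default data conforms to its schema, it can serve as a quick smoke test for a schema. The round-trips? helper below is a hypothetical sketch that serializes the default data and checks that it deserializes to an equal value.

```clojure
(require '[deercreeklabs.lancaster :as l])

(l/def-enum-schema suite-schema
  :clubs :diamonds :hearts :spades)

;; Hypothetical smoke test: default data should serialize with its own
;; schema and survive a round trip.
(defn round-trips? [schema]
  (let [data (l/default-data schema)
        encoded (l/serialize schema data)]
    (= data (l/deserialize schema schema encoded))))

(round-trips? suite-schema)
```

Note that = compares byte arrays by identity on the JVM, so this check is not suitable as written for schemas containing bytes or fixed types.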

Developing

Issues and PRs are welcome. When submitting a PR, please run the Clojure, Browser, and Node unit tests on your PR. Here's how:

If your PR might affect performance, it can be helpful to run the performance tests (bin/kaocha perf).


License

Copyright (c) 2017-2019 Deer Creek Labs, LLC

Apache Avro, Avro, Apache, and the Avro and Apache logos are trademarks of The Apache Software Foundation.

Distributed under the Apache Software License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0.txt