gabledata / recap

Work with your web service, database, and streaming schemas in a single format.
https://recap.build
MIT License
recap

What is Recap?

Recap reads and writes schemas from web services, databases, and schema registries in a standard format.

⭐️ If you like this project, please give it a star! It helps the project get more visibility.

Supported Formats

Format Read Write
Avro
BigQuery
Confluent Schema Registry
Hive Metastore
JSON Schema
MySQL
PostgreSQL
Protobuf
Snowflake
SQLite

Install

Install Recap and all of its optional dependencies:

pip install 'recap-core[all]'

You can also select specific dependencies:

pip install 'recap-core[avro,kafka]'

See pyproject.toml for a list of optional dependencies.

Usage

CLI

Recap comes with a command line interface that can list and read schemas from external systems.

List the children of a URL:

recap ls postgresql://user:pass@host:port/testdb
[
  "pg_toast",
  "pg_catalog",
  "public",
  "information_schema"
]

Keep drilling down:

recap ls postgresql://user:pass@host:port/testdb/public
[
  "test_types"
]

Read the schema for the test_types table as a Recap struct:

recap schema postgresql://user:pass@host:port/testdb/public/test_types
{
  "type": "struct",
  "fields": [
    {
      "type": "int64",
      "name": "test_bigint",
      "optional": true
    }
  ]
}
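
The JSON printed by recap schema can be read with any JSON parser. As a minimal sketch, assuming only the struct layout shown above (a fields list whose entries carry name, type, and optional), the following standard-library Python lists a struct's fields. The helper name summarize_struct is hypothetical and is not part of Recap.

```python
import json

# Sample output copied from the `recap schema` example above.
SCHEMA_JSON = """
{
  "type": "struct",
  "fields": [
    {"type": "int64", "name": "test_bigint", "optional": true}
  ]
}
"""


def summarize_struct(schema_json: str) -> list[tuple[str, str, bool]]:
    """Return (name, type, optional) for each field of a struct schema.

    Hypothetical helper; assumes the JSON layout shown in the example.
    """
    schema = json.loads(schema_json)
    if schema.get("type") != "struct":
        raise ValueError("expected a struct schema")
    return [
        (field.get("name", ""), field["type"], field.get("optional", False))
        for field in schema.get("fields", [])
    ]


for name, type_, optional in summarize_struct(SCHEMA_JSON):
    print(f"{name}: {type_}{' (optional)' if optional else ''}")
```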

Gateway

Recap comes with a stateless HTTP/JSON gateway that can list and read schemas from data catalogs and databases.

Start the server at http://localhost:8000:

recap serve

List the schemas in a PostgreSQL database:

curl http://localhost:8000/gateway/ls/postgresql://user:pass@host:port/testdb
["pg_toast","pg_catalog","public","information_schema"]

And read a schema:

curl http://localhost:8000/gateway/schema/postgresql://user:pass@host:port/testdb/public/test_types
{"type":"struct","fields":[{"type":"int64","name":"test_bigint","optional":true}]}

The gateway fetches schemas from external systems in real time and returns them as Recap schemas.

An OpenAPI schema is available at http://localhost:8000/docs.
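
The same gateway calls can be made from Python. The sketch below uses only the standard library and assumes the /gateway/ls/ and /gateway/schema/ paths from the curl examples above. The function names gateway_url and gateway_get are hypothetical, not part of Recap.

```python
import json
import urllib.request

GATEWAY = "http://localhost:8000"  # address used by `recap serve` above


def gateway_url(action: str, system_url: str, base: str = GATEWAY) -> str:
    """Build a gateway URL; `action` is "ls" or "schema" as in the curl examples."""
    if action not in ("ls", "schema"):
        raise ValueError(f"unsupported action: {action}")
    return f"{base.rstrip('/')}/gateway/{action}/{system_url}"


def gateway_get(action: str, system_url: str, base: str = GATEWAY):
    """Fetch and decode the gateway's JSON response (requires a running server)."""
    with urllib.request.urlopen(gateway_url(action, system_url, base)) as resp:
        return json.load(resp)
```

With the server running, gateway_get("ls", ...) should return the same list that the curl example prints, and gateway_get("schema", ...) the struct as a Python dict.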

Registry

You can store schemas in Recap's schema registry.

Start the server at http://localhost:8000:

recap serve

Put a schema in the registry:

curl -X POST \
    -H "Content-Type: application/x-recap+json" \
    -d '{"type":"struct","fields":[{"type":"int64","name":"test_bigint","optional":true}]}' \
    http://localhost:8000/registry/some_schema

Get the schema (and version) from the registry:

curl http://localhost:8000/registry/some_schema
[{"type":"struct","fields":[{"type":"int64","name":"test_bigint","optional":true}]},1]

Put a new version of the schema in the registry:

curl -X POST \
    -H "Content-Type: application/x-recap+json" \
    -d '{"type":"struct","fields":[{"type":"int32","name":"test_int","optional":true}]}' \
    http://localhost:8000/registry/some_schema

List schema versions:

curl http://localhost:8000/registry/some_schema/versions
[1,2]

Get a specific version of the schema:

curl http://localhost:8000/registry/some_schema/versions/1
[{"type":"struct","fields":[{"type":"int64","name":"test_bigint","optional":true}]},1]

The registry uses fsspec to store schemas, so it can write to filesystems such as S3, GCS, Azure Blob Storage (ABS), and the local filesystem. See the registry docs for more details.

An OpenAPI schema is available at http://localhost:8000/docs.
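
The registry calls above can also be scripted. This is a minimal sketch using only Python's standard library. It assumes the /registry/<name> path, the application/x-recap+json content type, and the [schema, version] response shape shown in the curl examples. The helper names build_put_request and parse_get_response are hypothetical.

```python
import json
import urllib.request

REGISTRY = "http://localhost:8000/registry"  # from the curl examples above


def build_put_request(name: str, schema: dict, base: str = REGISTRY) -> urllib.request.Request:
    """Build the POST request that stores `schema` under `name`."""
    return urllib.request.Request(
        f"{base}/{name}",
        data=json.dumps(schema).encode("utf-8"),
        headers={"Content-Type": "application/x-recap+json"},
        method="POST",
    )


def parse_get_response(body: str) -> tuple[dict, int]:
    """Split the registry's [schema, version] response into its two parts."""
    schema, version = json.loads(body)
    return schema, version


schema = {
    "type": "struct",
    "fields": [{"type": "int64", "name": "test_bigint", "optional": True}],
}
request = build_put_request("some_schema", schema)
# urllib.request.urlopen(request) would send it; omitted here because it
# needs a running server.
```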

API

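Recap's Python API is split into two packages: recap.clients reads schemas from external systems, and recap.converters translates between Recap schemas and other schema formats.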

Read a schema from PostgreSQL:

from recap.clients import create_client

with create_client("postgresql://user:pass@host:port/testdb") as c:
    # Keep the result; the conversion examples below use it as `struct`
    struct = c.schema("testdb", "public", "test_types")

Convert the schema to Avro, Protobuf, and JSON schemas:

from recap.converters.avro import AvroConverter
from recap.converters.protobuf import ProtobufConverter
from recap.converters.json_schema import JSONSchemaConverter

avro_schema = AvroConverter().from_recap(struct)
protobuf_schema = ProtobufConverter().from_recap(struct)
json_schema = JSONSchemaConverter().from_recap(struct)

Transpile schemas from one format to another:

from recap.converters.json_schema import JSONSchemaConverter
from recap.converters.avro import AvroConverter

json_schema = """
{
    "type": "object",
    "$id": "https://recap.build/person.schema.json",
    "properties": {
        "name": {"type": "string"}
    }
}
"""

# Use Recap as an intermediate format to convert JSON schema to Avro
struct = JSONSchemaConverter().to_recap(json_schema)
avro_schema = AvroConverter().from_recap(struct)

Store schemas in Recap's schema registry:

from recap.storage.registry import RegistryStorage
from recap.types import StructType, IntType

storage = RegistryStorage("file:///tmp/recap-registry-storage")
version = storage.put(
    "postgresql://localhost:5432/testdb/public/test_table",
    StructType(fields=[IntType(32)])
)
storage.get("postgresql://localhost:5432/testdb/public/test_table")

# Get all versions of a schema
versions = storage.versions("postgresql://localhost:5432/testdb/public/test_table")

# List all schemas in the registry
schemas = storage.ls()

Docker

Recap's gateway and registry are also available as a Docker image:

docker run \
    -p 8000:8000 \
    -e 'RECAP_URLS=["postgresql://user:pass@localhost:5432/testdb"]' \
    ghcr.io/recap-build/recap:latest

See Recap's Docker documentation for more details.
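
Because RECAP_URLS holds a JSON array, shell quoting is easy to get wrong. One way to avoid hand-quoting is to build the command programmatically. The sketch below assumes only the image name, port, and RECAP_URLS variable from the example above. The function docker_run_args is hypothetical.

```python
import json
import shlex


def docker_run_args(
    urls: list[str],
    image: str = "ghcr.io/recap-build/recap:latest",
    port: int = 8000,
) -> list[str]:
    """Build the `docker run` argument list with RECAP_URLS as a JSON array."""
    return [
        "docker", "run",
        "-p", f"{port}:{port}",
        "-e", f"RECAP_URLS={json.dumps(urls)}",
        image,
    ]


args = docker_run_args(["postgresql://user:pass@localhost:5432/testdb"])
print(shlex.join(args))  # shell-safe command line, quoted by shlex
```

The argument list can be passed directly to subprocess.run(args), which bypasses the shell and its quoting rules entirely.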

Schema

See Recap's type spec for details on Recap's type system.

Documentation

Recap's documentation is available at recap.build.