elastic / ecs

Elastic Common Schema
https://www.elastic.co/what-is/ecs
Apache License 2.0

JSON schema generator #1687

Open th0ger opened 2 years ago

th0ger commented 2 years ago

Summary

Provide a generator script that converts the ECS YAML schemas into the JSON Schema standard.

Motivation

The ECS standard describes how to store logs inside Elasticsearch, and is tied to specific implementation details such as Elasticsearch datatypes:

The Elastic Common Schema (ECS) is an open source specification, developed with support from the Elastic user community. ECS defines a common set of fields to be used when storing event data in Elasticsearch, such as logs and metrics.

ECS specifies field names and Elasticsearch datatypes for each field, and provides descriptions and example usage

However, it is not so clear if ECS also aims to provide a standard JSON logging format for 3rd-party application developers (upstream logging before ingesting into ES). There are several logging libraries available, which obviously suggests that app developers/organizations can benefit from logging directly in ECS JSON format.

Assuming we ask developers to log directly in this JSON format, then there is an obvious shortcoming: There is no JSON schema defined for ECS logs. An obvious schema-specification language would be https://json-schema.org/.

A JSON schema can support multiple use cases, such as:

Detailed Design

It could make sense to write a generator script to convert YML schemas into the "json-schema.org" standard. This would support custom fields being defined in YML format and converted into custom json schema.

One challenge is to convert from rich Elasticsearch data types to json data types. Does there exist a well-defined mapping?
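To make the question concrete, here is a minimal sketch of what such a mapping could look like in Python. It is an assumption for illustration, not an official Elastic or ECS mapping: the table `ES_TO_JSON_SCHEMA` and the function `json_schema_for_es_type` are hypothetical names, and several entries are deliberately lossy (for example, Elasticsearch `date` fields also accept epoch milliseconds by default, which this sketch ignores).

```python
import copy

# Hypothetical lookup table: Elasticsearch field type -> JSON Schema fragment.
# Illustrative only; not an official ECS or Elasticsearch mapping.
ES_TO_JSON_SCHEMA = {
    # String-like types
    "keyword": {"type": "string"},
    "constant_keyword": {"type": "string"},
    "wildcard": {"type": "string"},
    "text": {"type": "string"},
    "match_only_text": {"type": "string"},
    # Could be tightened with the "ipv4"/"ipv6" formats
    "ip": {"type": "string"},
    # Lossy: ignores numeric epoch values that Elasticsearch also accepts
    "date": {"type": "string", "format": "date-time"},
    # Numeric types, with bounds where the Elasticsearch type is narrow
    "byte": {"type": "integer", "minimum": -128, "maximum": 127},
    "short": {"type": "integer", "minimum": -32768, "maximum": 32767},
    "integer": {"type": "integer", "minimum": -2147483648, "maximum": 2147483647},
    "long": {"type": "integer"},
    "float": {"type": "number"},
    "double": {"type": "number"},
    "half_float": {"type": "number"},
    "scaled_float": {"type": "number"},
    "boolean": {"type": "boolean"},
    # Structured types
    "object": {"type": "object"},
    "flattened": {"type": "object"},
    "nested": {"type": "array", "items": {"type": "object"}},
}


def json_schema_for_es_type(es_type):
    """Return a JSON Schema fragment for an Elasticsearch field type.

    Unknown types fall back to the empty schema {}, which accepts any value.
    """
    return copy.deepcopy(ES_TO_JSON_SCHEMA.get(es_type, {}))
```

Falling back to the empty schema keeps unknown types from rejecting valid events; a stricter generator might raise an error instead. Which choice is right likely depends on the use case.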

djptek commented 2 years ago

Hi @th0ger there is a JSON definition for the Elasticsearch mapping here

Might this be an option to aid JSON validation?

th0ger commented 2 years ago

Hi @djptek, not sure I understand the relevance of that template, could you elaborate?

djptek commented 2 years ago

Hi @th0ger the template describes Elasticsearch mappings which include the Elasticsearch data type for all ECS fields

th0ger commented 2 years ago

@djptek Yes, and that is essentially the same information as stored in the YML schemas (just packaged in a JSON wrapper). It doesn't say anything about what the corresponding JSON type should be for each field, or how Elasticsearch types in general map to JSON types. Or am I missing something?

ebeahan commented 2 years ago

Hi, @th0ger!

However, it is not so clear if ECS also aims to provide a standard JSON logging format for 3rd-party application developers

I agree that supporting a schema spec language like JSON Schema could be valuable for devs adopting ECS in their app logs. The current tooling was created to maintain the schema and create artifacts to help devs and admins manage ECS in their deployments. However, expanding the project to better support developers with the tools and formats is part of a future vision for ECS.

Assuming we ask developers to log directly in this JSON format, then there is an obvious shortcoming: There is no JSON schema defined for ECS logs. An obvious schema-specification language would be https://json-schema.org/.

There have been past discussions of using JSON schema to validate the schema YAML files themselves (#463), and I have a proof-of-concept here. I've also made a demonstration of how an ECS-based index template could be validated using a generated JSON schema-spec: https://github.com/ebeahan/ecs-validator/blob/main/schemas/ecs-1.9.0.json.

What's proposed here would validate the JSON events produced by an app, so neither of those concepts matches precisely. However, I linked them here since they could inspire future work.

It could make sense to write a generator script to convert YML schemas into the "json-schema.org" standard.

Yes, having a generator would support folks bringing in their custom fields and some of the other ECS tooling features (like --subset)

One challenge is to convert from rich Elasticsearch data types to json data types. Does there exist a well-defined mapping?

I don't know of such a mapping off-hand, but yes, something like this would need to be defined.

th0ger commented 2 years ago

@ebeahan, thanks for your welcoming feedback. It's nice to hear that we are on the same page. I had already read the references you posted before submitting this issue.

rrjjvv commented 2 years ago

If I understand the request and discussion, this is something we really needed as well. Our use case was that we enabled strict validation on our indexes and logs were getting dropped; we needed a way for devs to know their app logs complied with the schema and would be accepted ahead of time. We built our own converter, inspired by this partial implementation. (After coming across that, I was surprised a proper converter had never been done, since Elastic had a need for it themselves, but I also have guesses as to why they haven't.)

Our converter is in Python, but using that JavaScript as a starting point got us 90% of the way there. The other 10% are conversions/overrides that probably work but cover things our company doesn't currently use. So far it has been 100% accurate for us (we use an unmodified ECS common with some additional custom fields). I'm far from claiming it's a "correct" implementation, but it has served us well.

th0ger commented 2 years ago

@rrjjvv I read your use case as in line with my proposal.

Nice to see the reference to jsonSchemaTypeFromEcsType(); it's a key nontrivial conversion. I wonder if there is one "right"/"best" type conversion function, or if it's up for discussion depending on user needs.

rrjjvv commented 2 years ago

Looking back at my code (it has been a long time), most of the non-trivial code is dedicated to handling fields with "normalizations" and maintaining the nested tree of objects. And a little bit just trying to preserve as much of the schema data (not related to validation itself) as possible, like mapping ECS "short" to jsonschema "title", converting ECS "example" to jsonschema "examples", etc. But for validation itself, that jsonSchemaTypeFromEcsType() will indeed get you most of the way there.
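For readers who want a picture of those steps, the following is a rough Python sketch, not the actual converter described above. It assumes the input is a dict of flat dotted field names to field definitions with `type`, `short`, `example`, and `normalize` keys (roughly the shape of the generated flat ECS field listing). `SIMPLE_TYPES`, `field_to_schema`, and `build_schema` are hypothetical names, and the type table is cut down to a few entries.

```python
# Hypothetical, deliberately tiny type table; a real converter needs many more.
SIMPLE_TYPES = {
    "keyword": {"type": "string"},
    "long": {"type": "integer"},
    "boolean": {"type": "boolean"},
}


def field_to_schema(field):
    """Convert one field definition into a JSON Schema fragment."""
    # Unknown types become the empty schema, which accepts any value.
    leaf = dict(SIMPLE_TYPES.get(field.get("type"), {}))
    # Fields normalized to "array" are expected to hold a list of values.
    if "array" in field.get("normalize", []):
        leaf = {"type": "array", "items": leaf}
    # Preserve descriptive metadata that plays no role in validation.
    if "short" in field:
        leaf["title"] = field["short"]
    if "example" in field:
        # Copied verbatim; the example is not checked against the schema.
        leaf["examples"] = [field["example"]]
    return leaf


def build_schema(flat_fields):
    """Build a nested object schema from flat dotted field names."""
    root = {"type": "object", "properties": {}}
    for flat_name, field in flat_fields.items():
        *parents, leaf_name = flat_name.split(".")
        node = root
        for part in parents:
            # Create intermediate objects on demand, e.g. "host" for "host.name".
            node = node["properties"].setdefault(
                part, {"type": "object", "properties": {}}
            )
            node.setdefault("properties", {})
        node["properties"][leaf_name] = field_to_schema(field)
    return root
```

Calling `build_schema({"host.name": {"type": "keyword", "short": "Name of the host."}})` yields an object schema with a `host` object containing a string-typed `name` property. The sketch does not handle a field that is both a leaf and a parent of other fields, which a full converter would need to resolve.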