elastic / ecs

Elastic Common Schema
https://www.elastic.co/what-is/ecs
Apache License 2.0
1k stars 417 forks source link

YAML validation #463

Open codebrain opened 5 years ago

codebrain commented 5 years ago

Can we look at implementing some kind of schema validation on the YAML files? Something along the lines of: https://github.com/paazmaya/yaml-validator?

Alternatively, perhaps a move towards a schema type that has native validation? JSON Schema?

ruflin commented 5 years ago

Can you elaborate a bit on where the users would use / deploy this validator?

codebrain commented 5 years ago

The thrust of my argument comes from the code generation perspective, whereby the specification will be used as the source of truth for generating C# types and helpers. In these instances, the downstream impact of an incorrect or incomplete specification often means breaking binary compatibility in the generated artifact. I believe this is also true for other statically typed languages, not just C#.

Anything we can do to validate that the specification is complete and correct minimises some of this downstream impact.

Perhaps this validation can be tied to the CI system, such that an update to the YAML specification results in a CI validation check to assert that the YAML is;

  1. Well formed - parsable
  2. Complete - all required fields and their values are present
  3. Correct - all field types match expected types and the values are consistent with known boundaries
ruflin commented 5 years ago

++. I remember just recently seeing a PR from @jsoriano in Beats which does some YAML validation. Perhaps we can reuse parts from there?

jsoriano commented 5 years ago

In https://github.com/elastic/beats/pull/13188 I am adding a check to Beats build to validate that the fields definitions have correct types and formats. Beats case is a bit special because we compose the big fields definition YAML file by concatenating many small YAML-like files, and we were not checking this validity yet, so we could end-up with obscure errors in tests, or with some unexpected behaviours.

For the check I am using the LoadFieldsYaml() function from github.com/elastic/beats/libbeat/mapping, what by itself is enough to check if the final YAML is well formed. Under the hood this function uses go-ucfg (github.com/elastic/go-ucfg), that allows to define custom validation for each parsed object, this way I am checking that format and type are correct. Other checks could be added to better check the completeness and correctness of all fields.

What I think that go-ucfg doesn't cover yet is the detection of unsupported attributes in parsed objects.

webmat commented 5 years ago

I agree with this need. An embryo of some of these ideas is already in place, but is far from perfect.

Here's a few of the things we have already:

The checks in place right now are not enough for my liking, I agree we need more.

Side point. When you talk about bounds checking, I assume you're talking about the ignore_above param on keyword fields? This param is a kind of protection for Elasticsearch. A user can basically have as much text as they want in these fields. All this param means is that if the text is longer than ignore_above, this doc's value for the field won't be added to the search index. But the field value will still be present in full in the document's source, however. So I'm not sure I would necessarily enforce bounds checking based on that.

Related to this, there is work that needs to happen on adjusting ignore_above in many places, though. You can see past discussions in #105 and #270.

ChrisHines commented 3 years ago

Is a schema validator still under consideration? If so, would a CUE schema (cuelang.org) be a viable alternative?

ebeahan commented 2 years ago

@ChrisHines, yes, this is something valuable to incorporate at some point. I've done some JSON schema work as part of other ECS-related projects that could serve as a starting point: https://github.com/ebeahan/ecs-vscode/blob/main/schema/schema.json