[Feature Request] Validation for `bytes`

bufbuild / protovalidate

Protocol Buffer Validation - Go, Java, Python, and C++ Beta Releases!

https://buf.build/bufbuild/protovalidate

Apache License 2.0

885 stars 37 forks source link

[Feature Request] Validation for `bytes` #264

Open ash2k opened 5 days ago

ash2k commented 5 days ago

Feature description:

Allow to check that a bytes field contains only allowed octets. This cannot be done with CEL matches() because bytes in the field may not be valid UTF-8.

Problem it solves or use case:

I want to validate that a bytes field has only certain bytes in it. Specifically, that it's a valid HTTP header name / value.

HTTP header values are not UTF-8, can have binary data: https://datatracker.ietf.org/doc/html/rfc9110#name-field-values.

I also want to avoid the cost of validating HTTP header names as UTF-8 that proto does (so have to use bytes), but instead to validate them manually.

Proposed implementation or solution:

Contribution:

Willing to do code review.

Examples or references:

Additional context:

https://github.com/bufbuild/protovalidate/issues/263.

rodaine commented 3 days ago

matches supports escaping of octet values. We do this in our conformance tests. Is that not sufficient?

ash2k commented 3 days ago

Indeed, it works fine! I got confused by the description of the validation rule. I read it as if it says that the data that is being validated must be utf-8 encoded, not that the pattern itself should be utf-8 encoded:

// The value of the field must be valid UTF-8 or validation will fail with a // runtime error.

https://github.com/bufbuild/protovalidate/blob/3f80b7667dbd0ba4140767013738f150dfd3df54/proto/protovalidate/buf/validate/validate.proto#L3830-L3845

This worked for me:

message HeaderValues {
  // At least one item, but it can be empty.
  repeated bytes value = 1 [
    (buf.validate.field).repeated.min_items = 1,
    (buf.validate.field).repeated.items.bytes.pattern = "^[^\\x00-\\x08\\x0A-\\x1F\\x7F]*$"
  ];
}

message HeaderKV {
  bytes key = 1 [(buf.validate.field).bytes.pattern = "^:?[0-9a-zA-Z!#$%&'*+-.^_|~`]+$"];
  HeaderValues value = 2 [(buf.validate.field).required = true];
}

Thank you!

rodaine commented 2 days ago

I got confused by the description of the validation rule.

Hm, this might be copy-pasta from the pattern rule over on strings (which does have the UTF-8 requirement). I'll fix up the documentation to be clearer. Thanks!

ash2k commented 2 days ago

@rodaine Thank you for an excellent library!

ash2k commented 2 days ago

Hm, I've added some proper tests and I'm getting an error: runtime error: error evaluating bytes.pattern: invalid UTF-8 in bytes, cannot convert to string. I believe this is because string(this) fails in the CEL expression.

https://github.com/bufbuild/protovalidate/blob/3f80b7667dbd0ba4140767013738f150dfd3df54/proto/protovalidate/buf/validate/validate.proto#L3842-L3845

I think we might need an overload for matches() that works natively on bytes.

ash2k commented 1 day ago

I think the error is generated here: https://github.com/google/cel-go/blob/master/common/types/bytes.go#L104-L106