
sql: Add protobuf builtins and column type #47534

Open dt opened 4 years ago

dt commented 4 years ago

We currently store encoded protobuf messages in bytes columns in many cases, as they neatly handle collecting and encoding system-specific fields. However, this currently has the downside of making that column totally opaque to SQL interactions -- in order to manually read a given field, or even harder, manipulate one, we end up copy-pasting those bytes out to protoc or custom-written one-off programs. This complicates debugging during incidents (when we often want to peek into system tables that contain encoded protos, sometimes without pre-written debugging tools to expose them) and is just generally a source of friction.

Instead, it would be nice if cockroach could natively decode, and ideally also encode, proto messages stored in bytes columns. Obviously this requires that cockroach have the message descriptor.

One quick win here could be a pair of builtin functions, `proto_decode(msg bytes, desc bytes) -> jsonb` and `proto_encode(msg jsonb, desc bytes) -> bytes`, which translate bytes to/from JSON using the supplied message descriptor.
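For concreteness, here's a sketch of what using those builtins from a Go client might look like (nothing here exists yet; `db` is a `database/sql` `*DB`, and the variable names are placeholders):

```go
// Hypothetical round trip through the proposed builtins.
// raw holds an encoded proto message and descBytes its message descriptor.
func roundTrip(db *sql.DB, raw, descBytes []byte) error {
	var decoded string // the JSONB rendering of the message
	if err := db.QueryRow(
		`SELECT proto_decode($1, $2)`, raw, descBytes,
	).Scan(&decoded); err != nil {
		return err
	}
	// The inverse: re-encode the (possibly edited) JSON into proto bytes.
	var reencoded []byte
	if err := db.QueryRow(
		`SELECT proto_encode($1::JSONB, $2)`, decoded, descBytes,
	).Scan(&reencoded); err != nil {
		return err
	}
	_ = reencoded // use the re-encoded bytes as needed
	return nil
}
```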

Even nicer though would be to have a native proto column type, parameterized by a supplied message descriptor. With this persisted in the TableDescriptor, the column could then be rendered as JSON natively and transparently in SQL, without the cruft of the builtins or of constantly re-supplying the message descriptor.
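Purely as a strawman, such a type might be declared like this -- the `PROTO(...)` syntax, the descriptor literal, and the message name are all invented for illustration:

```go
// Strawman DDL for a descriptor-parameterized proto column type.
// PROTO(...) is invented syntax; '\x0a...' stands in for an encoded
// message descriptor, and the message name is a made-up example.
func createEventsTable(db *sql.DB) error {
	_, err := db.Exec(`
		CREATE TABLE events (
			id      INT PRIMARY KEY,
			details PROTO('\x0a...', 'cockroach.server.EventDetail')
		)`)
	return err
}
```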

Ideally we'd also extend our ALTER .. SET TYPE support to include updating the persisted message descriptor, and maybe even allow upgrading a bytes column to/from a proto column in place (since this would just be changing the type and desc in the metadata, as the on-disk data is the same either way).
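Continuing the strawman, such a conversion might look like:

```go
// Equally hypothetical: converting an existing bytes column in place, or
// swapping the persisted descriptor later -- a metadata-only change, since
// the on-disk data is the same either way.
func upgradeDetailsColumn(db *sql.DB) error {
	_, err := db.Exec(`
		ALTER TABLE events ALTER COLUMN details
		SET TYPE PROTO('\x0a...', 'cockroach.server.EventDetail')`)
	return err
}
```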

The builtin approach is likely easier, and even complementary to the native type, as it lays the groundwork and provides a compatibility bridge for those, like us, with existing bytes columns.

Jira issue: CRDB-4401

lancerutkin commented 4 years ago

Wouldn't technically be my first issue, but I'd like to try this if that's ok. I'll probably take the builtin+native type approach, as I've spent time in builtins before. It looks like for the proto type I'd be working in pkg/sql/types/types.go, with new tests going in types_test.go in the same directory. Let me know if I'm off base on that.

If you know of any particular existing types off of which I might want to model my work, that'd be extremely helpful.

otan commented 4 years ago

I'd suggest avoiding adding a new type initially, as it breaks a lot of tests that assume the type can be encoded to, decoded from, dumped, printed, etc. You can follow along these PRs (https://github.com/cockroachdb/cockroach/pull/47171) (https://github.com/cockroachdb/cockroach/pull/42481) to see it's a little... much. Maybe add it later as an extension (but it probably deserves an RFC, or at least a one-pager describing format, encoding, display, dumping, etc.).

For the builtin approach, proto_decode and proto_encode shouldn't need a new type (at least initially) -- you can just rely on them being bytes. Each should probably just take in two byte arrays, figure out the descriptor from the descriptor arg, and do the decoding/encoding from there. I think a sample query would be helpful to make sure we understand it (and to have something to test against).

dt commented 4 years ago

As far as a test case, the one that made me wish we had this would be something like:

```sql
select proto_decode(descriptor, '\x...') from system.descriptor;
```

where '\x...' is the bytes for the (proto message) descriptor for our cockroach.sql.sqlbase.Descriptor message (in structured.proto).

dt commented 4 years ago

Actually, after refreshing my memory of how proto descriptors work, I guess we might need the whole encoded FileDescriptorSet, not just the one message descriptor, since it could embed other messages. I guess that would mean we'd need another arg as well -- the name of the message type within that set. I haven't played with this recently enough to be sure, though.
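That variant might look something like this (a hypothetical three-argument form; `fdSetBytes` is a placeholder for the encoded FileDescriptorSet):

```go
// Hypothetical three-argument builtin: message bytes, the whole encoded
// FileDescriptorSet, and the fully-qualified name of the type to decode.
func decodeDescriptors(db *sql.DB, fdSetBytes []byte) (*sql.Rows, error) {
	return db.Query(
		`SELECT proto_decode(descriptor, $1, 'cockroach.sql.sqlbase.Descriptor')
		   FROM system.descriptor`,
		fdSetBytes,
	)
}
```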

petermattis commented 4 years ago

Where do the desc bytes for the proto_decode() builtin come from? Unless specifying desc bytes is easy, the proto_decode() builtin doesn't seem like a large improvement over the current situation. Am I missing something here?

Adding a PROTO type seems like a much nicer long term solution. We're laying the groundwork for custom types in 20.2 with Enum. I suppose we should take care to make that support general.

dt commented 4 years ago

Yeah, I agree the native type sounds much better, but the builtins seem like a good place to start. I think the builtin wouldn't be that bad when used in an app, if that app has the proto linked in. I think it'd look something like

```go
eventProto, _ := (&EventDetail{}).Descriptor()
...
res, err := db.Query(
	`SELECT ... FROM events WHERE proto_decode(details, $1)->'category' = ...`, eventProto,
)
```

Interactive shell use would be a bit clunkier, but even then I could see it being better than nothing: I can ask protoc to dump the descriptor once and then paste it into a query like the above if I'm in a bind, which is still better than, say, having to go build a custom binary or something.

You could also imagine making a table proto_descs mapping name -> desc, then using a subquery to avoid going back to protoc and pasting repeatedly, e.g. `proto_decode(details, (SELECT desc FROM proto_descs WHERE name = 'EventDetail'))`, if you wanted -- see the sketch below.
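A minimal sketch of that lookup-table idea (all names illustrative; the sketch uses `descriptor` rather than `desc` for the column, since DESC is a reserved word):

```go
// Illustrative lookup table: store each descriptor once, keyed by name,
// then reference it by name instead of re-pasting the bytes each time.
func decodeViaLookup(db *sql.DB) (*sql.Rows, error) {
	if _, err := db.Exec(
		`CREATE TABLE IF NOT EXISTS proto_descs (name STRING PRIMARY KEY, descriptor BYTES)`,
	); err != nil {
		return nil, err
	}
	return db.Query(`
		SELECT proto_decode(details,
		       (SELECT descriptor FROM proto_descs WHERE name = 'EventDetail'))
		  FROM events`)
}
```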

But yeah, I think the ideal is a native column type that is parameterized with the file desc set and expected message name so you can set once and then just interact with it seamlessly.

thoszhang commented 4 years ago

Summarizing some discussions elsewhere: It would be good if this got us closer to building an index on the descriptor type in system.descriptor, to avoid scanning the entire table when, e.g., we only need to fetch user-defined types (as noted in the RFC).

lancerutkin commented 4 years ago

Sorry for the delay in updating, but I don't think I'll be able to get to this in a timely fashion. In case anyone else was looking to jump in, I don't want to block them.

cpustejovsky commented 1 year ago

To make sure I understand the issue as it currently stands: PR #52169 created the builtin functions, but cockroach is still not able to decode and encode proto messages stored in bytes columns given a user-supplied message descriptor, and there is still no native proto column type.

That would be the next step in tackling this issue, correct?

dt commented 1 year ago

The current builtin functions are internal-use only: they only allow de/encoding bytes from/to proto message types that are compiled directly into CockroachDB itself, where we use protobuf messages defined in the CockroachDB codebase to store things in system tables like jobs, schema descriptors, etc.
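(For reference -- and if I'm remembering the names from #52169 correctly -- those internal-use functions are `crdb_internal.pb_to_json` and `crdb_internal.json_to_pb`, so the debugging query from earlier in this thread looks roughly like:)

```go
// The existing internal-only builtin decodes only message types that are
// compiled into CockroachDB itself, named by their fully-qualified name.
func dumpDescriptors(db *sql.DB) (*sql.Rows, error) {
	return db.Query(`
		SELECT crdb_internal.pb_to_json('cockroach.sql.sqlbase.Descriptor', descriptor)
		  FROM system.descriptor`)
}
```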

I think this feature request is more about allowing a user-supplied message descriptor to be used to marshal or unmarshal messages of user-supplied message types, not just the built-in descriptors / message types. Once you have that facility, you could go further and persist a user-supplied message descriptor somewhere -- either where it can then be used by name, or potentially in a table schema alongside a bytes column or in a user-defined type, so that columns of that type get the automatic ser/de treatment?

Either way, we haven't yet added functions that allow interacting with anything other than pre-defined system-internal proto message types.

cpustejovsky commented 1 year ago

@dt I've made some progress on this and have the results of it in a separate repository https://github.com/cpustejovsky/filedescriptorjson

It is currently in a separate repository because I wasn't sure how to get a test FileDescriptor for the unit test from code inside the cockroach repo, since most of this functionality lives in the golang/protobuf library and isn't available within the cockroach repository.

It seems like using a FileDescriptor or FileDescriptorSet is ideal, as a bare MessageDescriptor won't always have sufficient context (here is a relevant conversation on the Gopher slack explaining).

It should be possible to use the new google.golang.org/protobuf package to do this and return a jsonb.JSON type. I think all I would need to test this is to add a generated pb.go file that has the FileDescriptor property, like the one seen here, and use that for the unit test. Is that something I could add in a PR, in addition to the functionality and unit tests?
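For what it's worth, here's a minimal sketch of that decode path using google.golang.org/protobuf (dynamicpb + protojson). The `decodeToJSON` helper and its signature are made up for illustration, not what the eventual builtin would look like:

```go
package protodemo

import (
	"fmt"

	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/reflect/protodesc"
	"google.golang.org/protobuf/reflect/protoreflect"
	"google.golang.org/protobuf/types/descriptorpb"
	"google.golang.org/protobuf/types/dynamicpb"
)

// decodeToJSON renders msgBytes as JSON, resolving the message type named
// fullName from the encoded FileDescriptorSet in fdSetBytes.
func decodeToJSON(fdSetBytes, msgBytes []byte, fullName string) ([]byte, error) {
	var fdSet descriptorpb.FileDescriptorSet
	if err := proto.Unmarshal(fdSetBytes, &fdSet); err != nil {
		return nil, err
	}
	files, err := protodesc.NewFiles(&fdSet)
	if err != nil {
		return nil, err
	}
	d, err := files.FindDescriptorByName(protoreflect.FullName(fullName))
	if err != nil {
		return nil, err
	}
	md, ok := d.(protoreflect.MessageDescriptor)
	if !ok {
		return nil, fmt.Errorf("%s is not a message type", fullName)
	}
	// Build a dynamic message for the descriptor and unmarshal into it.
	msg := dynamicpb.NewMessage(md)
	if err := proto.Unmarshal(msgBytes, msg); err != nil {
		return nil, err
	}
	return protojson.Marshal(msg)
}
```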

cpustejovsky commented 11 months ago

@dt I've written a PR (https://github.com/cockroachdb/cockroach/pull/116262) to address the first step.

I sorted out the problems I was having earlier around testing, but to handle linting errors I created wrapper functions TODOMarshal and TODOUnmarshal inside pkg/util/protoutil/marshal.go.

I marked them as TODO because I wasn't sure if this solution was acceptable and, if it is, what the names should be to properly distinguish them from the Marshal and Unmarshal functions that rely on gogo.

dt commented 11 months ago

Exciting stuff @cpustejovsky !

Let me ask around internally to see if I can find someone closer to this area of the codebase to work with you on answering some of those questions.

cpustejovsky commented 10 months ago

@dt were you able to find someone?

dt commented 10 months ago

Hi @cpustejovsky, I'm so sorry this slipped off my queue while I was out for a couple weeks over the holidays and I dropped the ball here. Let me ask around again and get back to you, and sorry again about the delay!

cpustejovsky commented 10 months ago

That's what I figured! No worries

cpustejovsky commented 9 months ago

@dt It looks like there won't be updates to this issue any time soon, based on this comment:

> Hi @cpustejovsky, we appreciate your contribution and the time you've invested in creating this PR. Unfortunately, our DB team is facing a high workload at the moment and we likely won't be able to review your PR anytime soon. My apologies for the inconvenience. We will prioritize this for review once we have more bandwidth available.
>
> If you have any questions or concerns in the meantime, please feel free to reach out. We appreciate your understanding and continued collaboration.