What would be the expected behavior given an object like this:

```json
{
  "foo": {
    "a": 1,
    "b": true,
    "c": {
      "c-a": 9,
      "c-b": "s"
    }
  },
  "bar": "ss"
}
```
Currently, we would recommend the columns `foo/a`, `foo/b`, `foo/c/c-a`, `foo/c/c-b`, and `bar`.
If we naively recommend all objects by default, it would potentially duplicate the projections of individual properties.
One option would be to recommend top-level objects by default, but not any properties within them. So you'd end up with the columns `foo` and `bar`. TBH, I think I'd personally prefer the current approach, where we project separate columns for each of the scalar properties, though I suspect I may not be understanding the problems with the current approach. Are you able to elaborate on how the current behavior is undesirable?
I think we're just talking about top-level objects. All objects with a single-component JSON Pointer would be recommended. All others would not.
So should we stop recommending any nested scalar fields? To use the previous example, it would seem really weird to recommend both `foo` and `foo/a|b|etc`. It seems like we should do one or the other. My own intuition is that folks would rather have the individual fields when materializing into a database table, but I'm curious about the scenarios where that's a problem.
🤔 that's a great point (and this is the danger of rapid-fire responses without fully thinking it through). Thoughts @dyaffe ?
Perhaps an illustrative example would help, but this request is definitely in tension with the way we currently recommend nested properties under top-level objects.
I might have actually gotten the request wrong. Looking at it again, it looks like the issue was actually with an array field. Objects are recommended via their recommended projections, but arrays are not, right?
Maybe what we should do here is instead recommend array fields and do nothing with objects. Let me know your thoughts though @jgraettinger and @psFried
Do we want this just for top-level arrays, or also nested arrays? I'm presuming nested arrays as well, so that both `/an_array` and `/something/an_array` are recommended columns (`an_array` and `something/an_array`, respectively).
One other callout is that this would put further pressure on size restrictions within warehouses. I think we'd need to pair it with more strategies for bounding the complexity of documents when we have to.
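As a tiny illustration of the pointer-to-column translation above (a sketch only; the real mapping also has to handle JSON Pointer escapes like `~0`/`~1`, which this ignores):

```go
package main

import (
	"fmt"
	"strings"
)

// columnName sketches the translation implied above: the recommended column
// name is the JSON Pointer minus its leading slash.
func columnName(ptr string) string {
	return strings.TrimPrefix(ptr, "/")
}

func main() {
	fmt.Println(columnName("/an_array"))           // an_array
	fmt.Println(columnName("/something/an_array")) // something/an_array
}
```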
That makes sense to me.
It does seem like we should do something intelligent around not syncing arrays with more than x values, or only syncing the first y, or something?
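One possible shape for the "first y values" idea, as a sketch (Go, illustrative; the cutoff, and whether to drop or summarize the tail, are open questions):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// truncateArrays returns a copy of v in which every array is capped at
// maxElems elements, recursing through nested objects and arrays, as one
// hypothetical way to bound document size before materializing.
func truncateArrays(v interface{}, maxElems int) interface{} {
	switch vv := v.(type) {
	case map[string]interface{}:
		out := make(map[string]interface{}, len(vv))
		for k, child := range vv {
			out[k] = truncateArrays(child, maxElems)
		}
		return out
	case []interface{}:
		if len(vv) > maxElems {
			vv = vv[:maxElems] // keep only the first maxElems values
		}
		out := make([]interface{}, len(vv))
		for i, child := range vv {
			out[i] = truncateArrays(child, maxElems)
		}
		return out
	default:
		return v
	}
}

func main() {
	doc := map[string]interface{}{"tags": []interface{}{"a", "b", "c", "d"}}
	b, _ := json.Marshal(truncateArrays(doc, 2))
	fmt.Println(string(b)) // {"tags":["a","b"]}
}
```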
Another one that I just ran into with another client -- this may be the actual issue. When a field can take multiple types (array, object, string, int, etc.), we don't materialize it by default. In this case, the field is most likely actually an object, and we should materialize it by default.
@jgraettinger does this make more sense?
@dyaffe in the case of a field potentially being multiple different types, it has a representation internal to the connector that is distinct from an object: a `MULTIPLE` pseudo-type, in `materialize-sql` parlance. That is indeed a field that is not materialized by default (optional, instead of recommended). Practically, these fields are materialized into the same column type as object fields, typically the equivalent of a `JSON` column in the endpoint.
Anyway, it would be possible to change these `MULTIPLE` types to recommended instead of optional, and that would be separate from fields that are singularly typed as objects. The most direct way of doing that would result in all current materializations adding these columns to their tables (if they weren't already explicitly specified) the next time they are published. This would also apply to recommending arrays, where previously unmaterialized columns would be added on re-publication of existing specs, which may not be desirable in all cases. And just for thoroughness: these `MULTIPLE` types may be arrays as well, which comes with the same document-size concerns and warehouse restrictions mentioned above.
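To summarize the current vs. proposed constraints, a simplified sketch (Go; `FieldKind` and the constraint strings are illustrative stand-ins, not the actual `materialize-sql` types):

```go
package main

import "fmt"

// FieldKind is a simplified stand-in for the type information that
// materialize-sql derives from a projection.
type FieldKind int

const (
	Scalar FieldKind = iota
	Object
	Array
	Multiple // the MULTIPLE pseudo-type: a field allowing several JSON types
)

var kindNames = []string{"scalar", "object", "array", "MULTIPLE"}

// constraintFor sketches the decision being discussed. Today scalars are
// recommended, while arrays and MULTIPLE fields are optional (objects are
// covered indirectly, through their recommended scalar properties). The
// proposed change flips arrays and MULTIPLE fields to recommended,
// materialized into a JSON-typed column like object fields are.
func constraintFor(k FieldKind, proposed bool) string {
	switch k {
	case Scalar:
		return "recommended"
	case Array, Multiple:
		if proposed {
			return "recommended (JSON column)"
		}
		return "optional (JSON column if selected)"
	default: // Object
		return "optional; scalar properties recommended individually"
	}
}

func main() {
	for _, k := range []FieldKind{Scalar, Object, Array, Multiple} {
		fmt.Printf("%-8s -> %s\n", kindNames[k], constraintFor(k, true))
	}
}
```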
Given the issues we've seen previously when changing optional fields to recommended (ex: changing `_meta/op` from optional to recommended), ideally we'd also have more sophistication in the connector, so that it could inspect an existing table when adding a binding and do "something" if a) the table already exists and b) the columns it needs aren't there...maybe adding the columns automatically, although I'm not sure if that is necessarily always desirable. This would address the problem of users re-adding materialization bindings that were removed long ago, with the tables left in place from before the additional fields were recommended, or of users removing & re-adding a binding without an intermediate publication that does the `alter table add column ...` for the new recommended fields.
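Something like this sketch of the inspect-and-add step (Go with `database/sql`; the driver, DSN, table, and column names are all placeholders, and real code would quote identifiers and derive the column list from the binding's field selection):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // hypothetical driver choice; other endpoints would differ
)

// addMissingColumns inspects an existing table via information_schema and
// issues an ALTER TABLE ... ADD COLUMN for each wanted column that isn't
// already present. Identifier quoting is elided for brevity.
func addMissingColumns(db *sql.DB, table string, want map[string]string) error {
	rows, err := db.Query(
		`SELECT column_name FROM information_schema.columns WHERE table_name = $1`, table)
	if err != nil {
		return err
	}
	defer rows.Close()

	have := map[string]bool{}
	for rows.Next() {
		var name string
		if err := rows.Scan(&name); err != nil {
			return err
		}
		have[name] = true
	}
	if err := rows.Err(); err != nil {
		return err
	}

	for col, typ := range want {
		if !have[col] {
			// The column type would be the endpoint's JSON-ish type
			// (e.g. JSON / JSONB / VARIANT, depending on the warehouse).
			stmt := fmt.Sprintf("ALTER TABLE %s ADD COLUMN %s %s", table, col, typ)
			if _, err := db.Exec(stmt); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/example?sslmode=disable") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	if err := addMissingColumns(db, "my_table", map[string]string{"foo": "JSON"}); err != nil {
		log.Fatal(err)
	}
}
```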
We'd need to take care of the more common case of these new columns being automatically added to ongoing bindings on new publications as well. The inclusion of arrays by default in particular may be pretty widespread. It feels a little bit like we need some mechanism for saying "if this materialization was created before X, recommend these fields; otherwise, recommend some other fields", and that may actually solve the problem with removing & re-adding bindings also.
@williamhbaker this makes sense. I guess the other alternative would be adding an option to materialize historical data either via a backfill or some other means when a user adds a column to their materialization.
I'm curious what you think would make more sense here. I spoke to one client yesterday, for example, and they found that they have 36 tables which don't have all the columns they want, and they need to do something about it.
I think the reason that I'm suggesting recommending so many fields by default is because it's hard to add fields now. If we made that easier, it could be the alternative.
I think I lean towards recommending more fields, but I'm curious to hear your thoughts.
Being able to re-backfill a table by changing a `backfillVersion` property or similar would probably help some when it comes to adding columns, since that would result in the added columns getting their data added, along with reloading everything else. Somebody's going to have to configure all 36 of those tables with the needed columns and re-backfill them all though, in your example.
Recommending more fields does make sense to me, it's mostly a matter of how existing tables are migrated.
I'm on the 36 table problem.
And yeah, agreed that recommending more fields makes sense. I don't have a great answer for this field recommendation thing. I don't really think it's a great idea to change every active materialization in our system to add new columns that weren't previously recommended. It would be nice if we could have a way to recommend these fields only on materializations that are newer than date X: "recommended as of" or something, but I'm not sure if that would be at all practical.
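As a sketch of that hypothetical "recommended as of" gate (Go; the type, field names, and dates are all made up for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// FieldClass pairs a class of fields (e.g. arrays, MULTIPLE types) with the
// date it became recommended. Both the type and the mechanism are invented
// here to illustrate the idea.
type FieldClass struct {
	Name          string
	RecommendedAt time.Time
}

// recommendField returns true only for materializations created on or after
// the date the field class became recommended, so older specs keep their
// current column set on re-publication.
func recommendField(class FieldClass, materializationCreatedAt time.Time) bool {
	return !materializationCreatedAt.Before(class.RecommendedAt)
}

func main() {
	arrays := FieldClass{Name: "arrays", RecommendedAt: time.Date(2023, time.June, 1, 0, 0, 0, 0, time.UTC)}
	oldSpec := time.Date(2023, time.January, 15, 0, 0, 0, 0, time.UTC)
	newSpec := time.Date(2023, time.August, 1, 0, 0, 0, 0, time.UTC)
	fmt.Println(recommendField(arrays, oldSpec)) // false: created before the cutover
	fmt.Println(recommendField(arrays, newSpec)) // true: created after
}
```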
**Problem:** We currently don't recommend objects in materializations, and this nearly always ends up being a problem. Users reach out and say "data is missing".

We should include objects by default in SQL materializations.