apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0
1.93k stars 984 forks source link

DRILL-8375: Support for non-projected complex vectors #2867

Closed paul-rogers closed 6 months ago

paul-rogers commented 6 months ago

Support for non-projected complex vectors

Description

The EVF mechanism provides scan-time projection for many vector types. The reader code is simple: it deserializes all columns for formats such as JSON, CSV, etc., and writes them to the ColumnWriter objects. Internally, EVF simply ignores the data for unprojected columns. This solution simplifies the readers: it is not necessary for each reader to include the complex code to handle projection. This solution is also performant: projection is done at scan time rather than the other approach, which is to read all data into vectors, then allow a PROJECT operator to drop the unprojected columns.

Present EVF projection support handles most scalar and "well-structured" columns (repeated types AKA arrays, maps, etc.) However it does not handle the more esoteric types UNION, LIST (AKA repeated UNION), REPEATED LIST (AKA repeated, repeated UNION). This PR provides more support, though holes remain. In particular, both the UNION and LIST types now provide scan-time non-projection support.

Documentation

This is an internal feature: no user-visible documentation is required.

Testing

Extended existing EVF-related unit tests. There is an entire new test file to exercise the conditions of interest.

cgivre commented 6 months ago

Once we merge this, we should also rebase https://github.com/apache/drill/pull/2515 on the current master and merge that as well.

paul-rogers commented 6 months ago

Thanks @cgivre for the comments and review. @luocooong, I'll commit this. When convenient, please see if this addresses the issue you raised long ago. Otherwise, these capabilities are available for the next person who is seduced into trying out the UNION-based types.

jnturton commented 6 months ago

I think we can regard this as a bug fix to framework code already in use in 1.21 and therefore backport it.

paul-rogers commented 6 months ago

Backporting should be safe: as safe as having the change in master itself.

paul-rogers commented 6 months ago

I successfully squashed the commits, and provided a proper commit message, while preserving the later commit. Did a force push to master to rewrite history.

You should update your own master to pick up the revised history.

@cgivre, the .asf.yaml file you mentioned has lots of metadata, but does not actually prevent a force push. Perhaps we are missing something? It would generally be a good idea to forbid such things to prevent catastrophic mistakes.

cgivre commented 6 months ago

I successfully squashed the commits, and provided a proper commit message, while preserving the later commit. Did a force push to master to rewrite history.

You should update your own master to pick up the revised history.

You are a braver man than I.

@cgivre, the .asf.yaml file you mentioned has lots of metadata, but does not actually prevent a force push. Perhaps we are missing something? It would generally be a good idea to forbid such things to prevent catastrophic mistakes.

Thanks for flagging... I'm not sure how to do that, but I'll investigate.

jnturton commented 6 months ago

@cgivre, the .asf.yaml file you mentioned has lots of metadata, but does not actually prevent a force push. Perhaps we are missing something? It would generally be a good idea to forbid such things to prevent catastrophic mistakes.

Oh that's interesting. Something's changed since I last went through this with @vvysotskyi to do something that could only be done on master, perhaps testing of the automatic snapshot artifact publishing which requires access to GitHub Actions secrets.