feat(`COPY TO`): hive partitioning support

melbourne2991 commented 3 months ago

Addresses https://github.com/GlareDB/glaredb/issues/2462

Provides hive partitioning support for Parquet and JSON.

Missing from this PR:

- Remaining formats (Lance, CSV, BSON)
- Reading hive partitioned files
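For readers unfamiliar with the layout, here is a minimal sketch of what hive-style partitioning on write amounts to: rows are grouped by their partition values and each group lands under a `column=value` directory. This is not the code in this PR; `hive_partition_paths`, the `keep_partition_columns` flag, and the sample rows are hypothetical names made up for illustration, and the flag only models the open question of whether partition columns stay in the written data.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def hive_partition_paths(rows, partition_by, base="out", keep_partition_columns=False):
    """Group rows by partition values and map each group to a hive-style directory.

    Hypothetical helper for illustration only; not GlareDB's implementation.
    """
    groups = defaultdict(list)
    for row in rows:
        # One directory level per partition column, in the order given.
        parts = [f"{col}={row[col]}" for col in partition_by]
        directory = str(PurePosixPath(base, *parts))
        if keep_partition_columns:
            payload = dict(row)
        else:
            # Drop the partition columns: their values are encoded in the path.
            payload = {k: v for k, v in row.items() if k not in partition_by}
        groups[directory].append(payload)
    return dict(groups)


rows = [
    {"region": "eu", "year": 2023, "amount": 10},
    {"region": "eu", "year": 2024, "amount": 7},
    {"region": "us", "year": 2023, "amount": 3},
    {"region": "eu", "year": 2023, "amount": 5},
]

layout = hive_partition_paths(rows, ["region", "year"])
for directory, payload in sorted(layout.items()):
    print(directory, payload)
```

Partitioning here uses a string and an integer column rather than a date, which is the kind of case worth covering in tests.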

CLAassistant commented 3 months ago

All committers have signed the CLA.

melbourne2991 commented 3 months ago

this all looks really good to me.

I'd love to see some tests on "what happens when you partition by something other than a date"

we should definitely open another ticket for "reading from hive-partitioned files." right now you can use our glob function, and that helps, but there is a push down projection that this might not be able to do. Clearly out of scope for this ticket, but it'd be a killer feature either way.
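To make the gap concrete, here is a rough sketch of what a hive-aware reader could do that a plain glob cannot: recover partition values from the directory names and skip files that cannot match before opening them. The function names and the `part-0.parquet` file names are assumptions for illustration, not anything GlareDB provides today.

```python
from pathlib import PurePosixPath


def parse_partition_values(path):
    """Extract key=value directory segments from a hive-style path."""
    values = {}
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            # Values come back as strings; a real reader would infer types.
            values[key] = value
    return values


def prune(paths, **wanted):
    """Keep only files whose partition values match every requested key."""
    return [
        p
        for p in paths
        if all(parse_partition_values(p).get(k) == str(v) for k, v in wanted.items())
    ]


# Hypothetical file listing, e.g. the result of expanding a glob.
paths = [
    "out/region=eu/year=2023/part-0.parquet",
    "out/region=eu/year=2024/part-0.parquet",
    "out/region=us/year=2023/part-0.parquet",
]

selected = prune(paths, region="eu", year=2023)
print(selected)
```

Only the files in matching directories would need to be read; the rest are skipped based on the path alone.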

I'd love to see a test with another file format (json or bson?) just to make sure that it's generic enough and doesn't rely on something parquet specific.

I think it'd be good to be explicit about the expectation that the partitioned field remains in the output data or is elided because it's in the partition, so a test there would be good.

Agree on all these points - thanks for the feedback. (Just a note: the PR isn't in its final form yet. The current test was primarily for development ease - more comprehensive tests are on the way!).

universalmind303 commented 2 months ago

marking as draft as it's not actively waiting on review.

@melbourne2991 please feel free to ping us when it is ready.

tychoish commented 2 months ago

@melbourne2991 wanted to check in on this. Is there anything I can do to help you on this?

melbourne2991 commented 2 months ago

hey @tychoish, apologies, I've been swamped lately. I'm not sure I'll have time to get around to this in any reasonable time frame. Happy for someone else to pick it up; there shouldn't be too much effort left on it, I hope.

GlareDB / glaredb

feat(`COPY TO`): hive partitioning support #2634