@freimer - Thanks very much for submitting this! How are you thinking about date-based partitioning? Would that be encompassed in this PR or a future improvement?
I was actually thinking of scrapping this... Sort of. Here's what I thought after I wrote the partition change:
Should partition configurations / specifications live at the end of the pipeline, or at the front end? Keep in mind that this part of Meltano, and Singer taps/targets in general, handles the initial data ingestion: getting data from a variety of sources into some data lake / data warehouse. After that initial EL process comes the T, possibly with dbt or other solutions. dbt can handle partitioning itself when taking "raw"-ish data from a landing zone or staging location and normalizing it into a canonical form in an analysis or consumption DB. What we are really talking about here is the initial landing zone or staging area.
In that case, do we really want "I want to partition this data this way" in the target, which should be able to handle any data sent to it, or in the tap/extractor, which by its very nature understands what the data is and how best to partition it, if at all? I'm thinking it makes sense to put the partition configuration in the source/tap rather than the destination/target.
What would be the best / most flexible way to implement that? Just as the Singer spec itself shows, you can have a string with a "format" attribute that tells the target the string is really a timestamp / date-time. Could we include a "partition-hint" attribute in the schema that a target could use, if it supports it (and no target would be required to), to partition the data? The schema is sent to the target before any data, so the target can either ignore that information or use it to partition the data based on the hints.
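To make this concrete, here is a minimal sketch of a SCHEMA message carrying such a hint. The `x-partition-hint` attribute name is made up for illustration; nothing like it exists in the Singer spec today.

```python
# Hypothetical SCHEMA message with a partition hint on the "region"
# field. A target that understands "x-partition-hint" could partition
# by region; any other target would just ignore the extra attribute.
schema_message = {
    "type": "SCHEMA",
    "stream": "orders",
    "key_properties": ["order_id"],
    "schema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "integer"},
            "region": {
                "type": "string",
                "x-partition-hint": True,  # made-up attribute, for illustration
            },
            "created_at": {"type": "string", "format": "date-time"},
        },
    },
}
```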
This would remove the current limitation of having one configuration for partitions, when a target may be sent several streams of data with different schemas. It would get unwieldy to have target configurations for partitions based on every possible type of schema sent to it, so again this probably makes more sense to include as an attribute in the schema.
So that's what I was thinking. And as for date-based partitions, those would fit in nicely with the partition hints in the schema. The suggested partition field could be a string with a "date-time" format, or perhaps a Unix epoch time specified as an Integer or Number plus an attribute saying to interpret it as a timestamp, along with an attribute indicating the suggested breakout (year only, year and month, specific day, hourly, etc.).
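Continuing the sketch above, a date-based hint might carry a granularity; again, the attribute names here are invented for illustration only.

```python
# Hypothetical date-based hint: partition on created_at, broken out by
# month. "granularity" could be "year", "month", "day", or "hour".
created_at_property = {
    "type": "string",
    "format": "date-time",
    "x-partition-hint": {"granularity": "month"},
}
```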
Actually, re-reading the spec itself, it seems a bit lacking in how this type of thing is handled. It has metadata, but it does not say that metadata, which lives in a catalog, is passed to a target in any way; only SCHEMA and RECORD items are. So either we would need to extend the spec to add additional attributes to the SCHEMA, or change/clarify the spec so that METADATA items can also be sent, which targets that don't support them would ignore and targets that do could use for various purposes.
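For what it's worth, the "ignore what you don't understand" behavior is cheap for a target to adopt. Here's a rough sketch, not taken from any existing target, of a message loop that would tolerate a future METADATA message type:

```python
import json
import sys

# Sketch of a target's message loop that skips unknown message types
# (e.g. a future METADATA item) instead of failing on them.
def consume(stream=sys.stdin):
    for line in stream:
        msg = json.loads(line)
        kind = msg.get("type")
        if kind == "SCHEMA":
            print("got schema for", msg["stream"])   # real target: store it
        elif kind == "RECORD":
            print("got record for", msg["stream"])   # real target: write it
        elif kind == "STATE":
            pass                                     # real target: emit state
        else:
            continue  # unknown type, e.g. METADATA: ignore rather than crash
```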
Either option has risk, depending on how targets are written, but I see no way around the goal. What do you think?
Closing due to inactivity.
Problem
There is currently no support for partitions in the target-athena loader. Additionally, all fields are forced to STRING.
Proposed changes
Added partition_keys as a target property. This is a list of fields that are partition keys. The loader uploads files to S3 with the partition keys inserted before the filename as required, and modifies the Athena query to include the appropriate PARTITION clause.
Note that partition keys are optional; as far as this new feature is concerned, behavior is unchanged if no partition keys are specified.
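As a rough illustration of the idea (the bucket, stream, and file names below are made up, and the exact config keys may differ from the code):

```python
# Illustrative target config with the new optional property.
config = {
    "s3_bucket": "my-data-lake",
    "partition_keys": ["year", "month"],
}

# With those keys set, an uploaded object gets Hive-style partition
# segments inserted before the filename, roughly:
#
#   s3://my-data-lake/orders/year=2021/month=04/orders-0001.csv
#
# and the generated Athena DDL gains a clause along the lines of:
#
#   PARTITIONED BY (year string, month string)
```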
The other change is to fix the field types. Currently, the option to set the field type from the schema type is commented out, and everything is forced to a STRING. Athena has specific types, just like any database, and not everything maps directly from JSON. I "fixed" this by mapping certain JSON types to the appropriate Athena type. Note that this is a breaking change, since it will change the types of fields loaded into Athena; however, they should now be the "correct" types.
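The mapping is along these lines (an illustrative sketch; the exact table in the code may differ):

```python
# Map JSON Schema primitive types to Athena column types, falling back
# to string for anything that has no direct equivalent.
JSON_TO_ATHENA = {
    "string": "string",
    "integer": "bigint",
    "number": "double",
    "boolean": "boolean",
}

def athena_type(json_type: str) -> str:
    return JSON_TO_ATHENA.get(json_type, "string")
```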