airbytehq / airbyte


Destination GCS: add hive partitioning in paths #15456

Open · zbrak opened 2 years ago

zbrak commented 2 years ago

Tell us about the problem you're trying to solve

Edit: the GCS connector 'Bucket Path' setup has similar (but incomplete) functionality to the S3 connector but is undocumented. Critically, the S3 connector includes an option to terminate the path with a trailing /, overriding existing ${namespace} and ${stream} values in the path. This screenshot shows the error thrown when trying to do the same in the GCS connector (0.2.10): [screenshot: Screen Shot 2022-08-09 at 12.54.24 PM]

The GCS connector has been really good for us so far. However, when coupling it with federated queries in BigQuery, query performance suffers because you are unable to partition the underlying external table.

I'm hoping to show that Airbyte can support robust partitioning with some simple path manipulation on insert.

Describe the solution you’d like

Edit: we've confirmed in the Airbyte Slack that much of this custom path functionality exists in the S3 connector, and is available but undocumented in the GCS connector.

I'd like to see an "Enable hive partitioning" option in the GCS destination configuration, with a field-name value. Hive partitioning is a default option for GCS external tables in BigQuery, and could really enhance a lot of lakehouse-type implementations. Link to the hive partition spec.

The current pathing, as it stands:

testing_bucket/data_output_path/public/users/2021_01_01_1609541171643_0.jsonl
↑              ↑                ↑      ↑     ↑          ↑             ↑ ↑
|              |                |      |     |          |             | format extension
|              |                |      |     |          |             partition id
|              |                |      |     |          upload time in millis
|              |                |      |     upload date in YYYY_MM_DD
|              |                |      stream name
|              |                source namespace (if it exists)
|              bucket path
bucket name

With a slight adjustment, we would enable BigQuery to infer a partition (where ${field_name} is the partition field with inferred type date; a code sketch follows the diagram):

testing_bucket/data_output_path/public/users/${field_name}=2021-01-01/1609541171643_0.jsonl
                                             ↑             ↑
                                             |             upload date as partition
                                             Chosen field name as partition field
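To make the proposed path change concrete, here is a minimal Java sketch of the key construction. This is not the actual connector code: the helper name, its parameters, and the upload_date field name are all hypothetical stand-ins for whatever the real getOutputFilename logic would do.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

public final class HivePathSketch {

    // Hypothetical helper: builds an object key like
    // data_output_path/public/users/upload_date=2021-01-01/1609541171643_0.jsonl
    static String hivePartitionedObjectKey(String bucketPath, String namespace,
            String stream, String fieldName, long uploadMillis, int partitionId,
            String extension) {
        LocalDate uploadDate = Instant.ofEpochMilli(uploadMillis)
                .atZone(ZoneOffset.UTC)
                .toLocalDate(); // LocalDate.toString() is ISO-8601, e.g. 2021-01-01
        return String.join("/",
                bucketPath, namespace, stream,
                fieldName + "=" + uploadDate, // hive-style partition directory
                uploadMillis + "_" + partitionId + "." + extension);
    }

    public static void main(String[] args) {
        System.out.println(hivePartitionedObjectKey(
                "data_output_path", "public", "users",
                "upload_date", 1609541171643L, 0, "jsonl"));
        // -> data_output_path/public/users/upload_date=2021-01-01/1609541171643_0.jsonl
    }
}
```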

Describe the alternative you’ve considered or used

The alternative I've been using, to avoid reading my entire bucket every time, is to rapidly refresh the external table definition, dynamically creating paths that filter to the day, e.g. testing_bucket/data_output_path/public/users/2021_01_01. A sketch of that refresh follows.
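For reference, a minimal sketch of that refresh using the google-cloud-bigquery Java client, assuming the external table already exists and the connector's YYYY_MM_DD file prefix shown above; the my_dataset and users_today names are hypothetical:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public final class RefreshExternalTable {
    public static void main(String[] args) {
        // Today's file prefix, matching the connector's YYYY_MM_DD layout
        String day = LocalDate.now(ZoneOffset.UTC)
                .format(DateTimeFormatter.ofPattern("yyyy_MM_dd"));
        String sourceUri =
                "gs://testing_bucket/data_output_path/public/users/" + day + "*";

        ExternalTableDefinition definition = ExternalTableDefinition
                .newBuilder(sourceUri, FormatOptions.json()) // newline-delimited JSON
                .setAutodetect(true)                         // let BigQuery infer the schema
                .build();

        // Repoints the external table at today's files only (hypothetical names)
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        bigquery.update(TableInfo.of(TableId.of("my_dataset", "users_today"), definition));
    }
}
```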

Additional context

External table definition link. Hive partitioning spec link.

Link to getOutputFilename in the GCS connector

Are you willing to submit a PR?

We do have a resource we can put on this; however, as this is such a core connector, I feel our specific implementation may run into problems when trying to establish consensus.

zbrak commented 2 years ago

I have updated the issue with the new details presented to me:

Edit: the GCS connector 'Bucket Path' setup has similar (but incomplete) functionality to the S3 connector but is undocumented. Critically, the S3 connector includes an option to terminate the path with a trailing /, overriding existing ${stream} values in the path. This screenshot shows the error thrown when trying to do the same in the GCS connector (0.2.10):

dis-sid commented 2 years ago

This would be great for all object storage destinations. Hive partitioning is almost standard practice: developers are used to reading from object storage that way, and BigQuery, Redshift, and Apache Spark all support hive partitioning. We can add hive partitioning as an extra step in our data pipeline, but having it done right away is more elegant; the sketch below shows what consumers get on the read side.
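To illustrate, a short sketch of Spark's standard partition discovery against the layout proposed above, assuming a hypothetical upload_date field name and a Spark cluster already configured with the GCS connector:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class ReadPartitionedUsers {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("read-hive-partitioned-users")
                .getOrCreate();

        // Partition discovery turns upload_date=... directories
        // into an upload_date column automatically
        Dataset<Row> users = spark.read()
                .json("gs://testing_bucket/data_output_path/public/users/");

        // Partition pruning: only the 2021-01-01 directory is scanned
        users.where("upload_date = '2021-01-01'").show();

        spark.stop();
    }
}
```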

kev-datams commented 1 year ago

Not directly linked to this issue, but probably worth including in the same release, as it also impacts the generated output file path on GCS: https://github.com/airbytehq/airbyte/issues/28528 🙏

octavia-squidington-iii commented 7 months ago

At Airbyte, we seek to be clear about the project priorities and roadmap. This issue has not had any activity for 180 days, suggesting that it's not as critical as others. It's possible it has already been fixed. It is being marked as stale and will be closed in 20 days if there is no activity. To keep it open, please comment to let us know why it is important to you and if it is still reproducible on recent versions of Airbyte.

kev-datams commented 6 months ago

Hello, it looks like this is still open. We need this feature in order to use hive partitioning to load data incrementally while limiting data retrieval costs.