zbrak opened this issue 2 years ago (status: Open)
I have updated the issue with new details presented to me:
Edit: the GCS connector 'Bucket Path' setup has similar (but incomplete) functionality to the S3 connector but is undocumented. Critically, the S3 connector includes an option to terminate the path with a `/`, overriding existing {stream} values in the path. This screenshot shows the error thrown when trying to perform the same functionality in the GCS connector (0.2.10).
This would be great for all object storage destinations. Hive partitioning is almost standard practice: developers are used to reading from object storage that way, and BigQuery, Redshift, and Apache Spark all use hive partitioning. We can add hive partitioning as an extra step in our data pipeline, but having it done right away is more elegant.
Not directly linked to this issue, but probably worth including in the same release since it also impacts the generated output file path on GCS: https://github.com/airbytehq/airbyte/issues/28528 🙏
At Airbyte, we seek to be clear about the project priorities and roadmap. This issue has not had any activity for 180 days, suggesting that it's not as critical as others. It's possible it has already been fixed. It is being marked as stale and will be closed in 20 days if there is no activity. To keep it open, please comment to let us know why it is important to you and if it is still reproducible on recent versions of Airbyte.
Hello, looks like this is still open. We need this feature in order to use hive partitioning for incrementally loaded data, limiting data retrieval costs.
Tell us about the problem you're trying to solve
Edit: the GCS connector 'Bucket Path' setup has similar (but incomplete) functionality to the S3 connector but is undocumented. Critically, the S3 connector includes an option to terminate the path with a `/`, overriding existing ${namespace} and ${stream} values in the path. This screenshot shows the error thrown when trying to perform the same functionality in the GCS connector (0.2.10).
The GCS connector has been really good for us so far. However, when trying to couple it with federated queries in BigQuery, it can hurt your optimization because you are unable to partition the underlying table.
I'm hoping to show that Airbyte can support robust partitioning with some simple path manipulation on insert.
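To make the problem concrete, here is an illustration of the two layouts (the exact file names and the created_at field are assumptions, not the connector's real output): BigQuery's hive partitioning only infers partition columns from key=value path segments, so only the second layout would let a federated external table prune by date.

```text
# Illustrative paths only; the exact Airbyte file naming is an assumption here.

# Current layout: nothing in the path identifies a partition, so a BigQuery
# external table over this prefix cannot prune by date.
gs://testing_bucket/data_output_path/public/users/2021_01_01_1609459200000_0.csv

# Hive-style layout: the created_at=2021-01-01 segment can be inferred as a
# DATE partition column, so queries filtering on it scan only matching files.
gs://testing_bucket/data_output_path/public/users/created_at=2021-01-01/1609459200000_0.csv
```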
Describe the solution you’d like
Edit: we've confirmed in the Airbyte Slack that much of this custom path functionality exists in the S3 connector and is available, but undocumented, in the GCS connector.
I'd like to see an "Enable hive partitioning" option in the GCS destination configuration, with a field name value. Hive partitioning is a default option for GCS external tables in BigQuery and could really enhance a lot of lakehouse-type implementations. Link to the hive partition spec.
The current pathing keeps every file for a stream under a single ${namespace}/${stream} prefix. With a slight adjustment we would enable BigQuery to infer a partition (where ${field_name} is the partition field, with inferred type `date`).
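A minimal sketch of that adjustment, assuming a hypothetical helper rather than the connector's actual getOutputFilename logic; the class, method, field, and file names below are placeholders:

```java
import java.time.LocalDate;

// Hypothetical sketch only, not the GCS connector's real code. It shows how an
// object key could gain a hive-style ${field_name}=value segment before upload.
public class HivePartitionedPathSketch {

  static String outputObjectKey(String bucketPath, String namespace, String stream,
                                String partitionField, LocalDate partitionValue,
                                String fileName) {
    // e.g. data_output_path/public/users/created_at=2021-01-01/part_0.parquet
    return String.join("/",
        bucketPath,
        namespace,
        stream,
        partitionField + "=" + partitionValue,  // hive-style key=value segment
        fileName);
  }

  public static void main(String[] args) {
    System.out.println(outputObjectKey("data_output_path", "public", "users",
        "created_at", LocalDate.of(2021, 1, 1), "part_0.parquet"));
  }
}
```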
Describe the alternative you’ve considered or used
The alternative I've been using, to avoid reading my entire bucket every time, is to rapidly refresh the external table definition, dynamically creating paths that filter to the day, i.e.
testing_bucket/data_output_path/public/users/2021_01_01
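A sketch of that workaround, assuming the google-cloud-bigquery Java client; the dataset, table, schema, and CSV format below are placeholders rather than what Airbyte actually writes:

```java
import com.google.cloud.bigquery.*;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Illustrative workaround only: recreate the external table every day so that
// its source URI is narrowed to the current day's path prefix.
public class RefreshDailyExternalTable {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    String day = LocalDate.now().format(DateTimeFormatter.ofPattern("yyyy_MM_dd"));
    // Matches only objects whose path starts with today's date segment.
    String sourceUri = "gs://testing_bucket/data_output_path/public/users/" + day + "*";

    Schema schema = Schema.of(
        Field.of("id", StandardSQLTypeName.INT64),
        Field.of("email", StandardSQLTypeName.STRING));
    ExternalTableDefinition definition =
        ExternalTableDefinition.of(sourceUri, schema, FormatOptions.csv());

    TableId tableId = TableId.of("my_dataset", "users_external");
    bigquery.delete(tableId);                            // drop yesterday's definition
    bigquery.create(TableInfo.of(tableId, definition));  // recreate scoped to today
  }
}
```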
Additional context
External table definition link. Hive partitioning spec link. (An illustrative sketch of a hive-partitioned external table definition follows below.)
Link to getOutputFilename in the GCS connector
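For completeness, a sketch (again assuming the google-cloud-bigquery Java client, with placeholder bucket, dataset, table, and format names) of the external table definition that a hive-partitioned layout would unlock:

```java
import com.google.cloud.bigquery.*;

// Illustrative only: an external table that lets BigQuery infer partition
// columns from key=value segments under the source URI prefix.
public class CreateHivePartitionedExternalTable {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    HivePartitioningOptions hiveOptions = HivePartitioningOptions.newBuilder()
        .setMode("AUTO")  // infer partition key names and types from the paths
        .setSourceUriPrefix("gs://testing_bucket/data_output_path/public/users/")
        .build();

    ExternalTableDefinition definition = ExternalTableDefinition
        .newBuilder("gs://testing_bucket/data_output_path/public/users/*",
            FormatOptions.parquet())
        .setAutodetect(true)  // detect the column schema from the files
        .setHivePartitioningOptions(hiveOptions)
        .build();

    bigquery.create(TableInfo.of(TableId.of("my_dataset", "users_external"), definition));
  }
}
```

With mode "AUTO", BigQuery names the partition column after the key in the path (the field_name from the proposal above) and prunes files when queries filter on it.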
Are you willing to submit a PR?
We do have a resource we can put on this; however, as this is such a core connector, I feel like our specific implementation may run into problems when trying to establish consensus.