CityofPittsburgh / data-rivers

Apache Airflow and Beam ETL scripts for the City of Pittsburgh's data analysis pipelines

Add DISTINCT Keyword to GCS Export Statement to Prevent Sharding #644

Closed jasonfic closed 9 months ago

jasonfic commented 9 months ago

Adding a DISTINCT clause to the direct export query forces all of the data onto a single worker, so the query results are written to one file instead of being sharded across several. The trade-off is that DISTINCT drops duplicate rows: if duplicates must be preserved in the export, we will need to use a standard BigQueryToCloudStorageOperator instead.
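For illustration, here is a minimal sketch of what the export statement could look like with DISTINCT added. The helper name `build_export_query`, the table ID, and the bucket path are all hypothetical and not taken from this repo. It assumes the export uses BigQuery's `EXPORT DATA` statement, whose `uri` option has to contain a single `*` wildcard even when only one file is expected.

```python
def build_export_query(table: str, gcs_uri: str, file_format: str = "JSON") -> str:
    """Build an EXPORT DATA statement that selects only DISTINCT rows.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    # EXPORT DATA expects exactly one '*' wildcard in the destination URI.
    if gcs_uri.count("*") != 1:
        raise ValueError("gcs_uri must contain exactly one '*' wildcard")
    return (
        "EXPORT DATA OPTIONS(\n"
        f"  uri='{gcs_uri}',\n"
        f"  format='{file_format}',\n"
        "  overwrite=true\n"
        ") AS\n"
        # DISTINCT is the change proposed in this issue; it also drops duplicate rows.
        f"SELECT DISTINCT * FROM `{table}`"
    )


# Placeholder table and bucket names (assumptions, not real resources).
query = build_export_query(
    "my-project.my_dataset.my_table",
    "gs://my-bucket/exports/my_table_*.json",
)
print(query)
```

The resulting string would be submitted as an ordinary standard SQL query job. Whether the output actually lands in a single file should still be checked against the bucket after a run.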