astronomer / apache-airflow-provider-transfers

https://apache-airflow-provider-transfers.rtfd.io/
Apache License 2.0

Split the Table into multiple Files #25

Open · sunank200 opened this issue 1 year ago

sunank200 commented 1 year ago

Please describe the feature you'd like to see

We could also support writing the output to multiple files. This is a good-to-have, not a must-have feature.

  1. We could also produce multiple files if the table holds data in the GB range.
  2. What should the naming scheme be? For example: test.csv -> test_1.csv, test_2.csv.
  3. We can assume a default file_size_threshold; once it is reached, the data is split into multiple files (see the sketch after the example below).

    ```
    # Import paths follow the project's example DAGs and may vary by version.
    from universal_transfer_operator.constants import FileType
    from universal_transfer_operator.datasets.file.base import File
    from universal_transfer_operator.datasets.table import Metadata, Table
    from universal_transfer_operator.universal_transfer_operator import UniversalTransferOperator

    transfer_non_native_bigquery_to_sqlite = UniversalTransferOperator(
        task_id="transfer_non_native_bigquery_to_sqlite",
        source_dataset=Table(
            name="uto_s3_to_bigquery_table",
            conn_id="google_cloud_default",
            metadata=Metadata(schema="astro"),
        ),
        destination_dataset=File(
            name="uto_bigquery_to_sqlite_table",
            type=FileType.PARQUET,
            conn_id="sqlite_default",
        ),
        # file_size_threshold="500MB",  # proposed parameter, not yet implemented
    )

    # Assume: file_size_threshold=1GB
    #
    # OUTPUT:
    # uto_bigquery_to_sqlite_table_1 <- 1GB
    # uto_bigquery_to_sqlite_table_2 <- 100MB
    ```
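Here is a minimal sketch of the threshold-based split and the `_1`, `_2` naming scheme proposed above. `parse_size`, `export_table_to_files`, and the CSV-based batching are hypothetical names and choices introduced only for illustration; they are not part of the current API:

```
import os
from pathlib import Path
from typing import Iterable, Iterator

import pandas as pd


def parse_size(size: str) -> int:
    """Convert a human-readable size such as '500MB' or '1GB' to bytes."""
    units = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}
    for suffix, factor in units.items():
        if size.upper().endswith(suffix):
            return int(float(size[: -len(suffix)]) * factor)
    return int(size)


def export_table_to_files(
    batches: Iterable[pd.DataFrame],
    base_path: str,
    file_size_threshold: str = "1GB",
) -> Iterator[str]:
    """Write row batches to <stem>_1<ext>, <stem>_2<ext>, ... and roll
    over to the next file once the current one reaches the threshold."""
    threshold = parse_size(file_size_threshold)
    stem, ext = os.path.splitext(base_path)
    part = 1
    current = f"{stem}_{part}{ext}"
    for batch in batches:
        # Append the batch; write the header only when the file is new.
        batch.to_csv(current, mode="a", index=False, header=not os.path.exists(current))
        # Roll over to the next part once the threshold is crossed.
        if Path(current).stat().st_size >= threshold:
            yield current
            part += 1
            current = f"{stem}_{part}{ext}"
    # Emit the final, possibly smaller, part (e.g. the trailing 100MB file).
    if os.path.exists(current):
        yield current
```

Called with batches streamed from the source table, e.g. `list(export_table_to_files(batches, "test.csv", file_size_threshold="500MB"))`, this yields `test_1.csv`, `test_2.csv`, and so on, matching the naming scheme in point 2.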

Describe the solution you'd like

Exporting a huge table into multiple smaller files allows users to effectively parallelise the transformation afterwards, using tools like Spark and Beam.
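For example (hypothetical downstream code; the bucket path and column name are placeholders), the split Parquet files can be read back with a single glob and processed in parallel by Spark:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-split-export").getOrCreate()

# The glob matches every part file produced by the split, so Spark can
# distribute the parts across tasks and parallelise the transformation.
df = spark.read.parquet("s3://my-bucket/uto_bigquery_to_sqlite_table_*")
df.groupBy("some_column").count().show()
```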

Additional context

More details at: notion doc

Acceptance Criteria