Please describe the feature you'd like to see
Currently, files are loaded into the table sequentially. Alternatively, we can use get_file_list with load_file to load them in parallel. However, get_file_list can return hundreds of files, which would result in hundreds of tasks.
Describe the solution you'd like
We could take a partial approach where get_file_list groups the files into buckets, with the maximum bucket size specified in the call: get_file_list(path=GCS_BUCKET, conn_id=ASTRO_GCP_CONN_ID, max_bucket_size=3).
load_file should then be able to process a list of files, so that we can load files in parallel while also controlling the number of tasks generated. For example, if GCS_BUCKET holds six files and max_bucket_size=3, only two items are generated, which results in two tasks rather than six.
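The bucketing idea above can be sketched in plain Python. This is only an illustration under stated assumptions: bucket_files is a hypothetical helper, not an existing part of get_file_list or the SDK, and the file names are made up to stand in for the contents of GCS_BUCKET.

```python
from typing import List


def bucket_files(files: List[str], max_bucket_size: int) -> List[List[str]]:
    """Split files into consecutive buckets of at most max_bucket_size items."""
    if max_bucket_size < 1:
        raise ValueError("max_bucket_size must be a positive integer")
    return [
        files[start : start + max_bucket_size]
        for start in range(0, len(files), max_bucket_size)
    ]


# Hypothetical file names standing in for the six files in GCS_BUCKET.
files = [f"gs://example-bucket/file_{i}.csv" for i in range(1, 7)]

# Six files with a bucket size of three give two buckets.
buckets = bucket_files(files, max_bucket_size=3)
for bucket in buckets:
    print(bucket)
```

Each inner list would be handed to a single load task, so the number of tasks equals the number of buckets rather than the number of files.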
Are there any alternatives to this feature?
Open to suggestions
Acceptance Criteria
[ ] All checks and tests in the CI should pass
[ ] Unit tests (90% code coverage or more, once available)
[ ] Integration tests (if the feature relates to a new database or external service)
[ ] Example DAG
[ ] Docstrings in reStructuredText for all methods, classes, functions and module-level attributes (including an Example DAG showing how the feature should be used)
[ ] Exception handling in case of errors
[ ] Logging (are we exposing useful information to the user? e.g. source and destination)
[ ] Improve the documentation (README, Sphinx, and any other relevant)