Correction: `upload_to_gcs` can start before all datasets finish, but a single large dataset still delays it significantly. This may be because no free workers are available to run the upload tasks.

The main issue stands: find a way to prioritise `upload_to_gcs` for a converted dataset over converting other datasets.
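As a sketch of what "free workers" means here (assuming the flow runs on Prefect 1.x with a local Dask executor; the executor choice and worker count are illustrative, not necessarily what `statline-bq` uses):

```python
from prefect.executors import LocalDaskExecutor

# With only a few workers, long-running conversion tasks can occupy every slot,
# so an upload_to_gcs run that is already schedulable still has to wait.
# Raising num_workers leaves headroom for uploads to start sooner.
# (`flow` here stands for the statline-bq flow object.)
flow.run(executor=LocalDaskExecutor(scheduler="threads", num_workers=8))
```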
Effectively closed by #86
Currently, when running a `statline-bq` flow over multiple datasets, the `upload_to_gcs` task starts only once all datasets have been converted to parquet. This is caused by listing `files_parquet` as one of the `upstream_tasks` dependencies here.

This was done because the upload is directed to a folder, so it is a non-data dependency. It cannot simply be removed, as that would cause the `upload_to_gcs` task to run before the files have been converted.

As a result, if one of 10 datasets is significantly larger than the rest, the other 9 are not uploaded until it completes, and if an error occurs in the meantime (e.g. the VM shuts down), all of them are unnecessarily erased.
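A minimal sketch of the dependency described above, assuming a Prefect 1.x flow (task names, arguments and dataset ids are illustrative, not the actual `statline-bq` code):

```python
from prefect import Flow, task

@task
def convert_to_parquet(dataset_id: str) -> str:
    # Convert a single dataset and return the folder holding its parquet files.
    return f"./parquet/{dataset_id}"

@task
def upload_to_gcs(folder: str) -> None:
    # Upload everything under `folder` to GCS (details omitted).
    print(f"uploading {folder}")

with Flow("statline-bq-sketch") as flow:
    dataset_ids = ["83583NED", "84799NED", "83765NED"]  # hypothetical ids
    files_parquet = convert_to_parquet.map(dataset_ids)
    # Listing the whole mapped task as an upstream dependency means the upload
    # cannot start until every dataset has finished converting.
    upload_to_gcs("./parquet", upstream_tasks=[files_parquet])
```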
This dependency should be more nuanced: it should prioritise uploading each dataset as soon as it is ready.
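One possible direction (a sketch only, not necessarily how #86 solves it) is to make the dependency per-dataset by mapping the upload as well, reusing the task definitions from the sketch above:

```python
with Flow("statline-bq-sketch") as flow:
    dataset_ids = ["83583NED", "84799NED", "83765NED"]  # hypothetical ids
    # Each dataset's conversion returns its own parquet folder...
    files_parquet = convert_to_parquet.map(dataset_ids)
    # ...and each upload is paired with exactly one converted dataset, so a
    # single large dataset no longer blocks uploading the others.
    upload_to_gcs.map(files_parquet)
```

Passing the per-dataset folder through the task result turns the folder-level, non-data dependency into an ordinary data dependency, which the scheduler can then run per dataset as soon as each conversion finishes.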