breedfides / airflow-etl


Implement writing to S3 storage at de.NBI #13

Closed gannebamm closed 5 months ago

gannebamm commented 6 months ago

We have 80 GB of object storage for the pilot ready. Please use it to write the DAG outputs into a web-accessible folder. Since the data that will be written is not private, you do not have to implement additional security measures.

More info regarding our object storage provider: https://cloud.denbi.de/wiki/Compute_Center/Bielefeld/#object-storage

@feserm @arendd: After a successful write operation, should a JSON payload with the DAG-RUN-ID and the storage location of its outputs be sent to a URL? How exactly do you want to capture that event? We can also schedule a joint call for this.

gannebamm commented 6 months ago

As per today's meeting:

API info: https://airflow.apache.org/docs/apache-airflow/2.8.2/stable-rest-api-ref.html#section/Overview

If a DAG run is triggered by a POST request, you will receive a JSON response containing its DAG-RUN-ID. See https://airflow.apache.org/docs/apache-airflow/2.8.2/stable-rest-api-ref.html#operation/post_dag_run
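A minimal sketch of that trigger call, using only the standard library. The base URL is taken from this thread; the DAG id, authentication setup, and the `extract_dag_run_id` helper are assumptions for illustration:

```python
# Hedged sketch: trigger a DAG run via the Airflow 2.8 REST API
# (POST /dags/{dag_id}/dagRuns) and read the DAG-RUN-ID from the response.
# Auth (e.g. basic auth or a session cookie) is omitted and assumed to be
# configured separately.
import json
import urllib.request

AIRFLOW_BASE = "https://breedfides-airflow.bi.denbi.de/api/v1"  # from this issue


def extract_dag_run_id(payload: dict) -> str:
    """Pull the run id out of the JSON body returned by post_dag_run."""
    return payload["dag_run_id"]


def trigger_dag_run(dag_id: str, conf: dict = None) -> str:
    """POST a new DAG run and return its DAG-RUN-ID."""
    body = json.dumps({"conf": conf or {}}).encode()
    req = urllib.request.Request(
        f"{AIRFLOW_BASE}/dags/{dag_id}/dagRuns",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return extract_dag_run_id(json.load(resp))
```

The returned id (e.g. `manual__2024-02-29T12:32:25.203149+00:00`) is what the storage path and the status query below key on.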

This will be used to store the output and also query the current status of the DAG run.

storage: The OBS path shall use the DAG-RUN-ID like this:

https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/BreedFidesETL-OBS/manual__2024-02-29T12:32:25.203149+00:00/radiation_global_20240228_0949.nc

where manual__2024-02-29T12:32:25.203149+00:00 is the DAG-RUN-ID and radiation_global_20240228_0949.nc is one of the files. All environmental output files shall be placed there.
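The path convention above could be implemented roughly like this. Endpoint and bucket name are taken from the example URL; the use of boto3 against the Swift S3-compatible API, and the credential setup, are assumptions:

```python
# Hedged sketch: upload a DAG output file under the <DAG-RUN-ID>/<filename>
# key convention from this issue, via the S3-compatible endpoint of the
# de.NBI Bielefeld Swift store.
ENDPOINT = "https://openstack.cebitec.uni-bielefeld.de:8080"
BUCKET = "BreedFidesETL-OBS"


def object_key(dag_run_id: str, filename: str) -> str:
    """Key convention agreed in this issue: <DAG-RUN-ID>/<filename>."""
    return f"{dag_run_id}/{filename}"


def object_url(dag_run_id: str, filename: str) -> str:
    """Public Swift URL for one uploaded output file."""
    return f"{ENDPOINT}/swift/v1/{BUCKET}/{object_key(dag_run_id, filename)}"


def upload_output(local_path: str, dag_run_id: str, filename: str) -> str:
    """Upload one output file and return its web-accessible URL."""
    import boto3  # assumed dependency; deferred so the helpers above need no extras

    s3 = boto3.client("s3", endpoint_url=ENDPOINT)  # credentials via env/config
    s3.upload_file(local_path, BUCKET, object_key(dag_run_id, filename))
    return object_url(dag_run_id, filename)
```

With the DAG-RUN-ID and filename from the example, `object_url` reproduces the sample URL above.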

current status of the DAG run: The endpoint https://airflow.apache.org/docs/apache-airflow/2.8.2/stable-rest-api-ref.html#operation/get_dag_run (e.g. https://breedfides-airflow.bi.denbi.de/api/v1/ui/#/DAGRun/get_dag_run) can be used to query a specific DAG run. The response includes its current state, which the frontend can poll to get notified when the run has finished.
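A polling loop against that endpoint might look like the sketch below. The terminal state names follow the Airflow DAG-run state model; the polling interval and lack of auth are assumptions:

```python
# Hedged sketch: poll GET /dags/{dag_id}/dagRuns/{dag_run_id} until the run
# reaches a terminal state ("success" or "failed" in Airflow's state model).
import json
import time
import urllib.request

TERMINAL_STATES = {"success", "failed"}


def is_finished(run: dict) -> bool:
    """True once the DAG run's state is terminal."""
    return run.get("state") in TERMINAL_STATES


def wait_for_dag_run(base: str, dag_id: str, dag_run_id: str,
                     interval: float = 10.0) -> dict:
    """Block until the given DAG run finishes, returning its final JSON."""
    url = f"{base}/dags/{dag_id}/dagRuns/{dag_run_id}"
    while True:
        with urllib.request.urlopen(url) as resp:
            run = json.load(resp)
        if is_finished(run):
            return run
        time.sleep(interval)
```

A frontend would more likely poll on a timer than block, but the state check is the same either way.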