breedfides / airflow-etl


Implement writing to S3 storage at de.NBI #14

Closed brightemahpixida closed 5 months ago

brightemahpixida commented 6 months ago

Overview: This pull request adds a new task to each of the three DAGs (fetch_cdc_radiation_DAG, fetch_cdc_air_temp_DAG, fetch_gpkg_soil_data_DAG). The new task writes the clipped output to an S3 bucket.

Changes Made:
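A minimal sketch of the kind of upload task described above, using Airflow's S3Hook; the connection id, helper name, and example key are illustrative, not necessarily what the PR implements:

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def write_clipped_output_to_s3(local_path: str, key: str) -> None:
    """Upload one clipped output file to the object store.

    'denbi_s3' is a hypothetical Airflow connection pointing at the
    de.NBI S3-compatible endpoint.
    """
    hook = S3Hook(aws_conn_id="denbi_s3")
    hook.load_file(
        filename=local_path,
        key=key,                       # e.g. "soil/clipped_output.gpkg"
        bucket_name="BreedFides-OBS",
        replace=True,
    )
```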

gannebamm commented 6 months ago

Thanks @brightemahpixida for the fast work you have done here. Is this already deployed in our proof-of-concept infrastructure at de.NBI? I would like to test it there.

gannebamm commented 6 months ago

I see this is an implementation not only for soil but for all DAGs. Please confirm and change the title.

The last open question is how to send the S3 file location to the frontend. There is an open question in issue #13 for @arendd and @feserm.

brightemahpixida commented 6 months ago

Hi @gannebamm, I just got your message now; I was about to send you an update on this :)

To your earlier question about whether this has been deployed to the de.NBI instance: yes, it has, and you can also test it out on your end.

Before working on this, I created an object store container on the de.NBI dashboard named BreedFides-OBS; the clipped outputs will be stored in this container. If you navigate to the object store panel on the de.NBI dashboard, you should see that three distinct folders have already been created: air_temperature_mean, radiation_global and soil (see the attached screenshot). All three folders will hold the clipped outputs.

I was hoping to find out whether you intend to download the outputs this way, via the de.NBI dashboard.

[Screenshot: BreedFides-OBS]
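As an alternative to the dashboard, the outputs could also be fetched programmatically; a sketch with boto3, where the endpoint variable and object key are placeholders rather than values defined by this PR:

```python
import os

import boto3

# Endpoint and key are placeholders; credentials are picked up from the
# usual boto3 sources (environment variables, config files, etc.).
s3 = boto3.client("s3", endpoint_url=os.environ["DENBI_S3_ENDPOINT"])
s3.download_file(
    Bucket="BreedFides-OBS",
    Key="soil/clipped_output.gpkg",        # hypothetical object key
    Filename="/tmp/clipped_output.gpkg",
)
```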

brightemahpixida commented 6 months ago

> I see this is an implementation not only for soil but for all DAGs. Please confirm and change the title.
>
> The last open question is how to send the S3 file location to the frontend. There is an open question in issue #13 for @arendd and @feserm.

Oh, I think this is the answer to the question I posted in my last comment.

gannebamm commented 6 months ago

We will discuss this in a joint meeting. @vineetasharma105 already sent a scheduling request to the group.

Thanks @brightemahpixida, that looks very promising!

brightemahpixida commented 6 months ago

Oh OK, great.

Also, while working on this topic, I generated the de.NBI application credentials containing the access and secret keys for the object store container. I will forward these credentials to you via email shortly.
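For reference, a sketch of how such credentials are typically wired into a boto3 client; the angle-bracket values are placeholders for the emailed application credentials, and in a real deployment they would live in an Airflow connection or secrets backend rather than in code:

```python
import boto3

# Placeholder values only; never hard-code real credentials.
session = boto3.session.Session(
    aws_access_key_id="<application-credential-access-key>",
    aws_secret_access_key="<application-credential-secret-key>",
)
s3 = session.client("s3", endpoint_url="https://<denbi-object-store-endpoint>")
print(s3.list_buckets()["Buckets"])  # quick connectivity check
```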

brightemahpixida commented 6 months ago

Hi @gannebamm, I'm currently testing the DAGs, so there's a chance you might see some errors. I'll notify you once it's ready.

brightemahpixida commented 6 months ago

Hi @gannebamm, I was able to work out the new implementation as discussed during our sync on Thursday. The primary_DAG now returns a success state once the downstream DAGs (i.e. Soil, Air-Temp and Radiation) are complete, and a running state while execution is in progress.
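One way to get this behaviour in Airflow 2 is TriggerDagRunOperator with wait_for_completion=True; this is a sketch of the pattern, not necessarily the exact implementation in this PR:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="primary_DAG",
    start_date=datetime(2024, 1, 1),  # illustrative
    schedule_interval=None,
) as dag:
    # wait_for_completion=True keeps primary_DAG in a running state until
    # every downstream DAG has finished, so its final state mirrors theirs.
    for downstream in (
        "fetch_gpkg_soil_data_DAG",
        "fetch_cdc_air_temp_DAG",
        "fetch_cdc_radiation_DAG",
    ):
        TriggerDagRunOperator(
            task_id=f"trigger_{downstream}",
            trigger_dag_id=downstream,
            wait_for_completion=True,
        )
```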

I also changed the directory pattern in the object store container to match what was described during the meeting on Thursday. If you visit the OBS link now, you should see the clipped output nested within the DAG run ID for all three DAGs.
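In other words, the object keys presumably now follow a pattern along these lines (a guess at the layout described, with hypothetical names):

```python
# Hypothetical layout: <dag_folder>/<dag_run_id>/<filename>, e.g.
#   soil/manual__2024-01-01T00:00:00+00:00/clipped_output.gpkg
def build_object_key(dag_folder: str, dag_run_id: str, filename: str) -> str:
    return f"{dag_folder}/{dag_run_id}/{filename}"
```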

Happy to make more adjustments if needed :)

brightemahpixida commented 6 months ago

Hi @gannebamm, I made a couple more adjustments that allow the end user to trigger any of the three DAGs (soil, radiation_global and air_temp) independently, without necessarily executing all of them at once via the primary_DAG. This is basically an added option for the user.

I recall this was also a suggestion raised during the meeting we had last week.
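A sketch of the usual way to support this (assuming the common Airflow pattern, not necessarily the exact code): each downstream DAG declares no schedule of its own, so it runs only when triggered, whether on demand or by primary_DAG.

```python
from datetime import datetime

from airflow import DAG

# schedule_interval=None: the DAG runs only when triggered, whether by a
# user (UI, CLI, REST API) or by primary_DAG's TriggerDagRunOperator.
with DAG(
    dag_id="fetch_gpkg_soil_data_DAG",
    start_date=datetime(2024, 1, 1),  # illustrative
    schedule_interval=None,
) as dag:
    ...  # fetch -> clip -> write-to-S3 tasks
```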

gannebamm commented 5 months ago

@arendd @feserm, could you please test this with a POST request? If that works, we can be sure it also works from your side.
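For example, triggering one of the DAGs through Airflow's stable REST API might look like this (host and credentials are placeholders, and the deployment's auth backend must allow basic auth):

```python
import requests

resp = requests.post(
    "http://<airflow-host>/api/v1/dags/fetch_gpkg_soil_data_DAG/dagRuns",
    auth=("<user>", "<password>"),  # placeholder credentials
    json={"conf": {}},              # request parameters, if any, go here
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])
```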