data-for-change / anyway-etl

Airflow process - CBS data backfill #9

Open atalyaalon opened 3 years ago

atalyaalon commented 3 years ago

Create an Airflow process that allows CBS data backfill from S3 (without importing from email), with a load_start_year parameter that can be changed by the Airflow user. The relevant command:

    python main.py process cbs --source s3 --load_start_year 2020

We had a Jenkins process that enabled such a backfill.
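For illustration, a DAG task that takes load_start_year from a manual run's conf and passes it to this command could look roughly like the sketch below (the DAG id, task id, schedule and fallback default are assumptions, not the actual anyway-etl code):

    # a minimal sketch, assuming a BashOperator-based task; not the real anyway-etl DAG
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="cbs",                      # assumed DAG id
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,            # manual / externally triggered runs only
        catchup=False,
    ) as dag:
        # load_start_year can be overridden per run via dag_run.conf;
        # the fallback (current year - 1) matches the default mentioned below
        process_cbs = BashOperator(
            task_id="process-cbs-from-s3",
            bash_command=(
                "python main.py process cbs --source s3 "
                "--load_start_year "
                "{{ dag_run.conf.get('load_start_year', macros.datetime.now().year - 1) }}"
            ),
        )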

OriHoch commented 3 years ago

we are not using S3 at the moment; we have all the files in the Airflow data directory - https://airflow-data.anyway.co.il/cbs/files/

OriHoch commented 3 years ago

there is no point in setting load_start_year to 2020, as that is the default (current year - 1)

OriHoch commented 3 years ago

development complete, deployed to dev. To run a backfill, do a manual DAG run with the following example JSON:

    {"load_start_year": 2019}

initiated 2 dag runs on dev for testing:

  1. default run without any parameters
  2. run with load_start_year 2019

assigning to @atalyaalon to test and make a release

atalyaalon commented 2 years ago

@OriHoch it seems like 2020 data is not in our DB, even though the default load_start_year is 2020, as you mentioned - it looks like data from 2020 and 2021 is deleted in that case, but only data from 2021 is loaded by the CBS process (from S3). Can you take a look?

OriHoch commented 2 years ago

how do you check this?

atalyaalon commented 2 years ago

@OriHoch When load_start_year is 2020, data from 2020 onward is deleted. Now the question is: what data is loaded from S3? I assume data from 2020 is not loaded for some reason

OriHoch commented 2 years ago

we are not loading any data from s3

how did you check that data in DB from 2020 is not loaded?

atalyaalon commented 2 years ago

Why aren't we loading from S3? I queried the markers table in our DB - no accidents in 2020

OriHoch commented 2 years ago

all the data is available in our storage; S3 is not used at all at the moment - https://airflow-data.anyway.co.il/cbs/files/

could you write the query you used?

atalyaalon commented 2 years ago

https://app.redash.io/hasadna/queries/1008428 - no 2020 here

atalyaalon commented 2 years ago

@OriHoch we would like to extract updated data by tomorrow evening (Monday) for a specific report. Is it possible you'll take a look tomorrow morning? Before moving to Airflow, the CBS process used the load_start_year variable to extract the relevant data from S3 - all the data from that year onward - and in this way we didn't have holes in the data.

OriHoch commented 2 years ago

I don't think I'll be able to do it that soon

atalyaalon commented 2 years ago

@OriHoch no worries - I'm taking care of it for now using Jenkins :) I will soon add a description here of the steps that need to be done - but no rush; I'm working on a quick solution for now, until you're available to fix Airflow (I disabled the Airflow CBS process for now). Thanks for everything!

OriHoch commented 2 years ago

    select
        accident_year,
        accident_severity_hebrew,
        count(*)
    from markers_hebrew
    where accident_severity_hebrew is not null
    group by
        accident_year,
        accident_severity_hebrew
    order by accident_severity_hebrew, accident_year

atalyaalon commented 2 years ago

Hi @OriHoch, the anyway-etl implementation is not ready - there is a logical bug there: there should be a full separation between:

OriHoch commented 2 years ago

@atalyaalon this separation exists; each step of the DAG is completely independent, as you wrote in your comment - https://github.com/hasadna/anyway-etl/blob/main/airflow_server/dags/cbs.py#L17
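For illustration, fully separated steps typically look roughly like the sketch below - each stage is its own task, so any single step can be cleared and re-run on its own (task ids and sub-commands are assumptions, not the actual contents of airflow_server/dags/cbs.py):

    # a sketch only, assuming BashOperator-based tasks; the real dags/cbs.py may differ
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG("cbs", start_date=datetime(2021, 1, 1), schedule_interval=None, catchup=False) as dag:
        import_files = BashOperator(
            task_id="import-cbs-files",
            bash_command="python main.py import cbs",  # hypothetical sub-command
        )
        process_cbs = BashOperator(
            task_id="process-cbs",
            bash_command="python main.py process cbs --source s3",  # command from the issue description
        )
        # the edge only orders the steps; each task can still be cleared and re-run on its own
        import_files >> process_cbs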

regarding S3 - we replaced it with the local data storage, which is available at the URL above; if you want, we can copy that over to S3 too

atalyaalon commented 2 years ago

@OriHoch Thanks.

If so - perhaps it's better to copy the files only to S3, and not to local storage, and then load the data into the DB from the S3 files rather than from local storage - meaning local storage wouldn't be used in either of these 2 processes.

The reason is that we might want to reload multiple previous years (and the email process only loads the latest years), and it's important for us to maintain all the files in S3.

atalyaalon commented 2 years ago

Let me clarify - I think the consistent location for saving the file data should be S3, not local storage - that's why I think local storage should not be used, for the sake of data consistency

OriHoch commented 2 years ago

ok, please open a new issue for this and we can discuss it there, it's not related to the CBS data backfill

atalyaalon commented 2 years ago

Thanks! opened https://github.com/hasadna/anyway-etl/issues/17

OriHoch commented 2 years ago

fixed in #16 (pending merge & deploy)