Open atalyaalon opened 3 years ago
we are not using S3 at the moment, we have all the files in the airflow data directory - https://airflow-data.anyway.co.il/cbs/files/
there is no point in setting load_start_year to 2020, as that is already the default (current year - 1)
development complete, deployed to dev to run a backfill - do a manual DAG run with the following example json:
{"load_start_year": 2019}
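As a sketch of how the default and the manual-run override interact (the function name here is illustrative, not the actual anyway-etl code), the `load_start_year` resolution described in this thread - default of current year minus 1, overridable from the DAG run conf json - could look like:

```python
import datetime


def resolve_load_start_year(dag_run_conf=None):
    """Return load_start_year: the value from the manual DAG run conf
    if one was supplied, otherwise the default of current year - 1.
    Illustrative sketch only - not the actual anyway-etl implementation."""
    default = datetime.date.today().year - 1
    if dag_run_conf and "load_start_year" in dag_run_conf:
        return int(dag_run_conf["load_start_year"])
    return default


# A manual DAG run with conf {"load_start_year": 2019} overrides the default;
# a run with no conf falls back to current year - 1.
print(resolve_load_start_year({"load_start_year": 2019}))
```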
initiated 2 dag runs on dev for testing:
assigning to @atalyaalon to test and make a release
@OriHoch it seems 2020 data is not in our DB, even though the default load_start_year is 2020 as you mentioned. In that case data from both 2020 and 2021 is deleted, but only data from 2021 is loaded in the CBS process (from s3). Can you take a look?
how do you check this?
@OriHoch When load_start_year is 2020, data from 2020 onwards is deleted. The question is what data is then loaded from s3 - I assume data from 2020 is not loaded for some reason
we are not loading any data from s3
how did you check that data in DB from 2020 is not loaded?
Why aren't we loading from s3? I queried the markers table in our DB - no accidents in 2020
all the data is available in our storage, S3 is not used at all at the moment - https://airflow-data.anyway.co.il/cbs/files/
could you write the query you used?
https://app.redash.io/hasadna/queries/1008428 No 2020 here
@OriHoch we would like to extract updated data by tomorrow evening (Monday) for a specific report. Could you take a look tomorrow morning? Before moving to airflow, the cbs process used the load_start_year var to extract the relevant data from s3 - all the data starting from that year - and in this way we didn't have holes in the data.
I don't think I'll be able to do it that soon
@OriHoch no worries - I'm taking care of it for now using Jenkins :) I will soon add a description here of the steps that need to be done - but no rush, I'm working on a quick solution for now until you're available to fix airflow (I disabled the airflow cbs process for now). Thanks for everything!
select
accident_year,
accident_severity_hebrew,
count(*)
from markers_hebrew
where accident_severity_hebrew is not null
group by
accident_year,
accident_severity_hebrew
order by accident_severity_hebrew, accident_year
Hi @OriHoch, the anyway-etl implementation is not ready - there is a logical bug there: there should be a full separation between:
@atalyaalon this separation exists - each step of the DAG is completely independent, as you wrote in your comment - https://github.com/hasadna/anyway-etl/blob/main/airflow_server/dags/cbs.py#L17
regarding S3 - we replaced it with the local data storage which is available here, if you want we can copy that over to S3 too
@OriHoch Thanks.
If so - perhaps it's better to copy it only to S3, not to local storage, and then load data into the DB from the S3 files rather than from local storage - meaning local storage would not be used in these 2 processes.
The reason is that we might want to reload multiple previous years (the email process only loads the latest years), and it's important for us to maintain all files in S3.
Let me clarify - I think the consistent location for saving the files' data should be S3, not local storage. That's why I think local storage should not be used, for the sake of data consistency
ok, please open a new issue for this and we can discuss it there, it's not related to the CBS data backfill
Thanks! opened https://github.com/hasadna/anyway-etl/issues/17
fixed in #16 (pending merge & deploy)
Create an airflow process that allows CBS data backfill from s3 (without importing from email), with a load_start_year parameter that can be changed by the airflow user. The relevant command: python main.py process cbs --source s3 --load_start_year 2020. We had a Jenkins process that enabled such a backfill.
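A minimal sketch of how that command's parameters might be parsed (illustrative only - the real anyway-etl CLI may use a different framework and different defaults):

```python
import argparse


def parse_cbs_args(argv):
    """Parse the cbs backfill arguments.
    Hypothetical sketch - choices and defaults here are assumptions,
    not the actual main.py implementation."""
    parser = argparse.ArgumentParser(prog="process cbs")
    parser.add_argument("--source", choices=["s3", "email"], default="email",
                        help="where to load CBS files from")
    parser.add_argument("--load_start_year", type=int, default=None,
                        help="delete and reload data from this year onwards")
    return parser.parse_args(argv)


# Mirrors: python main.py process cbs --source s3 --load_start_year 2020
args = parse_cbs_args(["--source", "s3", "--load_start_year", "2020"])
print(args.source, args.load_start_year)
```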