The current implementation is unscalable and gets slower with each new day

sglavoie commented 4 years ago

Our current implementation processes data inefficiently: the end result is that GitHub Actions often terminates with a timeout error. For this reason, issues such as #75, #79 and #73 are being reported. The reason for this seems to be a new column that is added each day to the time series dataset in the upstream increases the processing time for the flow and recently it has gone to time more than 6 hours.

When I fetch data from this repository, I want to access the latest data being published from the sources in use so I can rely on it and use it elsewhere consistently for further processing and/or analysis.

Acceptance

[ ] Data is processed in a timely manner: at most a few minutes for each run with GitHub Actions.
[ ] Data is updated consistently at most every 6 hours without failing (this can be lowered as much as desired with the suggested solution).

Tasks

[ ] Check if a date is missing in a CSV file, starting from today and going backward in time. Once a date is found, we have a list of all the missing dates.
- [X] Fetch data only from daily reports from those missing dates.
- [x] Append new data to existing files by iterating over rows once.
- [X] Implement for countries-aggregated.csv
- [X] Implement for key-countries-pivoted.csv
- [x] Implement for time-series-19-covid-combined.csv
- To be improved: data can be captured in a better way regarding recoveries by using the daily reports.
- [ ] Implement for us_confirmed.csv
- [ ] Implement for us_deaths.csv
- [x] Implement for worldwide-aggregated.csv
[X] Download and update reference CSV.
[ ] Minimally process data in the end, i.e. update cumulative results only.
[ ] Clean existing data and extract coordinates (latitude and longitude) in a separate file.

Optional (to confirm):

[ ] Reduce delay between runs for GitHub Actions (2 hours is probably a good enough delay).
- [ ] Make sure to commit only when data is actually updated.

Analysis

How we plan to fix this

Instead of processing all the data from scratch every single day, simply retrieve daily reports and append to existing data:
- This would lead to minimal processing. Daily reports are already being generated here: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports
- This basically becomes a fetching task requiring almost constant time (daily reports do not grow asymptotically in size each day) instead of a processing task that will only take more and more time as it considers the cumulative amount of data.

A proof of concept has been created. Currently, it works well to retrieve part of the necessary data (to generate countries-aggregated.csv and key-countries-pivoted.csv) and does so in a matter of a second or two.

Proof of concept available here.

Technique behind the envisioned solution

Check if a date is missing in a CSV file, starting from today and going backward in time. Once a date is found, we have a list of all the missing dates.
Fetch data only from daily reports from those missing dates.
Append new data to existing files by iterating over rows once.
Minimally process data in the end, i.e. update cumulative results only.
Because the script would check if dates are missing before doing anything else, it would just finish running almost instantly if no new data has been found.

What can be improved further

This should probably be considered as part of this proposed solution since it doesn't cause any technical burden and improves the solution even more:

Process latitude and longitude only once, separately. This comes from #1. Countries usually don't move noticeably enough for those values to need updates (a few inches a year, roughly...).
Cumulative results should be calculated once per execution at the end, once all data has been fetched and processed. This would work well with #22.

/cc @rufuspollock

sdangt commented 4 years ago

This makes sense. The data here has fallen well behind. I do like the format that you use and the way you provide it. Aside from the slowness, which you are planning to fix, nice job.

sdangt commented 4 years ago

I am actually a little confused as to why it was being done that way before. I just tallied the numbers for cases myself in Excel and it took about 5 seconds for the most recent day...to do it manually.

At least for the worldwide aggregated, the old method seems like a lot of unnecessary processing. It may make sense if you are starting from scratch. However, even for doing each day, it could be easier to process each file -- which is labeled neatly with the date -- and then if one or two dates don't process, you can always go back and redo only those.

rufuspollock commented 4 years ago

@sglavoie good analysis 👏 i'm kind of surprised that processing time is 6h - this seems extraordinary. But anyway, let's get your fix in asap as it seems good and we want to get up to date asap.

paulmz1 commented 4 years ago

I think it may be taking 6h due to running out of memory as process.py takes less than 6 min on my laptop but uses 1GB of memory.

The following code generates us_confirmed.csv in 5 seconds using Pandas.

    import pandas as pd
    df = pd.read_csv(BASE_URL + CONFIRMED_US, dtype={'Lat': str, 'Long_': str}) # dd.SOURCE_DIR +, dd.BASE_URL +
    df.set_index(['UID','iso2','iso3', 'code3', 'FIPS', 'Admin2','Lat','Combined_Key',
                  'Province_State', 'Country_Region', 'Long_'], inplace=True)
    df = df.stack()
    df = df.reset_index().set_index('UID')
    df.rename(columns={"Long_": "Long", "Country_Region": "Country/Region", "Province_State": "Province/State",
                       "level_11": "Date", 0: "Case"}, errors="raise", inplace=True)
    df = df[df.columns[[0,1,2,3,4,5,6,10,11,9,8,7]]]
    df["FIPS"].fillna("", inplace=True)
    df['Date'] = pd.to_datetime(df['Date'])

    df.to_csv(DEST_DIR + US_CONFIRMED)

Link to the Full code on PasteBin

sglavoie commented 4 years ago

Thank you @sdangt and @paulmz1 for chiming in!

As a quick update, I have now implemented the necessary functions to deal with a good part of the data in a very scalable way by retrieving only daily updates (not yet published here as it's still a work in progress).

The dataflows package is very powerful: the problem we are facing has something to do with the way the script is currently working, not really with what tools are being used. Of course, pandas is fantastic, but we are trying to keep dependencies very light and usually rely on dataflows first. However, in this case, I bet we can do without both of them as the amount of processing to be done is very minimal.

I concur, Pandas works great. :wink:

anuveyatsu commented 4 years ago

Problem is that source data can be updated/fixed for previous dates, e.g., it has number of issues/mistakes that are fixed later by PR etc. So fetching data only for missing dates might not be the best solution:

Fetch data only from daily reports from those missing dates.

Note that data is actually updated only once in 24h (at least that was the case in April):

Reduce delay between runs for GitHub Actions (2 hours is probably a good enough delay).

I don't think processing a job actually takes 6h - it actually takes something around 10-20 mins.

sglavoie commented 4 years ago

Thank you for chiming in @anuveyatsu!

So fetching data only for missing dates might not be the best solution

Yes, I believe the proposed solution would probably need to retrieve all the data from the very beginning to ensure that the formatting is correct and that possible updates published in the original source have been picked up.

Note that data is actually updated only once in 24h (at least that was the case in April)

From what I can tell in the GitHub workflow on line 3 and from the frequency of commits:

 - cron:  '0 */6 * * *'

This is being updated every 6 hours, although it's not really necessary.

I don't think processing a job actually takes 6h - it actually takes something around 10-20 mins.

Yes, you are right: most of the jobs terminate within this range. Some jobs however took literally between 3 hours and 6 hours to run, at which point it times out. Some examples:

However, the latest failure of this type was about 2 weeks ago. The script still fails from time to time, but lately it was when pushing the data to DataHub only and a login issue was fixed about this since then, so it might be all working again well enough.

rufuspollock commented 4 years ago

@sglavoie can we close this now? Is this fixed?

sglavoie commented 4 years ago

@rufuspollock, I think we can close this for now and come back to it later if needed, especially given the fact that it mostly works as is now and reworking the script from scratch would take longer than what was initially anticipated.

The script runs into problems from time to time in GitHub Actions (further debugging would be needed to understand where exactly that happens since this is inside a "flow" going through many methods at once) but is reliable enough to update at least once or twice a day, although it's supposed to be updating every 6 hours.

Also, the amount of daily processing is still quite manageable and can be handled for a couple of additional months at the current rate, as most executions still take under 18 to 19 minutes to run.

CLOSING. Reason: No immediate action is required anymore.

datasets / covid-19