epiforecasts / covid-rt-estimates

National and subnational estimates of the time-varying reproduction number for Covid-19
https://epiforecasts.io/covid/
MIT License

Long runtimes in some regions #4

Closed seabbs closed 4 years ago

seabbs commented 4 years ago

EpiNow2 has large differences in runtime between regions. Some of this may be because regions are being fit in which there is very little to no data. Identifying these regions is the first step to fixing this behaviour.

joeHickson commented 4 years ago

I think you should be able to check the difference between the timestamps on the files to identify runtime (approximately); latest_date.rds is written early in the process. Which dataset is the easiest (smallest) that demonstrates this behaviour?
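A minimal sketch of this timestamp approach (the directory layout and helper name are assumptions for illustration; the full script later in the thread does this across all regions):

```python
import os

def approx_runtime(region_dir):
    """Approximate a region's runtime as the gap between the modification
    times of the first and last files the pipeline writes."""
    start = os.path.getmtime(os.path.join(region_dir, "latest_date.rds"))
    end = os.path.getmtime(os.path.join(region_dir, "summary.rds"))
    return end - start
```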

seabbs commented 4 years ago

Hi Joe,

So checking this during a run I see the issue in Italy, with everything finished apart from the Lazio region, which is sitting on one core (at 100% usage) and has been for the last 30+ minutes.

joeHickson commented 4 years ago

I'm just knocking up a python script to scrape these and give us a region / runtime estimate dump. I am running global currently (I think it's almost finished) so I'll test it there and once it does Italy I'll check that Lazio comes up. I can't pretend to be able to understand why but I can at least come up with a list!

seabbs commented 4 years ago

Sounds like a great idea - getting a list of where the issues are is most of the work.

I just pushed the run I have so far which has taken quite a bit longer than expected (most regions run in a few minutes and some take over an hour).

seabbs commented 4 years ago

So this is the summary plot for Lazio.

The data looks sane enough, so I think this might be an issue with the model (it's a growth/decay model, so stable cases (i.e. here) are very difficult for it to fit). I will do some more testing to see if that is the issue and then think about solutions.

I guess the confirmation that this is the problem, and that it's not a compute issue etc., would be if other regions with similar run times have a similar data profile.

joeHickson commented 4 years ago

one csv of region processing time in seconds from the global cases: region_processing_time.txt

import csv
import os
from datetime import datetime

# Root directory containing one sub-directory per region
regional_breakdown_location = r'/storage/maint/covid-rt/national/cases/national'

location_result = {}
for location_dir in os.scandir(regional_breakdown_location):
    if not location_dir.is_dir():
        continue
    # Find the most recent dated sub-directory (skipping the 'latest' link)
    max_date = None
    for date in os.scandir(location_dir):
        if date.name == 'latest':
            continue
        if not max_date or (datetime.strptime(date.name, '%y-%m-%d')
                            > datetime.strptime(max_date, '%y-%m-%d')):
            max_date = date.name
    if not max_date:
        location_result[location_dir.name] = -1  # no dated run found
        continue
    # latest_date.rds is written early in the run and summary.rds at the end,
    # so the gap between their modification times approximates the runtime
    start_file = os.path.join(location_dir.path, max_date, 'latest_date.rds')
    if not os.path.exists(start_file):
        location_result[location_dir.name] = -2  # run never started
        continue
    end_file = os.path.join(location_dir.path, max_date, 'summary.rds')
    if not os.path.exists(end_file):
        location_result[location_dir.name] = -3  # run not finished
        continue
    location_result[location_dir.name] = (os.path.getmtime(end_file)
                                          - os.path.getmtime(start_file))

# Write regions sorted by runtime, closing the file properly
with open('analysis.csv', 'w', newline='') as f:
    w = csv.writer(f)
    for loc, time in sorted(location_result.items(), key=lambda kv: kv[1]):
        w.writerow([loc, time])

Ethiopia's "no result" is probably the single process that's still outstanding...

joeHickson commented 4 years ago

That does highlight a possible performance optimisation we have been contemplating on our side - if we could get the data for ALL the geographic locations (e.g. collate all the different datasets into a list of region datasets and the epinow configuration to go with them) then we could put them in a single queue for processing, rather than having to wait for each batch to complete in its entirety. It's not a high priority though.
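The single-queue idea described above could be sketched roughly as follows (the `run_region` callback and the shape of the `datasets` mapping are hypothetical, not the repo's actual API): flatten every (dataset, region, config) triple into one shared job list so a slow region in one dataset does not hold up the next dataset's entire batch.

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(datasets, run_region, workers=4):
    # datasets maps a dataset name to (list of regions, shared config);
    # flatten everything into one job list so workers stay busy across
    # dataset boundaries instead of waiting for each batch to finish.
    jobs = [(name, region, cfg)
            for name, (regions, cfg) in datasets.items()
            for region in regions]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda job: run_region(*job), jobs))
```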

seabbs commented 4 years ago

That's really great - seems like it would be a great script to have running after every update as a check.

Hmm the estimates look okay for Ethiopia (obviously the end of forecast uncertainty is not ideal but more of a model problem than a computation issue/bug) - perhaps a bug in the scraping script?

I am thinking perhaps this just occurs for subnational estimates as they have imports from other regions playing a big role (which is probably less of an issue on the national scale). I am looking at updating the model to handle this better.

Running everything at once is a good idea. The two issues I see with that are if we start introducing regional differences (like different reporting delays etc) and how to group things for summary plots. Both of those seem very easy to overcome though with a bit of work.

seabbs commented 4 years ago

Had a go at adding imports yesterday to see if this addressed the issue (https://github.com/epiforecasts/EpiNow2/pull/41) but it in fact makes model fitting much slower, with more failures, so may need to pin that idea for now.

joeHickson commented 4 years ago

I was still running the main script - Ethiopia is the only location that hadn't yet completed (the script has safety checks in it). My python script couldn't see the modification date because it had been masked behind git check in / out.

re: running the script regularly - I would rather add logging to the epiforecasts side and get accurate logs out. There are a few options for how to do that but I'll have a think. Partly it will come down to what analytics we want and when. It might be that we want to combine it with some form of data provenance file, as I believe that can capture some timestamp info. My inclination might be to spit the info out into a log file. I believe there are logging libraries for R to make this easier - I don't know if you have a preference for one?
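The thread is discussing R logging libraries; purely as a language-agnostic illustration of the kind of per-region timing record such logging could capture (the wrapper and log format here are assumptions, not the project's implementation), the idea is to time each region's fit in the pipeline itself rather than inferring it from file modification times:

```python
import logging
import time

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)

def timed_region(region, fit):
    # Wrap one region's fitting call and emit its wall-clock runtime,
    # so timings come straight from the pipeline's own log.
    start = time.monotonic()
    result = fit()
    logging.info("region=%s runtime=%.1fs", region, time.monotonic() - start)
    return result
```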

re: grouping - yes, there would need to be a way for each region to specify which additional data files go with it (reporting delays etc) so that the outer control layer can be agnostic. It's a tangent for this issue though!

seabbs commented 4 years ago

Logs sound like a good idea. I don't have a preference as to how they are generated though.

All sounds sensible.

I am going to close this as I think the run times are really a model issue and therefore belong in EpiNow2. Will investigate this a bit more and see what can be done to improve there.