Closed by seabbs 4 years ago
I think you should be able to check the difference between the timestamps on the files to identify the (approximate) runtime; latest_date.rds is written early in the process. Which dataset is the easiest (smallest) one that demonstrates this behaviour?
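Roughly what I mean, as a minimal sketch for a single region (the directory path is just a placeholder):
import os

# Placeholder path to one region's dated results directory.
results_dir = "/path/to/results/Lazio/2020-07-20"

# latest_date.rds is written early in the run and summary.rds at the end, so
# the difference in modification times is an approximate runtime.
start = os.path.getmtime(os.path.join(results_dir, "latest_date.rds"))
end = os.path.getmtime(os.path.join(results_dir, "summary.rds"))
print(f"approximate runtime: {end - start:.0f} seconds")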
Hi Joe,
So, checking this during a run, I see the issue in Italy: everything has finished apart from the Lazio region, which is sitting on one core (at 100% usage) and has been for the last 30+ minutes.
I'm just knocking up a Python script to scrape these and give us a region/runtime estimate dump. I'm currently running the global dataset (I think it's almost finished), so I'll test it there, and once it does Italy I'll check that Lazio comes up. I can't pretend to understand why, but I can at least come up with a list!
Sounds like a great idea - getting a list of where the issues are is most of the work.
I just pushed the run I have so far, which has taken quite a bit longer than expected (most regions run in a few minutes, but some take over an hour).
So this is the summary plot for Lazio.
The data looks sane enough, so I think this might be an issue with the model (it's a growth/decay model, so stable cases, as here, are very difficult for it to fit to). I will do some more testing to see if that is the issue and then think about solutions.
I guess the confirmation that this is the problem, and not a compute issue etc., would be if other regions with similar run times have a similar data profile.
One CSV of region processing times in seconds from the global cases run: region_processing_time.txt
import csv
import os
from datetime import datetime

# Root directory with one sub-directory per location, each containing dated
# result directories plus a 'latest' directory.
regional_breakdown_location = r'/storage/maint/covid-rt/national/cases/national'

location_result = {}
for location_dir in os.scandir(regional_breakdown_location):
    if not location_dir.is_dir():
        continue
    # Find the most recent dated results directory, skipping 'latest'.
    # Assumes the dated directories parse with this format; use '%Y-%m-%d'
    # if they are named with four-digit years.
    max_date = None
    for date in os.scandir(location_dir.path):
        if date.name == 'latest':
            continue
        if not max_date or datetime.strptime(max_date, '%y-%m-%d') < datetime.strptime(date.name, '%y-%m-%d'):
            max_date = date.name
    if not max_date:
        location_result[location_dir.name] = -1  # no dated results at all
        continue
    # latest_date.rds is written early in the run and summary.rds at the end,
    # so the difference in modification times approximates the runtime.
    start_file = os.path.join(location_dir.path, max_date, "latest_date.rds")
    if not os.path.exists(start_file):
        location_result[location_dir.name] = -2  # run never started
        continue
    start = os.path.getmtime(start_file)
    end_file = os.path.join(location_dir.path, max_date, "summary.rds")
    if not os.path.exists(end_file):
        location_result[location_dir.name] = -3  # run not yet finished
        continue
    end = os.path.getmtime(end_file)
    location_result[location_dir.name] = end - start

# Write locations sorted by runtime (the negative error codes sort first).
with open("analysis.csv", "w", newline="") as f:
    w = csv.writer(f)
    for loc, time in sorted(location_result.items(), key=lambda kv: kv[1]):
        w.writerow([loc, time])
Ethiopia's "no result" is probably the single process that's still outstanding...
That does highlight a possible performance optimisation we have been contemplating on our side - if we could get the data for ALL the geographic locations (e.g. collate all the different datasets into a list of region datasets and the epinow configuration to go with them) then we could put them in a single queue for processing, rather than having to wait for each batch to complete in its entirety. It's not a high priority though.
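As a rough sketch of the shape I have in mind (the job contents and the fitting step are placeholders, not the real interface):
from concurrent.futures import ProcessPoolExecutor

def run_region(job):
    # Placeholder for fitting a single region; in practice this would hand the
    # region's dataset and epinow configuration to the actual pipeline.
    location, cases, config = job
    return location, {"n_cases": len(cases), "horizon": config["horizon"]}

# One flat queue of (location, dataset, config) jobs collated across all of the
# geographic breakdowns, instead of waiting for each batch to finish in full.
jobs = [
    ("Lazio", [10, 12, 11], {"horizon": 14}),     # e.g. from the Italy dataset
    ("Ethiopia", [3, 5, 4], {"horizon": 14}),     # e.g. from the global dataset
]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = dict(pool.map(run_region, jobs))
    print(results)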
That's really great - it seems like a useful script to have running after every update as a check.
Hmm, the estimates look okay for Ethiopia (obviously the end-of-forecast uncertainty is not ideal, but that is more of a model problem than a computation issue/bug) - perhaps a bug in the scraping script?
I am thinking perhaps this just occurs for subnational estimates as they have imports from other regions playing a big role (which is probably less of an issue on the national scale). I am looking at updating the model to handle this better.
Running everything at once is a good idea. The two issues I see with that are what happens if we start introducing regional differences (like different reporting delays etc.) and how to group things for summary plots. Both of those seem very easy to overcome with a bit of work, though.
Had a go at adding imports yesterday to see if this addressed the issue (https://github.com/epiforecasts/EpiNow2/pull/41), but it in fact makes model fitting much slower with more failures, so we may need to pin that idea for now.
I was still running the main script - Ethiopia is the only location that hadn't yet completed (the script has safety checks in it). My Python script couldn't see the modification date because it had been masked by git check-in/check-out.
re: running the script regularly - I would rather add logging on the epiforecasts side and get accurate logs out. There are a few options for how to do that, but I'll have a think. Partly it will come down to what analytics we want and when. It might be that we want to combine it with some form of data provenance file, as I believe that can capture some timestamp info. My inclination would be to write the info out to a log file. I believe there are logging libraries for R that make this easier - do you have a preference for one?
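To illustrate the kind of information I mean (just a Python sketch of the log content; the real implementation would sit in the R pipeline, and the region fit here is a stand-in):
import logging
import time

# Illustration only: one timestamped line when each region's fit starts and
# ends gives per-region runtimes directly, without inferring them from mtimes.
logging.basicConfig(
    filename="runtimes.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def fit_region(region):
    logging.info("start region=%s", region)
    start = time.monotonic()
    time.sleep(0.1)  # stand-in for the actual model fit
    logging.info("end region=%s runtime_seconds=%.1f", region, time.monotonic() - start)

for region in ["Lazio", "Lombardia"]:
    fit_region(region)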
re: grouping - yes, there would need to be a way for each region to specify which additional data files go with it (reporting delays etc) so that the outer control layer can be agnostic. It's a tangent for this issue though!
Logs sound like a good idea. I don't have a preference as to how they are generated, though.
All sounds sensible.
I am going to close this as I think the run times are really a model issue and therefore belong in EpiNow2. Will investigate this a bit more and see what can be done to improve there.
EpiNow2 has large differences in runtime between regions. Some of this may be because some of the regions being fit have very little to no data. Identifying these regions is the first step to fixing this behaviour.