epiforecasts / covid-rt-estimates

National and subnational estimates of the time-varying reproduction number for Covid-19
https://epiforecasts.io/covid/
MIT License

Cron run failing #138

Closed joeHickson closed 3 years ago

joeHickson commented 3 years ago

It seems to have only generated a subset of files for the UK regions (and other datasets) - https://github.com/epiforecasts/covid-rt-estimates/tree/master/subnational/united-kingdom/cases/national/Scotland/latest as an example. I have a log file full of tmp file issues again. I'm re-running to see what's going on.

seabbs commented 3 years ago

That looks complete to me? Not sure what I am missing. Quite surprised to see error and trace in there, as they should only be saved on error.

joeHickson commented 3 years ago

@seabbs I re-ran the UK datasets by hand on Saturday so you might have been seeing the results of that. https://github.com/epiforecasts/covid-rt-estimates/tree/587995ba64056b2d43ba2cd478fbf421b78bb1c2/subnational/united-kingdom/cases/national/Scotland/latest will show you it in the partial state.

seabbs commented 3 years ago

Just to clarify this issue: everything works for any dataset when run by hand (what is the exact runtime instruction here?), and only the first dataset works when running on the CRON job, with all others failing due to the lack of a temporary directory?

Is that all correct?

seabbs commented 3 years ago

Or they only work when run by hand in the test repo and not in the production one? I don't understand the behaviour change between the tests and production. What is the difference here? Or are we actually saying the tests never worked?

joeHickson commented 3 years ago

they fail at random (probably more than half of the time). Running them by hand on production wins by sheer number of retries. I think the tests worked but I think that was because it got lucky!

seabbs commented 3 years ago

Have you tried rolling back the future setting?

seabbs commented 3 years ago

in testing we tried 4 datasets all of which worked?

joeHickson commented 3 years ago

I can't remember - it might have been 2 (united-kingdom + canada) but I couldn't tell you with confidence either way

joeHickson commented 3 years ago

Running now with future = false

joeHickson commented 3 years ago

That worked for uk admissions - I'll see how the cron picks up overnight.

seabbs commented 3 years ago

Are you seeing issues due to old results (which are not present on the test server) conflicting with new results? Maybe clearing previous estimates out would help?

joeHickson commented 3 years ago

just waiting on the cron to run - it ended up firing 9 hours late because I still had it set from trying to play catch up at the weekend.

seabbs commented 3 years ago

Looks good so far. Think a potential problem may occur if somewhere fails/times out due to the issue above but we will see. Been reading more about detecting what has happened in Stan model so may be able to tighten that up but still no idea what the previous error was or why this might have worked. The tempdir issue is worrying as I have been unable to reproduce elsewhere.

seabbs commented 3 years ago

Also are these files at root something to do with a recent change? (https://github.com/epiforecasts/covid-rt-estimates/blob/master/united-kingdom-admissions_raw_outcome.rds)

joeHickson commented 3 years ago

smells like debug code to me

seabbs commented 3 years ago

Looks like maybe a git ignore is needed, as the US update (🥳 ) is just pushing a similar object: https://github.com/epiforecasts/covid-rt-estimates/blob/master/united-states_raw_outcome.rds

joeHickson commented 3 years ago

nah, that was in progress when I put the last commit in so should be the final one

joeHickson commented 3 years ago

This is looking promising - let's see what happens tonight and possibly close it in the morning

seabbs commented 3 years ago

Summary plots failing likely due to the presence of renamed columns (due to old estimates being present on the production server).

See https://raw.githubusercontent.com/epiforecasts/covid-rt-estimates/master/national/cases/summary/rt.csv

Suggested fix is to remove archived estimates. This likely applies to all datasets that have regions that were once estimated and now are not.

sbfnk commented 3 years ago

That sounds like a good idea - same issue afflicts e.g. where one subregion fails, see e.g. https://github.com/epiforecasts/covid-rt-estimates/blob/master/subnational/united-states/cases/summary/rt.csv which is a mixture of old and new column names because South Carolina failed.

@joeHickson could you remove all archived estimates before the next run, and re-run the US?

seabbs commented 3 years ago

It looks like status.csv does not contain the correct information for the UK estimates?

joeHickson commented 3 years ago

shall I set the refresh flag on the cron script to force a flush?

seabbs commented 3 years ago

I assume the status.csv is another issue?

sbfnk commented 3 years ago

Flush as in flush all estimates? In that case perhaps consider scheduling a re-run of everything that has failed at the end? A benefit of keeping old estimates is that once we're back to daily operation if one run fails there is still a fairly recent update unless there is something systematic about a data set that makes the model fall over.

joeHickson commented 3 years ago

status didn't update following a 403 error from dataverse, which had gone away by the time the later scripts ran.

joeHickson commented 3 years ago

refresh does this:

if (refresh) {
  if (dir.exists(location$target_folder)) {
    futile.logger::flog.trace("removing estimates in order to refresh")
    unlink(location$target_folder, recursive = TRUE)
  }
}
joeHickson commented 3 years ago

the refresh option seems to have left it unhappy. I have no summary files at all - I think it might be that whilst it issues the warning for being unable to load the file, it then does horrible things where the result is used (get.R L121). Running readRDS locally with a garbage file produces an error AND a warning - I think we are seeing one but not the other. Perhaps this could be resolved with a pre-filter on line 92 to remove those regions that don't have the final .rds file (and therefore failed)?
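
A minimal sketch of the pre-filter idea, assuming each region folder holds a `latest/summarised_estimates.rds` file when its run completed (the layout shown in the log paths in this thread). The helper name `regions_with_results` is hypothetical and is not part of EpiNow2 or this repository.

```r
# Hypothetical helper: keep only regions whose latest run wrote the final file.
regions_with_results <- function(results_dir,
                                 final_file = "summarised_estimates.rds") {
  # Immediate sub-folders are treated as region names
  regions <- list.dirs(results_dir, full.names = FALSE, recursive = FALSE)
  has_final <- file.exists(
    file.path(results_dir, regions, "latest", final_file)
  )
  regions[has_final]
}

# Toy check in a temporary directory: one complete region, one timed-out region
results_dir <- tempfile("prefilter-demo")
dir.create(file.path(results_dir, "Scotland", "latest"), recursive = TRUE)
dir.create(file.path(results_dir, "South West", "latest"), recursive = TRUE)
saveRDS(
  data.frame(x = 1),
  file.path(results_dir, "Scotland", "latest", "summarised_estimates.rds")
)

complete_regions <- regions_with_results(results_dir)
```

Only the regions returned here would then be passed on to the summary step, so a folder left behind by a timed-out fit is skipped rather than read.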

2020-11-25 16:21:56 INFO Regions with runtime errors: 1
2020-11-25 16:21:56 INFO Runtime error in South West : South West: model fitting was timed out - try increasing the max_execution_time - 
2020-11-25 16:21:56 INFO Saving timings information to : subnational/united-kingdom/cases/national
2020-11-25 16:21:56 DEBUG resetting future plan to sequential
2020-11-25 16:21:56 TRACE generating summary data
2020-11-25 16:21:56 INFO Saving summary to : subnational/united-kingdom/cases/summary
2020-11-25 16:21:56 INFO Extracting results from: subnational/united-kingdom/cases/national
2020-11-25 16:21:56 TRACE Getting regional results
2020-11-25 16:21:56 WARN simpleWarning in gzfile(file, "rb"): cannot open compressed file 'subnational/united-kingdom/cases/national/South West/latest/summarised_estimates.rds', probable reason 'No such file or directory'

2020-11-25 16:21:56 TRACE reading runtimes.csv

local output from rstudio:

foo <- readRDS("nothere.RDS")
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file 'nothere.RDS', probable reason 'No such file or directory'
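
One possible explanation for seeing the warning but not the error, sketched in base R: `readRDS()` on a missing file signals the warning first, so a `tryCatch()` with a `warning` handler exits before the error is ever raised. How the pipeline's logging actually wraps this call is not shown in this thread, so both helper functions below are hypothetical illustrations.

```r
# Reports which condition an exiting tryCatch() sees first for a missing file.
first_condition <- function(path) {
  tryCatch(
    {
      readRDS(path)
      "none"
    },
    warning = function(w) "warning",
    error = function(e) "error"
  )
}

# Records and muffles warnings so that the error which follows is also handled.
read_or_null <- function(path) {
  warnings_seen <- character(0)
  value <- withCallingHandlers(
    tryCatch(readRDS(path), error = function(e) NULL),
    warning = function(w) {
      warnings_seen <<- c(warnings_seen, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  list(value = value, warnings = warnings_seen)
}

missing_file <- tempfile(fileext = ".RDS")  # a path that does not exist
seen <- first_condition(missing_file)
res <- read_or_null(missing_file)
```

Here `seen` is `"warning"`, while `read_or_null()` returns `NULL` for the value and keeps the warning text, which is the "warning and no error" behaviour discussed later in the thread.
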
sbfnk commented 3 years ago

Looks like most things are there except still no summary plots / csvs in some cases e.g. here: https://github.com/epiforecasts/covid-rt-estimates/tree/master/national/cases/summary Is this because some countries are no longer being estimated, as Sam suggests? E.g. here there are a bunch at the top that still have the old column headers: https://raw.githubusercontent.com/epiforecasts/covid-rt-estimates/master/national/cases/summary/rt.csv As it's only a few, could these just be removed manually?

joeHickson commented 3 years ago

@sbfnk I was unable to get the clean refresh to run yesterday (see prior comment). As such, today's results are unlikely to differ structurally from yesterday's.

sbfnk commented 3 years ago

Yes, but the non-refreshed overnight run of cases seems to have worked - so the only thing failing is the summary (presumably because of stray results from the past, from countries that are no longer being estimated)?

joeHickson commented 3 years ago

The overnight run processed, but some subregions are failing, so the summaries contain a mix of result shapes and are partially failing

sbfnk commented 3 years ago

As far as I can see if you removed old estimates for St. Kitts & Nevis, Fiji, British Virgin Islands, Western Sahara, New Caledonia, Nicaragua, Montserrat, Guinea-Bissau, there would be no more old results.

joeHickson commented 3 years ago

but when it is rerun it will then fail, because the folder will exist for those sublocations with a partial set of result files (all the failures at present seem to be timeouts) but not the summarised_estimates.RDS. A manual removal would have the same effect as using the refresh flag. I could temporarily run the region with a -e flag excluding the areas we anticipate timing out, but the next time they are included the summary will fail if those subregions don't process.

seabbs commented 3 years ago

After pulling locally and deleting all files in the St Kitts folder (but leaving the folder itself), this is what I see.

> results <- get_regional_results(
+                                 results_dir = "national/cases/national",
+                                 samples = FALSE,
+                                 forecast = FALSE)
Warning message:
In gzfile(file, "rb") :
  cannot open compressed file 'national/cases/national/St. Lucia/latest/summarised_estimates.rds', probable reason 'No such file or directory'
> 
> results
$estimates
$estimates$summarised
             region       date                 variable strat     type      median         mean           sd  lower_90   lower_50   lower_20    upper_20
     1: Afghanistan 2020-08-30                        R  <NA> estimate   1.0283317    1.0431218 1.592124e-01 0.8308071  0.9517083  0.9983677   1.0549592
     2: Afghanistan 2020-08-31                        R  <NA> estimate   1.0264497    1.0386057 1.449917e-01 0.8395174  0.9538103  0.9988531   1.0528583
     3: Afghanistan 2020-09-01                        R  <NA> estimate   1.0256086    1.0340139 1.320410e-01 0.8457254  0.9551772  0.9989090   1.0508716
     4: Afghanistan 2020-09-02                        R  <NA> estimate   1.0236139    1.0293454 1.203152e-01 0.8512935  0.9563708  0.9980154   1.0483355
     5: Afghanistan 2020-09-03                        R  <NA> estimate   1.0214960    1.0246156 1.097498e-01 0.8573886  0.9575690  0.9973209   1.0452908
    ---                                                                                                                                                 
102725:    Zimbabwe 2020-12-03           reported_cases  <NA> forecast  67.5000000  657.9985000 1.276615e+04 5.0000000 25.0000000 46.0000000 102.0000000
102726:    Zimbabwe 2020-12-04           reported_cases  <NA> forecast 100.5000000 1236.0882500 1.850625e+04 7.0000000 36.0000000 66.0000000 154.0000000
102727:    Zimbabwe 2020-12-05           reported_cases  <NA> forecast 129.0000000 3136.0827500 7.851510e+04 9.0000000 42.0000000 81.6000000 196.0000000
102728:    Zimbabwe 2020-12-06           reported_cases  <NA> forecast  79.5000000 1892.4100000 4.164878e+04 5.0000000 24.0000000 51.0000000 129.0000000
102729:    Zimbabwe       <NA> reporting_overdispersion  <NA>     <NA>   0.3106832    0.3322482 1.243521e-01 0.1797304  0.2379935  0.2817226   0.3399858
           upper_50     upper_90 bottom top lower upper central_lower central_upper
     1:   1.1005886    1.3386775     NA  NA    NA    NA            NA            NA
     2:   1.0962464    1.3084931     NA  NA    NA    NA            NA            NA
     3:   1.0910206    1.2784433     NA  NA    NA    NA            NA            NA
     4:   1.0869074    1.2548376     NA  NA    NA    NA            NA            NA
     5:   1.0834770    1.2140677     NA  NA    NA    NA            NA            NA
    ---                                                                            
102725: 200.0000000 1065.7500000     NA  NA    NA    NA            NA            NA
102726: 313.0000000 1890.4000000     NA  NA    NA    NA            NA            NA
102727: 403.2500000 2697.4500000     NA  NA    NA    NA            NA            NA
102728: 273.0000000 1970.3000000     NA  NA    NA    NA            NA            NA
102729:   0.3946213    0.5656102     NA  NA    NA    NA            NA            NA

With fault tolerance working as expected (i.e. by giving a warning and no error).

As this problem is in the summary, it can be debugged without rerunning estimates and trying to explore logs.

seabbs commented 3 years ago

Again running locally, and so being able to see errors, I see:

> regional_summary(reported_cases= reported_cases, results_dir = "national/cases/national", all_regions = FALSE) -> tmp
INFO [2020-11-26 13:31:56] No summary directory specified so returning summary output
INFO [2020-11-26 13:31:56] Extracting results from: national/cases/national
Error in data.table::rbindlist(numeric_estimate) : 
  Item 53 has 7 columns, inconsistent with item 1 which has 9 columns. To fill missing columns use fill=TRUE.
In addition: Warning messages:
1: In gzfile(file, "rb") :
  cannot open compressed file 'national/cases/national/St. Lucia/latest/summarised_estimates.rds', probable reason 'No such file or directory'
2: In gzfile(file, "rb") :

seabbs commented 3 years ago

This is the issue pointed out by Seb: older estimates (with the old column layout) are still present among the estimates published to GitHub.
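
The error text itself points at `fill = TRUE`. A toy sketch of that behaviour with `data.table::rbindlist()` follows; the column names are made up for illustration, and this is not necessarily how the patch in EpiNow2 handles it.

```r
library(data.table)

# Toy stand-ins: the "new" result has a column the "old" archived result lacks.
new_format <- data.table(region = "Afghanistan", median = 1.03, lower_90 = 0.83)
old_format <- data.table(region = "Fiji", median = 0.95)

# Without fill = TRUE the bind fails, as in the traceback above;
# the error message is captured here instead of stopping the script.
strict <- tryCatch(
  rbindlist(list(new_format, old_format)),
  error = function(e) conditionMessage(e)
)

# With fill = TRUE columns are matched by name and missing ones padded with NA.
combined <- rbindlist(list(new_format, old_format), fill = TRUE)
```

Filling only pads missing columns. Renamed columns would still sit side by side under their old and new names, so removing the archived estimates remains the cleaner fix.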

seabbs commented 3 years ago

Patching that (see EpiNow2@v1.3.2) I now see the following successful summary:

> reported_cases <- data.table::as.data.table(covidregionaldata::get_national_data())[, .(date, region = country, confirm = cases_new)]
> regional_summary(reported_cases= reported_cases, results_dir = "national/cases/national", all_regions = FALSE) -> tmp
INFO [2020-11-26 13:47:15] No summary directory specified so returning summary output
INFO [2020-11-26 13:47:15] Extracting results from: national/cases/national
Warning messages:
1: In gzfile(file, "rb") :
  cannot open compressed file 'national/cases/national/St. Lucia/latest/summarised_estimates.rds', probable reason 'No such file or directory'
2: In gzfile(file, "rb") :
  cannot open compressed file 'national/cases/national/St. Lucia/latest/summary.rds', probable reason 'No such file or directory'
> tmp
$latest_date
[1] "2020-11-26"

$results
$results$estimates
$results$estimates$summarised
             region       date                 variable strat     type      median         mean           sd  lower_90   lower_50   lower_20    upper_20
     1: Afghanistan 2020-08-30                        R  <NA> estimate   1.0283317    1.0431218 1.592124e-01 0.8308071  0.9517083  0.9983677   1.0549592
     2: Afghanistan 2020-08-31                        R  <NA> estimate   1.0264497    1.0386057 1.449917e-01 0.8395174  0.9538103  0.9988531   1.0528583
     3: Afghanistan 2020-09-01                        R  <NA> estimate   1.0256086    1.0340139 1.320410e-01 0.8457254  0.9551772  0.9989090   1.0508716
     4: Afghanistan 2020-09-02                        R  <NA> estimate   1.0236139    1.0293454 1.203152e-01 0.8512935  0.9563708  0.9980154   1.0483355
     5: Afghanistan 2020-09-03                        R  <NA> estimate   1.0214960    1.0246156 1.097498e-01 0.8573886  0.9575690  0.9973209   1.0452908

Where the warnings indicate missing results but should cause no failure.

seabbs commented 3 years ago

Updating this to save to disk I see the following:

Screenshot 2020-11-26 at 13 50 34

which looks successful.

seabbs commented 3 years ago

Repeating with data deletions at random I still see success.

Screenshot 2020-11-26 at 13 56 29

joeHickson commented 3 years ago

It might be that the warning is just a warning and something else is falling over. It's always fun debugging this lot! I'll try flicking us over to 1.3.2 (it looks like it's currently 1.3.0) and see what gives if I run it with --refresh.

Update: I see you beat me to the 1.3.2 trick ;)

joeHickson commented 3 years ago

That's warming up all the cores now running with --refresh. I'll try and keep an eye out for the first UK cases dataset to finish (it should push to git if it doesn't error)

joeHickson commented 3 years ago
2020-11-26 15:08:30 INFO Regions with estimates: 9
2020-11-26 15:08:30 INFO Regions with runtime errors: 3
2020-11-26 15:08:30 INFO Runtime error in Midlands : Midlands: model fitting was timed out - try increasing the max_execution_time - 
2020-11-26 15:08:30 INFO Runtime error in South West : South West: model fitting was timed out - try increasing the max_execution_time - 
2020-11-26 15:08:30 INFO Runtime error in United Kingdom : United Kingdom: model fitting was timed out - try increasing the max_execution_time - 
2020-11-26 15:08:30 INFO Saving timings information to : subnational/united-kingdom/cases/national
2020-11-26 15:08:30 DEBUG resetting future plan to sequential
2020-11-26 15:08:30 TRACE generating summary data
2020-11-26 15:08:30 INFO Saving summary to : subnational/united-kingdom/cases/summary
2020-11-26 15:08:30 INFO Extracting results from: subnational/united-kingdom/cases/national
2020-11-26 15:08:30 TRACE Getting regional results
2020-11-26 15:08:30 WARN simpleWarning in gzfile(file, "rb"): cannot open compressed file 'subnational/united-kingdom/cases/national/Midlands/latest/summarised_estimates.rds', probable reason 'No such file or directory'

2020-11-26 15:08:30 TRACE reading runtimes.csv
2020-11-26 15:08:30 TRACE naming output
2020-11-26 15:08:30 DEBUG add stats to output
2020-11-26 15:08:30 TRACE publish_data function
joeHickson commented 3 years ago

I don't think that's produced any summary files again - https://github.com/epiforecasts/covid-rt-estimates/tree/master/subnational/united-kingdom/cases

seabbs commented 3 years ago

Again debugging locally I see the following:

library(data.table)
library(EpiNow2)
library(covidregionaldata)

reported_cases <- fread("subnational/united-kingdom/cases/summary/reported_cases.csv")

regional_summary(reported_cases = reported_cases, 
                 results_dir = "subnational/united-kingdom/cases/national",
                 summary_dir = "subnational/united-kingdom/cases/summary",
                 all_regions = TRUE)

INFO [2020-11-26 15:30:07] Saving summary to : subnational/united-kingdom/cases/summary
INFO [2020-11-26 15:30:07] Extracting results from: subnational/united-kingdom/cases/national
Error: Incompatible classes: <IDate> + <Period>
In addition: Warning messages:
1: In gzfile(file, "rb") :
  cannot open compressed file 'subnational/united-kingdom/cases/national/Midlands/latest/summarised_estimates.rds', probable reason 'No such file or directory'
2: In gzfile(file, "rb") :
  cannot open compressed file 'subnational/united-kingdom/cases/national/South West/latest/summarised_estimates.rds', probable reason 'No such file or directory'
3: In gzfile(file, "rb") :
  cannot open compressed file 'subnational/united-kingdom/cases/national/United Kingdom/latest/summarised_estimates.rds', probable reason 'No such file or directory'
4: In gzfile(file, "rb") :
  cannot open compressed file 'subnational/united-kingdom/cases/national/Midlands/latest/summary.rds', probable reason 'No such file or directory'
5: In gzfile(file, "rb") :
  cannot open compressed file 'subnational/united-kingdom/cases/national/South West/latest/summary.rds', probable reason 'No such file or directory'
6: In gzfile(file, "rb") :

seabbs commented 3 years ago

I do see all results except plots have been updated.

seabbs commented 3 years ago

Dropping into debug using debugonce(regional_summary), I see that this was due to get_regions_with_most_reports and an issue with the date formatting caused by saving and reading back in the reported cases. Adding the following resolved it:

reported_cases <- reported_cases[, date := as.Date(date)]
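
A small sketch of the round trip that seems to be behind this, assuming a recent data.table in which `fread()` parses ISO dates as `IDate` rather than `Date`. The toy data and temporary file are placeholders, not the real reported cases.

```r
library(data.table)

# Write toy reported cases to disk and read them back, as the pipeline does.
csv_path <- tempfile(fileext = ".csv")
toy_cases <- data.table(
  date = as.Date("2020-11-24") + 0:2,
  region = "Scotland",
  confirm = c(10L, 12L, 9L)
)
fwrite(toy_cases, csv_path)
reported_cases <- fread(csv_path)

# Typically "IDate" "Date" with a recent data.table (version-dependent)
class_after_read <- class(reported_cases$date)

# Coerce back to a plain Date, as in the fix above
reported_cases[, date := as.Date(date)]
class_after_fix <- class(reported_cases$date)
```

After the coercion the column is a plain `Date` again, which is the class it would have had before being saved.
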
seabbs commented 3 years ago

Running the following:

library(data.table)
library(EpiNow2)
library(covidregionaldata)

reported_cases <- fread("subnational/united-kingdom/cases/summary/reported_cases.csv")
reported_cases <- reported_cases[, date := as.Date(date)]

regional_summary(reported_cases = reported_cases, 
                 results_dir = "subnational/united-kingdom/cases/national",
                 summary_dir = "subnational/united-kingdom/cases/summary",
                 all_regions = TRUE)

Results in no errors and a folder structure as below which looks complete.

Screenshot 2020-11-26 at 15 38 54

joeHickson commented 3 years ago

I don't suppose it's something to do with the fact it's running with slightly different params?

regional_summary(
  reported_cases = cases,
  results_dir = "subnational/united-kingdom/cases/national",
  summary_dir = "subnational/united-kingdom/cases/summary",
  region_scale = "Region",
  all_regions = TRUE,
  return_output = FALSE
)
joeHickson commented 3 years ago

ignore that - I can see that it's just default values.

seabbs commented 3 years ago
library(data.table)
library(EpiNow2)
library(covidregionaldata)

reported_cases <- fread("subnational/united-kingdom/cases/summary/reported_cases.csv")
reported_cases <- reported_cases[, date := as.Date(date)]

regional_summary(reported_cases = reported_cases, 
                 results_dir = "subnational/united-kingdom/cases/national",
                 summary_dir = "subnational/united-kingdom/cases/summary",
                 region_scale = "Region",
                 all_regions = TRUE,
                 return_output = FALSE)

Updated and still works as expected.