CDCgov / covid19-forecast-hub

A repository run by the US CDC to collect forecast data for the weekly new COVID-19 hospitalizations.
Apache License 2.0

Data Dictionary Production For Inform Visualization #118

Closed AFg6K7h4fhy2 closed 13 hours ago

AFg6K7h4fhy2 commented 4 days ago

For the scope of this PR, please reference issue #93 .

AFg6K7h4fhy2 commented 4 days ago

Items that require further consideration (this PR is not done yet):

AFg6K7h4fhy2 commented 4 days ago

For gen_tru_csv.R, I remembered that forecasttools has recode_locations. The file should be updated.

UPDATE: This has been done.

AFg6K7h4fhy2 commented 4 days ago

The args for pull_nhsn in gen_map_csv.R can be abstracted further (e.g. the reference date argument).

AFg6K7h4fhy2 commented 4 days ago

From SB:

For SB:

AFg6K7h4fhy2 commented 4 days ago

Move ./weekly-summaries/utils/ to ./src/code/ (from SB).

AFg6K7h4fhy2 commented 4 days ago

The folder structure ought (as presently decided) to look like either of:

or

The latter is preferred.

The only conceivable reason (at present) to use the former approach is if there would be a .txt or .md file with comments about the generated .csv.

AFg6K7h4fhy2 commented 4 days ago

Generate files for the first week of submissions, accounting for the data anomalies. Does Inform want us to take exclusions into account? SB: exclude some locations for the first week of submissions.

AFg6K7h4fhy2 commented 4 days ago

EDIT: This has been fixed.

Should `reference_date` be a command-line argument? Perhaps there should be some way of checking that the generated `reference_date` is a number plus the latest week as extracted from the generated csvs.

AFg6K7h4fhy2 commented 4 days ago

For / By Monday, December 02, 2024:

UPDATE: 2024-12-02

sbidari commented 4 days ago

Forecasts and truth data for the US are missing (in all three csvs).

all_forecasts.csv

truth_data.csv

AFg6K7h4fhy2 commented 1 day ago

EDIT: This has been fixed.

Re: Example `metadata` file: . The `forecast_name` will come from `model_name` and the `forecast_team` will come from `team_name`. Also, DHM mentions: the `pivot_hubverse_quantiles_wider` can accept any names, not just the ones used by default. Minor lapse on the author's end here...

AFg6K7h4fhy2 commented 1 day ago

What should the csv files be named? Currently:

sbidari commented 1 day ago

I suggest the following structure: weekly-summaries\reference-date\

where reference-date is of format YYYY-MM-DD. We add a new folder, named by the corresponding reference-date, every week, so reference-date should not be hardcoded in the code.
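A minimal sketch of the layout suggested above, assuming a hypothetical helper `make_summary_dir()` with a `base_dir` argument (neither name comes from this PR). It checks that the reference date is of format YYYY-MM-DD and creates the dated folder, so the date is passed in rather than hardcoded:

```r
# Hypothetical helper (not from the PR): build and create
# <base_dir>/<reference-date>/ for one week's csv files.
make_summary_dir <- function(reference_date, base_dir = "weekly-summaries") {
  # as.Date() returns NA when the string does not match the format
  parsed <- as.Date(reference_date, format = "%Y-%m-%d")
  # round-trip check rejects inputs such as "2024-1-5" or "2024-11-23x"
  if (is.na(parsed) || format(parsed, "%Y-%m-%d") != reference_date) {
    stop("reference_date must be of format YYYY-MM-DD: ", reference_date)
  }
  out_dir <- file.path(base_dir, reference_date)
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  out_dir
}
```

For example, `make_summary_dir("2024-11-23")` would create and return `weekly-summaries/2024-11-23`.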

AFg6K7h4fhy2 commented 1 day ago

Re: https://github.com/CDCgov/covid19-forecast-hub/pull/118#issuecomment-2511902915

This works, thank you SB.

AFg6K7h4fhy2 commented 1 day ago

EDIT: This has been fixed.

For the `all-forecasts` csv, want all submitted models (incl. the baseline and the ensemble). There will be a file with model inclusion into the ensemble.

For this code:

```r
# add forecast team and model name
current_forecasts <- current_forecasts |>
  dplyr::mutate(
    # extract model_name and team_name from
    # YAML metadata files
    forecast_team = sapply(model_id, function(model_id) {
      model_yaml_path <- file.path(model_metadata_path, paste0(model_id, ".yml"))
      # check if the YAML file exists
      if (file.exists(model_yaml_path)) {
        model_metadata <- yaml::read_yaml(model_yaml_path)
        # extract team_name
        return(model_metadata$team_name)
      } else {
        return(NA) # NA if file doesn't exist
      }
    }),
    forecast_fullnames = sapply(model_id, function(model_id) {
      model_yaml_path <- file.path(model_metadata_path, paste0(model_id, ".yml"))
      if (file.exists(model_yaml_path)) {
        model_metadata <- yaml::read_yaml(model_yaml_path)
        return(model_metadata$model_name)
      } else {
        return(NA) # NA if file doesn't exist
      }
    })
  )
```

A suitable replacement can come from: . This note originated in a call between DHM, SB, TM.

AFg6K7h4fhy2 commented 1 day ago

GitHub Actions (GHA) will likely break relative paths:

```r
# store base metadata path for use later
model_metadata_path <- "../../model-metadata/"

# get `covid19-forecast-hub` content
base_hub_path <- "../../"
hub_content <- hubData::connect_hub(base_hub_path)
```

These paths should become command-line arguments parsed with argparse: one argument for the output folder and one for the base hub path.

This note originated in a call between DHM, SB, TM.
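As a rough illustration of the two arguments, here is a base-R sketch that stands in for argparse so that it runs without extra packages. The function name `parse_hub_args()`, the flag names `--base-hub-path` and `--output-dir`, and the defaults are all assumptions, not what the PR ended up using:

```r
# Hypothetical parser (base R only, standing in for argparse):
# reads `--flag value` pairs and falls back to defaults.
parse_hub_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  # flag names and defaults are illustrative assumptions
  opts <- list(
    `base-hub-path` = ".",
    `output-dir` = "weekly-summaries"
  )
  i <- 1
  while (i <= length(args)) {
    flag <- sub("^--", "", args[i])
    if (!startsWith(args[i], "--") || !(flag %in% names(opts))) {
      stop("Unknown argument: ", args[i])
    }
    if (i == length(args)) {
      stop("Missing value for argument: ", args[i])
    }
    opts[[flag]] <- args[i + 1]
    i <- i + 2
  }
  opts
}

opts <- parse_hub_args(c("--base-hub-path", "/path/to/hub"))
# derive the metadata path from the hub root instead of "../../"
model_metadata_path <- file.path(opts[["base-hub-path"]], "model-metadata")
```

Deriving `model_metadata_path` from the hub root means only one path has to be correct when the script runs under GHA, whatever the working directory is.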

AFg6K7h4fhy2 commented 1 day ago

EDIT: This has been fixed.

Error if the current reference date's ensemble is not found:

Ensemble file behavior:

```r
# load the latest ensemble data from the
# model-output folder
ensemble_folder <- "../../model-output/CovidHub-ensemble/"
ensemble_file_current <- file.path(ensemble_folder, paste0(ref_date, "-CovidHub-ensemble.csv"))
if (file.exists(ensemble_file_current)) {
  ensemble_file <- ensemble_file_current
} else {
  ensemble_files <- list.files(
    ensemble_folder,
    pattern = "\\.csv$",
    full.names = TRUE
  )
  if (length(ensemble_files) == 0) {
    stop("No ensemble CSV files found in the directory.")
  }
  ensemble_file <- tail(ensemble_files, 1)
  message("Using the latest file: ", ensemble_file)
}
ensemble_data <- readr::read_csv(ensemble_file)
```

This note originated in a call between DHM, SB, TM.
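
One way the stricter behavior could look, as a sketch only: the helper name `get_ensemble_file()` is hypothetical, and the file-name pattern is copied from the snippet above. Instead of falling back to the latest available file, it stops when the ensemble for the requested reference date is missing:

```r
# Hypothetical helper (not from the PR): return the path of the
# ensemble csv for `ref_date`, or stop if it does not exist.
get_ensemble_file <- function(ensemble_folder, ref_date) {
  ensemble_file <- file.path(
    ensemble_folder,
    paste0(ref_date, "-CovidHub-ensemble.csv")
  )
  if (!file.exists(ensemble_file)) {
    stop(
      "Ensemble file not found for reference date ", ref_date,
      ": ", ensemble_file
    )
  }
  ensemble_file
}
```

The returned path can then be passed to `readr::read_csv()` as before.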
AFg6K7h4fhy2 commented 1 day ago

Issue is not expected here but renaming for consistency seems like a good idea (`yaml` → `yml`):

[Screenshot 2024-12-02 at 15 07 01]

AFg6K7h4fhy2 commented 1 day ago

EDIT: This has been fixed.

Bug! Printing `model_metadata` returns (even after the switch from the author's custom solution to the solution in `hubData`):

```
# A tibble: 11 × 19
   model_id              team_abbr model_abbr team_name model_name model_version
 1 CEPH-Rtrend_covid     CEPH      Rtrend_co… CEPH Lab… Rtrend CO… NA
 2 CMU-TimeSeries        CMU       TimeSeries Carnegie… AR ensemb… 1.0
 3 CMU-TimeSeries        CMU       TimeSeries Carnegie… AR ensemb… 1.0
 4 JHU_CSSE-CSSE_Ensemb… JHU_CSSE  CSSE_Ense… The Cent… CSSE Ense… NA
 5 MOBS-GLEAM_COVID      MOBS      GLEAM_COV… MOBS Lab… GLEAM COV… 1.0
 6 Metaculus-cp          Metaculus cp         Metaculus Metaculus… 1.0
 7 OHT_JHU-nbxd          OHT_JHU   nbxd       One Heal… NBEATS ex… NA
 8 OHT_JHU-nbxd          OHT_JHU   nbxd       One Heal… NBEATS ex… NA
 9 UM-DeepOutbreak       UM        DeepOutbr… Universi… DeepOutbr… 1.0
10 UMass-ar6_pooled      UMass     ar6_pooled UMass-Am… AR(6) mod… 1.0
11 UMass-gbqr            UMass     gbqr       UMass-Am… gradient … 1.0
```

Where are the `CovidHub-ensemble.yaml` and `CovidHub-baseline.yaml`? Changing, in `CovidHub-baseline.yaml`, the argument `model_contributors: []` to

```
model_contributors: [
  {
    "name": "Test",
    "affiliation": "Test",
    "email": "test@test.edu"
  }
]
```

produces

```
# A tibble: 12 × 19
   model_id              team_abbr model_abbr team_name model_name model_version
 1 CEPH-Rtrend_covid     CEPH      Rtrend_co… CEPH Lab… Rtrend CO… NA
 2 CMU-TimeSeries        CMU       TimeSeries Carnegie… AR ensemb… 1.0
 3 CMU-TimeSeries        CMU       TimeSeries Carnegie… AR ensemb… 1.0
 4 CovidHub-baseline     CovidHub  baseline   CovidHub… CovidHub … 1.0
 5 JHU_CSSE-CSSE_Ensemb… JHU_CSSE  CSSE_Ense… The Cent… CSSE Ense… NA
 6 MOBS-GLEAM_COVID      MOBS      GLEAM_COV… MOBS Lab… GLEAM COV… 1.0
 7 Metaculus-cp          Metaculus cp         Metaculus Metaculus… 1.0
 8 OHT_JHU-nbxd          OHT_JHU   nbxd       One Heal… NBEATS ex… NA
 9 OHT_JHU-nbxd          OHT_JHU   nbxd       One Heal… NBEATS ex… NA
10 UM-DeepOutbreak       UM        DeepOutbr… Universi… DeepOutbr… 1.0
11 UMass-ar6_pooled      UMass     ar6_pooled UMass-Am… AR(6) mod… 1.0
12 UMass-gbqr            UMass     gbqr       UMass-Am… gradient … 1.0
```

So `model_contributors` can't be empty. Also, there seem to be some duplicates in the rows listed (`CMU-TimeSeries` and `OHT_JHU-nbxd` each appear twice).

@dylanhmorris @sbidari

AFg6K7h4fhy2 commented 1 day ago

The author would appreciate guidance on handling the below

The author's code below also generates this warning:

Warning

```
Warning message:
In dplyr::left_join(dplyr::mutate(dplyr::mutate(forecasttools::pivot_hubverse_quantiles_wider(hubverse_table = current_forecasts,  :
  Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 213 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this warning.
```

Code

```r
dplyr::left_join(
    model_metadata, by = "model_id") |>
```

dylanhmorris commented 1 day ago

Good catch @AFg6K7h4fhy2. @sbidari could you edit the metadata to list yourself and/or (as you prefer) the COVIDHub team (with the overall contact email) as model contributors on the Hub models? Thanks!

AFg6K7h4fhy2 commented 1 day ago

Re: https://github.com/CDCgov/covid19-forecast-hub/pull/118#issuecomment-2512740542

The author will pull once issue #120 is completed.

sbidari commented 1 day ago

I think we should exclude the locations indicated here, or at least a subset of them, for the first week (reference-date = 2024-11-23). @dylanhmorris thoughts?

I forgot to mention this in the earlier meeting but had talked to @AFg6K7h4fhy2 previously about this

AFg6K7h4fhy2 commented 1 day ago

EDIT: These have been addressed.

These comments seem to be all that remain to be addressed: * * *

sbidari commented 1 day ago

The author would appreciate guidance on handling the below

This is likely due to the duplication of model_ids in model_metadata. You can try to extract only the unique model_id entries before the left_join. Will look into why there are duplicates in the model_metadata in #123
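
A small sketch of that suggestion, using toy stand-ins for the real tables (the column values and the name `model_metadata_unique` are illustrative assumptions). With one metadata row per `model_id`, the join becomes many-to-one and `dplyr::left_join()` no longer warns:

```r
library(dplyr)

# toy stand-ins for the real tables (values are illustrative)
model_metadata <- data.frame(
  model_id = c("CMU-TimeSeries", "CMU-TimeSeries", "UMass-gbqr"),
  team_name = c("Team A", "Team A", "Team B")
)
current_forecasts <- data.frame(
  model_id = c("CMU-TimeSeries", "UMass-gbqr"),
  value = c(10, 20)
)

# keep only the first metadata row for each model_id
model_metadata_unique <- dplyr::distinct(
  model_metadata, model_id,
  .keep_all = TRUE
)

# each forecast row now matches at most one metadata row
joined <- dplyr::left_join(
  current_forecasts, model_metadata_unique,
  by = "model_id"
)
```

Note that `.keep_all = TRUE` keeps the first row per `model_id`, so this is only safe if the duplicated metadata rows are identical, as they appear to be in the printed tibble.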

sbidari commented 13 hours ago

This is ready to merge. Thanks a lot @AFg6K7h4fhy2