Data Dictionary Production For Inform Visualization

AFg6K7h4fhy2 commented 4 days ago

For the scope of this PR, please reference issue #93.

AFg6K7h4fhy2 commented 4 days ago

Items that require further consideration (this PR is not done yet):

Relative imports.
Checks for the existence or status of certain data files prior to use.
Docstring formatting.
Conditions for overwriting csvs in ./output/.
Naming conventions for the generated csvs.
Handling presence / non-presence combinations for data files prior to use.

AFg6K7h4fhy2 commented 4 days ago

For gen_tru_csv.R, I remembered that forecasttools has recode_locations. The file should be updated.

UPDATE: This has been done.

AFg6K7h4fhy2 commented 4 days ago

The args for pull_nhsn in gen_map_csv.R can be abstracted further (e.g. the reference date argument).

AFg6K7h4fhy2 commented 4 days ago

From SB:

Index csv files by reference-date (called forecast-date in some csvs).

For SB:

Review that the generated csvs align with the columns listed in ./weekly-summaries/README.md.
Review docstrings in ./weekly-summaries/utils/.

AFg6K7h4fhy2 commented 4 days ago

Move ./weekly-summaries/utils/ to ./src/code/ (from SB).

AFg6K7h4fhy2 commented 4 days ago

The folder structure ought (as presently decided) to look like either of:

weekly-summaries (name subject to change)
- all-forecasts (name subject to change)
- YYYY-MM-DD (i.e. the reference date)
  - file.csv (name subject to change)
- etc...

or

weekly-summaries (name subject to change)
- all-forecasts (name subject to change)
- YYYY-MM-DD.csv (i.e. the reference date; name subject to change)
- etc...

The latter is preferred.

The only conceivable reason (at present) to use the former approach is if there would be a .txt or .md file with comments about the generated .csv.

AFg6K7h4fhy2 commented 4 days ago

Generate files for the first week of submissions, accounting for the data anomalies. Does Inform want us to take exclusions into account? SB: exclude some locations for the first week of submissions.

AFg6K7h4fhy2 commented 4 days ago

EDIT: This has been fixed.

Should `reference_date` be a command line argument? Maybe some way of checking that the generated `reference_date` is a number plus the latest week as extracted from the generated csvs.
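
One possible reading of the check described above, as a minimal base R sketch: confirm that `reference_date` falls a whole number of weeks on or after the latest week found in the generated csvs. The function name `check_reference_date` and the `latest_week` argument are hypothetical, and the weekly-gap interpretation is an assumption rather than something settled in this thread.

```r
# Hypothetical check: reference_date should be a whole number of weeks
# on or after the latest week present in the generated csvs.
check_reference_date <- function(reference_date, latest_week) {
  gap_days <- as.integer(as.Date(reference_date) - as.Date(latest_week))
  if (gap_days < 0 || gap_days %% 7 != 0) {
    stop(
      "reference_date ", reference_date,
      " is not a whole number of weeks after ", latest_week
    )
  }
  # number of weeks between the latest data week and the reference date
  gap_days %/% 7
}

weeks_ahead <- check_reference_date("2024-11-23", "2024-11-16")
```

If `reference_date` does become a command line argument, a check like this could run right after argument parsing so a mistyped date fails early.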

AFg6K7h4fhy2 commented 4 days ago

For / By Monday, December 02, 2024:

SB reviews the present state of the PR
- Review docstrings.
- Review syncing of README.md and output content.
- Review the three scripts in utils.
UPX3 modifies the file output paths and file name references
UPX3 modifies the location-exclusion behavior.
UPX3 modifies relative pathing used.
UPX3 cleans docstrings and comments.
DHM reviews PR.

UPDATE: 2024-12-02

UPX3 changes model_id to model
UPX3 includes forecast_teams, forecast_fullnames
UPX3 includes baseline and ensemble model (thought these were to be excluded?)

sbidari commented 4 days ago

forecasts and truth data for US are missing (in all three csvs)

all_forecasts.csv

rename column name model_id -> model
add columns forecast_teams, forecast_fullnames
we also want to include the baseline model "CovidHub-baseline" and the ensemble model "CovidHub-ensemble" to enable visual comparison with other models

truth_data.csv

exclude the locations indicated here. This is independent of the exclusions for the first week due to reporting error. Forecasts are not solicited for these locations.

AFg6K7h4fhy2 commented 1 day ago

EDIT: This has been fixed.

Re: Example `metadata` file: . The `forecast_name` will come from `model_name` and the `forecast_team` will come from `team_name`. Also, DHM mentions: the `pivot_hubverse_quantiles_wider` can accept any names, not just the ones used by default. Minor lapse on the author's end here...

AFg6K7h4fhy2 commented 1 day ago

What should the csv files be named? Currently:

map.csv
truth_data.csv
all_forecasts.csv

sbidari commented 1 day ago

I suggest the following structure: weekly-summaries\reference-date\

reference-date-map.csv
reference-date-truth_data.csv
reference-date-all_forecasts.csv

where reference-date is of format YYYY-MM-DD. We add a new folder named by the corresponding reference-date every week, so reference-date should not be hardcoded in the code.
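
A minimal base R sketch of how the suggested layout could be derived from the reference date alone, so that nothing is hardcoded. The helper name `summary_paths` and the `base_dir` default are hypothetical; only the folder and file naming pattern comes from the suggestion above.

```r
# Hypothetical helper: build the three weekly output paths for one
# reference date, following <base_dir>/<ref_date>/<ref_date>-<name>.csv.
summary_paths <- function(ref_date, base_dir = "weekly-summaries") {
  # as.Date() returns NA for input it cannot parse; the round trip below
  # also rejects strings that parse but are not zero-padded YYYY-MM-DD
  parsed <- as.Date(ref_date, format = "%Y-%m-%d")
  if (is.na(parsed) || format(parsed, "%Y-%m-%d") != ref_date) {
    stop("ref_date must be formatted as YYYY-MM-DD: ", ref_date)
  }
  stems <- c("map", "truth_data", "all_forecasts")
  out_dir <- file.path(base_dir, ref_date)
  paths <- file.path(out_dir, paste0(ref_date, "-", stems, ".csv"))
  names(paths) <- stems
  paths
}

paths <- summary_paths("2024-11-23")
```

Each script could then write to `paths[["map"]]`, `paths[["truth_data"]]`, or `paths[["all_forecasts"]]` after creating the folder with `dir.create(dirname(paths[["map"]]), recursive = TRUE)`.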

AFg6K7h4fhy2 commented 1 day ago

Re: https://github.com/CDCgov/covid19-forecast-hub/pull/118#issuecomment-2511902915

This works, thank you SB.

AFg6K7h4fhy2 commented 1 day ago

EDIT: This has been fixed.

For the `all-forecasts` csv, we want all submitted models (incl. the baseline and the ensemble). There will be a separate file recording which models are included in the ensemble.

For this code:

```r
# add forecast team and model name
current_forecasts <- current_forecasts |>
  dplyr::mutate(
    # extract model_name and team_name from
    # YAML metadata files
    forecast_team = sapply(model_id, function(model_id) {
      model_yaml_path <- file.path(model_metadata_path, paste0(model_id, ".yml"))
      # check if the YAML file exists
      if (file.exists(model_yaml_path)) {
        model_metadata <- yaml::read_yaml(model_yaml_path)
        # extract team_name
        return(model_metadata$team_name)
      } else {
        return(NA) # NA if file doesn't exist
      }
    }),
    forecast_fullnames = sapply(model_id, function(model_id) {
      model_yaml_path <- file.path(model_metadata_path, paste0(model_id, ".yml"))
      if (file.exists(model_yaml_path)) {
        model_metadata <- yaml::read_yaml(model_yaml_path)
        return(model_metadata$model_name)
      } else {
        return(NA) # NA if file doesn't exist
      }
    })
  )
```

A suitable replacement can come from: . This note originated in a call between DHM, SB, TM.

AFg6K7h4fhy2 commented 1 day ago

GitHub Actions (GHA) will likely break relative paths:

```r
# store base metadata path for use later
model_metadata_path <- "../../model-metadata/"

# get `covid19-forecast-hub` content
base_hub_path <- "../../"
hub_content <- hubData::connect_hub(base_hub_path)
```

These hardcoded paths should become command-line arguments parsed with argparse: one for the output folder and one for the base hub path.

This note originated in a call between DHM, SB, TM.
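
To illustrate the idea of passing both paths in from the command line, here is a minimal sketch that uses only base R as a stand-in for argparse (which would handle help text, types, and validation more robustly). The flag names `--base-hub-path` and `--output-dir`, their defaults, and the function name `parse_cli_args` are assumptions made for illustration.

```r
# Base R stand-in for an argparse-style parser.
# Flag names and defaults are hypothetical, not taken from the PR.
parse_cli_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  opts <- list(base_hub_path = ".", output_dir = "output")
  flags <- c("--base-hub-path" = "base_hub_path", "--output-dir" = "output_dir")
  i <- 1
  while (i <= length(args)) {
    flag <- args[i]
    if (!flag %in% names(flags)) stop("Unknown argument: ", flag)
    if (i == length(args)) stop("Missing value for ", flag)
    # store the value that follows the flag under its option name
    opts[[flags[[flag]]]] <- args[i + 1]
    i <- i + 2
  }
  opts
}

# Example invocation with explicit arguments instead of commandArgs()
opts <- parse_cli_args(c("--base-hub-path", "/repo", "--output-dir", "/tmp/out"))

# Derive the metadata path from the hub path rather than from "../../"
model_metadata_path <- file.path(opts$base_hub_path, "model-metadata")
```

With the hub path supplied by the caller, a GitHub Actions workflow can pass the checkout location directly, and the scripts no longer depend on the working directory they are launched from.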

AFg6K7h4fhy2 commented 1 day ago

EDIT: This has been fixed.

Error if the current reference date's ensemble is not found:

Ensemble file behavior:

```r
# load the latest ensemble data from the
# model-output folder
ensemble_folder <- "../../model-output/CovidHub-ensemble/"
ensemble_file_current <- file.path(
  ensemble_folder,
  paste0(ref_date, "-CovidHub-ensemble.csv")
)
if (file.exists(ensemble_file_current)) {
  ensemble_file <- ensemble_file_current
} else {
  ensemble_files <- list.files(
    ensemble_folder,
    pattern = "\\.csv$",
    full.names = TRUE
  )
  if (length(ensemble_files) == 0) {
    stop("No ensemble CSV files found in the directory.")
  }
  ensemble_file <- tail(ensemble_files, 1)
  message("Using the latest file: ", ensemble_file)
}
ensemble_data <- readr::read_csv(ensemble_file)
```

This note originated in a call between DHM, SB, TM.
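
A minimal sketch of the stricter behavior requested above, in which a missing ensemble file for the current reference date is an error rather than a reason to fall back to the latest file. The function name `find_ensemble_file` is hypothetical, and this is not necessarily how the fix was implemented in the PR.

```r
# Hypothetical strict lookup: stop if the ensemble file for the given
# reference date does not exist, instead of using an older file.
find_ensemble_file <- function(ref_date, ensemble_folder) {
  ensemble_file <- file.path(
    ensemble_folder,
    paste0(ref_date, "-CovidHub-ensemble.csv")
  )
  if (!file.exists(ensemble_file)) {
    stop(
      "Ensemble file not found for reference date ", ref_date,
      ": ", ensemble_file
    )
  }
  ensemble_file
}

# Demonstration against a temporary folder holding one ensemble file
demo_folder <- file.path(tempdir(), "CovidHub-ensemble")
dir.create(demo_folder, showWarnings = FALSE)
file.create(file.path(demo_folder, "2024-11-23-CovidHub-ensemble.csv"))
found <- find_ensemble_file("2024-11-23", demo_folder)
```

Failing loudly here avoids silently pairing one week's forecasts with a different week's ensemble.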

AFg6K7h4fhy2 commented 1 day ago

Issue is not expected here but renaming for consistency seems like a good idea (yaml → yml):

AFg6K7h4fhy2 commented 1 day ago

EDIT: This has been fixed.

Bug! Printing `model_metadata` returns (even after the switch from the author's custom solution to the solution in `hubData`):

```
# A tibble: 11 × 19
   model_id              team_abbr model_abbr team_name model_name model_version
 1 CEPH-Rtrend_covid     CEPH      Rtrend_co… CEPH Lab… Rtrend CO… NA
 2 CMU-TimeSeries        CMU       TimeSeries Carnegie… AR ensemb… 1.0
 3 CMU-TimeSeries        CMU       TimeSeries Carnegie… AR ensemb… 1.0
 4 JHU_CSSE-CSSE_Ensemb… JHU_CSSE  CSSE_Ense… The Cent… CSSE Ense… NA
 5 MOBS-GLEAM_COVID      MOBS      GLEAM_COV… MOBS Lab… GLEAM COV… 1.0
 6 Metaculus-cp          Metaculus cp         Metaculus Metaculus… 1.0
 7 OHT_JHU-nbxd          OHT_JHU   nbxd       One Heal… NBEATS ex… NA
 8 OHT_JHU-nbxd          OHT_JHU   nbxd       One Heal… NBEATS ex… NA
 9 UM-DeepOutbreak       UM        DeepOutbr… Universi… DeepOutbr… 1.0
10 UMass-ar6_pooled      UMass     ar6_pooled UMass-Am… AR(6) mod… 1.0
11 UMass-gbqr            UMass     gbqr       UMass-Am… gradient … 1.0
```

Where are the `CovidHub-ensemble.yaml` and `CovidHub-baseline.yaml`? Changing, in `CovidHub-baseline.yaml`, the argument `model_contributors: []` to

```
model_contributors: [
  {
    "name": "Test",
    "affiliation": "Test",
    "email": "test@test.edu"
  }
]
```

produces

```
# A tibble: 12 × 19
   model_id              team_abbr model_abbr team_name model_name model_version
 1 CEPH-Rtrend_covid     CEPH      Rtrend_co… CEPH Lab… Rtrend CO… NA
 2 CMU-TimeSeries        CMU       TimeSeries Carnegie… AR ensemb… 1.0
 3 CMU-TimeSeries        CMU       TimeSeries Carnegie… AR ensemb… 1.0
 4 CovidHub-baseline     CovidHub  baseline   CovidHub… CovidHub … 1.0
 5 JHU_CSSE-CSSE_Ensemb… JHU_CSSE  CSSE_Ense… The Cent… CSSE Ense… NA
 6 MOBS-GLEAM_COVID      MOBS      GLEAM_COV… MOBS Lab… GLEAM COV… 1.0
 7 Metaculus-cp          Metaculus cp         Metaculus Metaculus… 1.0
 8 OHT_JHU-nbxd          OHT_JHU   nbxd       One Heal… NBEATS ex… NA
 9 OHT_JHU-nbxd          OHT_JHU   nbxd       One Heal… NBEATS ex… NA
10 UM-DeepOutbreak       UM        DeepOutbr… Universi… DeepOutbr… 1.0
11 UMass-ar6_pooled      UMass     ar6_pooled UMass-Am… AR(6) mod… 1.0
12 UMass-gbqr            UMass     gbqr       UMass-Am… gradient … 1.0
```

So `model_contributors` can't be empty. Also, there seem to be some duplicates in the rows listed (`CMU-TimeSeries` and `OHT_JHU-nbxd` each appear twice).
@dylanhmorris @sbidari

AFg6K7h4fhy2 commented 1 day ago

The author would appreciate guidance on handling the below

There is this warning as well that is generated from the author's code below:

Warning

Warning message:
In dplyr::left_join(dplyr::mutate(dplyr::mutate(forecasttools::pivot_hubverse_quantiles_wider(hubverse_table = current_forecasts,  :
  Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 213 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this warning.

Code

dplyr::left_join(
    model_metadata, by = "model_id") |>

dylanhmorris commented 1 day ago

Good catch @AFg6K7h4fhy2. @sbidari could you edit the metadata to list yourself and/or (as you prefer) the COVIDHub team (with the overall contact email) as model contributors on the Hub models? Thanks!

AFg6K7h4fhy2 commented 1 day ago

Re: https://github.com/CDCgov/covid19-forecast-hub/pull/118#issuecomment-2512740542

The author will pull once issue #120 is completed.

sbidari commented 1 day ago

I think we should exclude the locations indicated here, or at least a subset of them, for the first week (reference-date = 2024-11-23). @dylanhmorris thoughts?

I forgot to mention this in the earlier meeting but had talked to @AFg6K7h4fhy2 previously about this

AFg6K7h4fhy2 commented 1 day ago

EDIT: These have been addressed.

These comments seem to be all that remain to be addressed: * * *

sbidari commented 1 day ago

> The author would appreciate guidance on handling the below

This is likely due to the duplication of model_ids in model_metadata. You can try to extract only the unique model_id entries before the left_join. Will look into why there are duplicates in the model_metadata in #123
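
A minimal base R sketch of that suggestion, using toy data: keep one row per `model_id` in the metadata before joining, so each forecast row matches at most one metadata row. The data frames and their values are made up for illustration. In the dplyr pipeline, the equivalent step would be `dplyr::distinct(model_id, .keep_all = TRUE)` applied to `model_metadata` before the `dplyr::left_join()`.

```r
# Toy stand-ins for the real tables; values are illustrative only.
model_metadata <- data.frame(
  model_id = c("CMU-TimeSeries", "CMU-TimeSeries", "UMass-gbqr"),
  team_name = c("Team A", "Team A", "Team B"),
  stringsAsFactors = FALSE
)
current_forecasts <- data.frame(
  model_id = c("CMU-TimeSeries", "UMass-gbqr"),
  value = c(1.5, 2.5),
  stringsAsFactors = FALSE
)

# Keep the first row for each model_id, dropping later duplicates
metadata_unique <- model_metadata[!duplicated(model_metadata$model_id), ]

# Left join: every forecast row is kept and gains its metadata columns
joined <- merge(current_forecasts, metadata_unique, by = "model_id", all.x = TRUE)
```

With unique keys on the metadata side, the join is many-to-one and the row count of the forecasts table is preserved.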

sbidari commented 13 hours ago

This is ready to merge. Thanks a lot @AFg6K7h4fhy2

CDCgov / covid19-forecast-hub

Data Dictionary Production For Inform Visualization #118