epiforecasts / socialmixr

R package for deriving social mixing matrices from survey data.
http://epiforecasts.io/socialmixr/

CoMix UK data not downloading with get_survey() #108

Closed adamkucharski closed 2 months ago

adamkucharski commented 8 months ago

I was exploring adding a CoMix example to this how-to, because it's a nice illustration of the value of next generation matrices, but found that the get_survey() function started downloading and then seemed to freeze (see output below).

Seems to work OK with downloading other datasets in the Social contact data library, so not sure if I'm missing something, or if it's a formatting issue with that specific dataset?

get_survey("https://zenodo.org/doi/10.5281/zenodo.4905745")

Getting CoMix social contact data (UK).
Downloading https://zenodo.org/records/6542524/files/CoMix_uk_contact_common.csv
Downloading https://zenodo.org/records/6542524/files/CoMix_uk_contact_extra.csv
Downloading https://zenodo.org/records/6542524/files/CoMix_uk_hh_common.csv
Downloading https://zenodo.org/records/6542524/files/CoMix_uk_participant_common.csv
Downloading https://zenodo.org/records/6542524/files/CoMix_uk_participant_extra.csv
Downloading https://zenodo.org/records/6542524/files/CoMix_uk_sday.csv

Then, when I cancel:

Warning messages:
1: In load_survey(files) :
  Only 48243 matching values in 'X', 'part_id' columns when pulling comix_uk_sday.csv into 'participant' survey.
2: In load_survey(files) :
  225009 row(s) could not be matched when pulling comix_uk_sday.csv into 'participant' survey

sbfnk commented 7 months ago

I can't reproduce this - can you check it's definitely not a connection/computation issue (I need to wait ~1 minute on a laptop after the last line you quote for the call to finish).

adamkucharski commented 7 months ago

Thanks for the response. It looks like it is eventually loading, but it took quite a lot of time for me (~10 mins on a MacBook Air). A follow-up question I had was how to stratify on survey wave, e.g. subset on wave 1 participants in CoMix, and generate the corresponding social contact matrix. At the moment it's returning NA entries (see below), but there's probably something simple I'm not getting.

Full draft walkthrough is here.

# get UK CoMix data from 2020-22 - note this is slow to load
comix_uk <- get_survey("https://zenodo.org/doi/10.5281/zenodo.4905745")

# subset on 1st wave of surveys in April/May 2020
comix_uk_wave_1 <- comix_uk
comix_uk_wave_1$participants <- comix_uk$participants |> dplyr::filter(wave == 1)

contact_data_comix <- socialmixr::contact_matrix(
  comix_uk_wave_1,
  countries = "United Kingdom",
  age.limits = c(0, 5, 18, 40, 65),
  symmetric = TRUE
)
> contact_data_comix
$matrix
      contact.age.group
       [0,5) [5,18) [18,40) [40,65) 65+
  [1,]    NA     NA      NA      NA  NA
  [2,]    NA     NA      NA      NA  NA
  [3,]    NA     NA      NA      NA  NA
  [4,]    NA     NA      NA      NA  NA
  [5,]    NA     NA      NA      NA  NA

$demography
   age.group population proportion  year
      <char>      <num>      <num> <int>
1:     [0,5)    4370647         NA  1950
2:    [5,18)    8910009         NA  1950
3:   [18,40)   15997248         NA  1950
4:   [40,65)   15855240         NA  1950
5:       65+         NA         NA  1950

$participants
   age.group participants   proportion
      <char>        <int>        <num>
1:     [0,5)          192 0.1594684385
2:    [5,18)          998 0.8289036545
3:   [18,40)            1 0.0008305648
4:   [40,65)           11 0.0091362126
5:       65+            2 0.0016611296

sbfnk commented 7 months ago

It looks like it is eventually loading, but required quite a lot of time for me (~10 mins on MacBook Air)

Might be good to try to separate out whether it's a data download vs. processing issue. Can you

  1. update to the latest version (0.3.2 on CRAN)? We made some quite substantial speed improvements very recently
  2. download first, then process, i.e. something along the lines of
library("socialmixr")
dir.create("comix_data")
survey_files <- download_survey("https://zenodo.org/doi/10.5281/zenodo.4905745", "comix_data")
comix_uk <- load_survey(survey_files)

sbfnk commented 7 months ago

A follow up question I had was how to stratify on survey wave, e.g. subset on wave 1 participants in CoMix, and generate the corresponding social contact matrix - at the moment it's returning NA entries (see below), but there's probably something simple I'm not getting.

This looks like a bug - will put in separate issue.
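
In the meantime, it may be worth sanity-checking a wave subset before passing it to contact_matrix(). The following is only a sketch on invented toy tables, not a diagnosis of the bug: the part_id and wave columns follow the thread, while part_age, cnt_age and all values are assumptions made for illustration. It checks how the subset's participants fall into the age groups used above and how many contact rows still match them.

```r
# Toy stand-ins for a survey's participant and contact tables.
# Column names part_age and cnt_age are assumed; all values are invented.
participants <- data.frame(
  part_id  = 1:6,
  wave     = c(1, 1, 1, 2, 2, 3),
  part_age = c(3, 10, 45, 30, 70, 12)
)
contacts <- data.frame(
  part_id = c(1, 1, 2, 3, 4, 5, 6),
  cnt_age = c(35, 4, 11, 50, 28, 68, 40)
)

# Keep wave 1 participants, then only the contacts reported by them
wave_1 <- participants[participants$wave == 1, ]
contacts_wave_1 <- contacts[contacts$part_id %in% wave_1$part_id, ]

# Participants per age group, using the same lower limits as in the thread
age_breaks <- c(0, 5, 18, 40, 65, Inf)
age_counts <- table(cut(wave_1$part_age, breaks = age_breaks, right = FALSE))
print(age_counts)

# Number of contact rows that belong to the subset
print(nrow(contacts_wave_1))
```

An empty or near-empty age group in a check like this would at least show whether the subset itself, rather than the matrix calculation, is the source of missing entries.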

adamkucharski commented 7 months ago

Thanks. I've updated to 0.3.2. Profiling the above steps, download_survey("https://zenodo.org/doi/10.5281/zenodo.4905745", "comix_data") took me 16s and load_survey(survey_files) took 6.4 mins.
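
For anyone wanting to repeat this kind of comparison, one simple option is base R's system.time(). The sketch below uses a hypothetical helper, time_step(), which is not part of socialmixr, and demonstrates it on a stand-in workload so that it runs without a network connection; the socialmixr calls from the thread are shown only as commented-out usage.

```r
# Hypothetical helper (not part of socialmixr): evaluate an expression and
# report the elapsed wall-clock time in seconds alongside its value.
time_step <- function(label, expr) {
  # expr is a lazily evaluated argument, so it is only run inside system.time()
  timing <- system.time(value <- expr)
  message(sprintf("%s: %.1f s elapsed", label, timing[["elapsed"]]))
  list(value = value, elapsed = timing[["elapsed"]])
}

# Stand-in workload so the example is self-contained
demo <- time_step("demo step", {
  Sys.sleep(0.2)
  42
})

# With socialmixr installed, the same helper could wrap the two steps above:
# files <- time_step("download", download_survey(
#   "https://zenodo.org/doi/10.5281/zenodo.4905745", "comix_data"
# ))
# comix_uk <- time_step("load", load_survey(files$value))
```

Timing the two calls separately like this is what distinguishes a slow connection from slow processing.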

sbfnk commented 2 months ago

Could you reinstall with remotes::install_github("epiforecasts/socialmixr@clean-speedup") and re-do the profiling (especially load_survey)?

adamkucharski commented 2 months ago

Reinstalled that dev version. Profiling the above steps, download_survey("https://zenodo.org/doi/10.5281/zenodo.4905745", "comix_data") took me 29s and load_survey(survey_files) took 43s – much quicker for the latter!

sbfnk commented 2 months ago

Great - most of the remaining time is now spent merging the multiple survey files, where speed gains are less obvious. I'll assume the initial issue is addressed and close it.