GIS4DEV / GIS4DEV.github.io

Open Source GIScience & GIS for Development

Any issues / solutions with running the R script for Malcomb et al.? #38

Closed josephholler closed 3 years ago

josephholler commented 3 years ago

I've heard of a few folks having trouble getting the DHS data to download with the R script after verifying that the rdhs.json file contains the correct email, password, and project name. How common is this problem? Has anyone seen and resolved it?

I've also personally had trouble running the R script on two computers that had older installations of R on them. I fixed each of those by installing an updated version of R, then updating all of my installed packages, and THEN running the script after restarting R. Older versions of R / packages weren't correctly implementing the functions that import the DHS data and strip its labels (e.g. haven's zap_labels).
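
As a rough way to compare setups before rerunning the script, the sketch below reports the running R version and the installed version of a few packages named in this thread. The helper name report_versions is hypothetical (not part of the course script), and the package list is only an example.

```r
# Report the installed version of each package (NA when it is not installed).
# report_versions is a hypothetical helper, not part of the course script.
report_versions <- function(pkgs) {
  versions <- vapply(pkgs, function(p) {
    if (requireNamespace(p, quietly = TRUE)) {
      as.character(packageVersion(p))
    } else {
      NA_character_
    }
  }, character(1))
  data.frame(package = pkgs, version = versions, row.names = NULL)
}

R.version.string
report_versions(c("rdhs", "haven", "sf", "here"))

# To bring packages up to date (then restart R before rerunning the script):
# update.packages(ask = FALSE)
```

The update call is left commented out because it changes the local library; run it deliberately rather than as part of the script.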

Another common issue I've seen is not opening or starting an R project inside the root folder of the RP-Malcomb repository. This is important so that the scripts can find all of the referenced files & folders.

josephholler commented 3 years ago

FYI if you're able to download the household survey data separately, it's in SPSS format, and the next steps of R code wrangle it from SPSS into a good data frame in R. The village points data can probably be dealt with similarly to other shapefiles/geographic files.

josephholler commented 3 years ago

This line of code

dhs_downloads = get_datasets(
  c("MWHR61SV", "MWGE62FL", "MWHR4ESV", "MWGE4BFL"),
  all_lower = FALSE,
  download_option = "rds"
)

is downloading four datasets in .rds (R data) format, placing them inside a datasets folder, and saving the path names to the data in the dhs_downloads variable. If that code isn't working for whatever reason, you can probably download the necessary .rds files on your own and set the path names as you'd expect in the dhs_downloads table, OR even forget dhs_downloads and replace the later uses of dhs_downloads$MWGE62FL (and similar) with the path name to the correct .rds file. I still think the root problem is probably an out-of-date version of R, out-of-date versions of the specific packages being used (e.g. haven), or missing permission for particular files in your USAID DHS project.
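
A minimal sketch of that manual workaround, assuming dhs_downloads can be treated as a named list of file paths (as its use with $ above suggests). An .rds file is R's format for storing a single R object and is read with readRDS(). The commented path and the toy data frame are stand-ins, not the real DHS data.

```r
# Hypothetical location of a manually obtained file; adjust to your folders:
# rds_path <- file.path("data", "raw", "private", "datasets", "MWGE62FL.rds")

# Self-contained demonstration using a temporary stand-in file instead.
rds_path <- tempfile(fileext = ".rds")
saveRDS(data.frame(cluster = 1:3, lat = c(-13.9, -14.1, -15.0)), rds_path)

# Mimic the object the script expects: a named list of paths.
dhs_downloads <- list(MWGE62FL = rds_path)

# Later code can then read the data exactly as before.
dhsclusters <- readRDS(dhs_downloads$MWGE62FL)
str(dhsclusters)
```

Alternatively, skip dhs_downloads entirely and pass the path straight to readRDS().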

josephholler commented 3 years ago

This is the same issue: https://github.com/GIS4DEV/RP-Malcomb/issues/1 from Emmab:

Hey everyone, I tried to run the R script but ran into an issue when downloading the DHS data. This is the error I'm getting:

Error in names(filedatatypelistDHS) <- paste0("filedatatypelist", qdapRegex::rm_between(filedatatypelist_DHS_line, : 'names' attribute [1] must be the same length as the vector [0]

Anyone know what this means?

jackson-mumper commented 3 years ago

Mine only seems to be having trouble accessing the GPS data, which I downloaded easily from the website. But I'm not seeing any .rds files among these downloads, only .shp, .cpg, .dbf, .prj, .sbn, .sbx, .shx, and some metadata. What exactly is an .rds file? Would one of these work instead?

mtango99 commented 3 years ago

Trying to import the .SAV (SPSS) file and not working:

> library("memisc")
> dataset <- data.frame(as.data.set(spss.system.file("MWHR61FL.SAV")))
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'as.data.set': object 'encoding' not found

Trying to import the .shp file and not working:

> require(sf)
> shape <- read_sf(dsn = ".", layer = "MWGE62FL.shp")
Cannot open layer MWGE62FL.shp
Error in CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  : 
  Opening layer failed.

I also tried to make a new code block, and whenever I tried running code from it, it said:

'[first word of codeblock name]' is not recognized as an internal or external command, operable program or batch file

But if I didn't make a new code block, sometimes it wouldn't run things, and the text was all black instead of some of it being colored like in a code block.

josephholler commented 3 years ago

All of those files you mention should be kept together; collectively they are a shapefile. They can be imported into the script similarly to other shapefiles, e.g. the Livelihood Zones (see line 93). If data has been downloaded into a zipped folder, it needs to be unzipped. That can be done manually or with code similar to lines 44 and 50. Notice that the here function is being used to direct R to the correct folders for loading or saving data by placing each folder name in quotes.
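
A hedged sketch of reading a shapefile with sf, using the nc.shp example file that ships with the sf package as a stand-in for MWGE62FL.shp. It also shows one likely reason the read_sf() call earlier in the thread failed: when dsn is a folder, the layer name is given without the .shp extension. The zip path in the comment is hypothetical.

```r
library(sf)

# If the download is still zipped, unzip it first (hypothetical paths):
# unzip("MWGE62FL.zip", exdir = "MWGE62FL")

# Stand-in shapefile shipped with sf; substitute the path to MWGE62FL.shp.
shp <- system.file("shape/nc.shp", package = "sf")

# Option 1: give the full path to the .shp file.
pts <- read_sf(shp)

# Option 2: give the folder as dsn and the layer name WITHOUT ".shp".
pts2 <- read_sf(dsn = dirname(shp), layer = "nc")
```

Either call needs the sibling files (.dbf, .shx, .prj, and so on) to sit in the same folder as the .shp file.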

josephholler commented 3 years ago

For SPSS, maybe just use the read.spss() function in the foreign package, which should be installed by default?

In the meantime, I'm going to fire up my desktop, run the sessionInfo() function, and report back exactly which R version and packages were installed when this script worked. I'm still suspicious that the root cause is an out-of-date package...

josephholler commented 3 years ago

Results of sessionInfo() in an RStudio instance where all the code worked:

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] s2_1.0.4 ggplot2_3.3.3 readr_1.4.0 rdhs_0.7.1 classInt_0.4-3
[6] here_1.0.1 dplyr_1.0.5 stars_0.5-2 abind_1.4-5 sf_0.9-8
[11] haven_2.3.1 downloader_0.4

loaded via a namespace (and not attached):
[1] storr_1.2.5 tinytex_0.31 tidyselect_1.1.0 xfun_0.22
[5] purrr_0.3.4 colorspace_2.0-0 vctrs_0.3.7 generics_0.1.0
[9] htmltools_0.5.1.1 yaml_2.2.1 utf8_1.2.1 rlang_0.4.10
[13] e1071_1.7-6 pillar_1.6.0 glue_1.4.2 withr_2.4.1
[17] DBI_1.1.1 rappdirs_0.3.3 wk_0.4.1 lifecycle_1.0.0
[21] munsell_0.5.0 gtable_0.3.0 evaluate_0.14 knitr_1.32
[25] forcats_0.5.1 parallel_3.6.1 class_7.3-15 fansi_0.4.2
[29] Rcpp_1.0.6 KernSmooth_2.23-15 scales_1.1.1 lwgeom_0.2-6
[33] hms_1.0.0 digest_0.6.27 grid_3.6.1 rprojroot_2.0.2
[37] cli_2.4.0 tools_3.6.1 magrittr_2.0.1 proxy_0.4-25
[41] tibble_3.1.0 crayon_1.4.1 pkgconfig_2.0.3 ellipsis_0.3.1
[45] rmarkdown_2.7 rstudioapi_0.13 R6_2.5.0 units_0.7-1
[49] compiler_3.6.1

mtango99 commented 3 years ago

Looks like the only difference (other than some of my packages are more recent-- and there are a few differences in which packages are on the list) is that I don't have rstudioapi, and when I tried to install it, I got an error message:

> install.packages(rstudioapi)
Error in install.packages : object 'rstudioapi' not found

R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] foreign_0.8-81 s2_1.0.4 ggplot2_3.3.3 readr_1.4.0 rdhs_0.7.1
[6] classInt_0.4-3 here_1.0.1 dplyr_1.0.5 stars_0.5-2 abind_1.4-5
[11] sf_0.9-8 haven_2.3.1 downloader_0.4

loaded via a namespace (and not attached):
[1] storr_1.2.5 tinytex_0.31 tidyselect_1.1.0 xfun_0.22
[5] purrr_0.3.4 colorspace_2.0-0 vctrs_0.3.7 generics_0.1.0
[9] htmltools_0.5.1.1 yaml_2.2.1 utf8_1.2.1 rlang_0.4.10
[13] e1071_1.7-6 pillar_1.6.0 glue_1.4.2 withr_2.4.2
[17] DBI_1.1.1 rappdirs_0.3.3 wk_0.4.1 lifecycle_1.0.0
[21] munsell_0.5.0 gtable_0.3.0 evaluate_0.14 qdapRegex_0.7.2
[25] knitr_1.32 forcats_0.5.1 curl_4.3 parallel_4.0.5
[29] class_7.3-18 fansi_0.4.2 Rcpp_1.0.6 KernSmooth_2.23-18
[33] scales_1.1.1 lwgeom_0.2-6 jsonlite_1.7.2 brio_1.1.1
[37] hms_1.0.0 digest_0.6.27 stringi_1.5.3 getPass_0.2-2
[41] grid_4.0.5 rprojroot_2.0.2 cli_2.4.0 tools_4.0.5
[45] magrittr_2.0.1 proxy_0.4-25 tibble_3.1.0 crayon_1.4.1
[49] pkgconfig_2.0.3 ellipsis_0.3.1 httr_1.4.2 rstudioapi_0.13
[53] assertthat_0.2.1 rmarkdown_2.7 R6_2.5.0 units_0.7-1
[57] compiler_4.0.5
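
On the install.packages() error at the top of this comment: the message "object 'rstudioapi' not found" is what R reports when a package name is passed without quotes, because R then looks for a variable with that name. The output above also lists rstudioapi_0.13 among the namespace-loaded packages, so it may already have been installed. A small sketch, where is_installed is a hypothetical helper:

```r
# Package names are character strings, so they need quotes:
# install.packages(rstudioapi)    # fails: R looks for an object named rstudioapi
# install.packages("rstudioapi")  # correct form

# Check whether a package is already available before installing it.
# is_installed is a hypothetical helper, not part of the course script.
is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

is_installed("rstudioapi")
```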

mtango99 commented 3 years ago

I got most of the code to work, uploading the data manually.

Chunk 8 is giving me an error:

> ta_brks = filter(ta, !is.na(capacity_2010)) %>% {classIntervals(.$capacity_2010, 4, style = "jenks")$brks} # did not work
n greater than number of different finite values\nn reset to number of different finite valuesn same as number of different finite values\neach different finite value is a separate classError in sVar[1:(length(sVar) - 1)] : 
  only 0's may be mixed with negative subscripts

The code I used to upload data:

survey <- data.frame(read.spss(here("data", "raw", "private", "MWHR61SV", "MWHR61FL.SAV")), stringsAsFactors=FALSE)

survey2 <- data.frame(lapply(survey, function(x) as.numeric(as.character(x))))

require(sf)
shape = read_sf(here("data", "raw", "private","MWGE62FL", "MWGE62FL.shp")) %>%
  st_make_valid()

survey2 is just an edit of survey, because I needed to convert factors to numeric vectors so I could later run a sum on some columns. I then used survey2 to create the dhshh_2010 layer. I also needed it to be a data frame as opposed to a list, so I added the data.frame() function.

Citation for survey2 code: https://stackoverflow.com/questions/23915131/change-all-columns-from-factor-to-numeric-in-r

josephholler commented 3 years ago

If the error is in the classIntervals function, it looks like it's complaining that you're asking for 4 classes but have fewer distinct values to classify than that. Check the ta data frame: it should have 256 observations, 222 of which have data in the capacity_2010 column (the others are NA).
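
That check can be sketched in base R as below. The ta data frame here is a toy stand-in with made-up values, not the real traditional authorities table; only the column name capacity_2010 comes from the script.

```r
# Toy stand-in for the ta data frame (values are made up for illustration).
ta <- data.frame(capacity_2010 = c(NA, 12.5, 14.0, NA, 9.8, 14.0))

n_classes <- 4
values <- ta$capacity_2010[!is.na(ta$capacity_2010)]

n_rows     <- nrow(ta)              # total observations
n_valid    <- length(values)        # rows with data
n_distinct <- length(unique(values))  # distinct values available to classify

# Classification into n_classes needs more distinct values than classes.
enough_values <- n_distinct > n_classes

c(rows = n_rows, valid = n_valid, distinct = n_distinct)
enough_values
```

If n_valid is 0 on the real data, the problem is upstream of classIntervals, most likely in the join that fills capacity_2010.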

mtango99 commented 3 years ago

It looks like all of the capacity_2010 values are NULL because my join didn't work in the "r capacity in traditional authorities 2010" chunk, creating an empty table.

mtango99 commented 3 years ago

I'm wondering if applying the "survey2 <- data.frame(lapply(survey, function(x) as.numeric(as.character(x))))" line to the whole survey messed things up? It's weird that things worked up until I tried to do the join to create the ta_capacity_2010 variable.
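
One plausible explanation, not confirmed in this thread: by default read.spss() turns variables with value labels into factors whose levels are the label text, and as.numeric(as.character(x)) turns any non-numeric text into NA. If a column used as a join key was affected, the join would find no matches. The toy factors below illustrate the behaviour; they are not DHS data.

```r
# A factor whose levels are label text, and one whose levels are digits.
f_labels  <- factor(c("Rural", "Urban", "Rural"))
f_numbers <- factor(c("10", "20", "10"))

# Label text cannot be parsed as a number, so every value becomes NA.
from_labels <- suppressWarnings(as.numeric(as.character(f_labels)))

# Digit strings convert as intended.
from_numbers <- as.numeric(as.character(f_numbers))

from_labels   # NA NA NA
from_numbers  # 10 20 10
```

Counting NA values per column after the conversion, for example with colSums(is.na(survey2)), would show which variables were wiped out.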

mtango99 commented 3 years ago

Used the haven package command to upload survey data and it all worked. Thanks Kufre!

survey4 <- read_sav((here("data", "raw", "private", "MWHR61SV", "MWHR61FL.SAV")))
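
For reference, a small sketch of why the haven route tends to be easier: read_sav() keeps the numeric codes and stores the value labels alongside them, and zap_labels() drops the labels to leave plain vectors. The example uses the iris.sav file bundled with haven as a stand-in for MWHR61FL.SAV.

```r
library(haven)

# Stand-in SPSS file shipped with haven; substitute the path to MWHR61FL.SAV.
sav <- system.file("examples", "iris.sav", package = "haven")

# Labelled columns keep their numeric codes, with labels attached.
survey <- read_sav(sav)

# Drop the value labels, leaving ordinary numeric vectors.
survey_plain <- zap_labels(survey)

class(survey$Species)
class(survey_plain$Species)
```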

josephholler commented 3 years ago

Relatedly, Emma posted some solution code here: https://github.com/GIS4DEV/RP-Malcomb/issues/1#issuecomment-824229072

sanjana-roy commented 3 years ago

Hi all, bit late with finishing the report! I had to clear my global environment in R recently and made the mistake of not saving my final figures so I have to run the code all over again. After running this line of code:

set_rdhs_config(
  email = email,
  project = project,
  config_path = rdhs_json,
  global = FALSE,
  cache_path = here("data", "raw", "private")
)

I'm getting this error and it seems a little different from what was discussed here earlier:

cannot create file '/Users/sanjanaroy/data/raw/private/rdhs.json', reason 'No such file or directory'
[1] FALSE
Error in set_rdhs_config(email = email, project = project, config_path = rdhs_json, : For a local configuration, 'config_path' must be 'rdhs.json' or 'config_path' must already exist and end 'rdhs.json'

Would this be the case where I have to clear everything out and load stuff in through R again? Because I already have my rdhs.json files in the folder.

sanjana-roy commented 3 years ago

Oh that's probably because the path for "data/raw/private" is unclear but I do have my R script in the repository. Is there any way to specify the file path for all 'here' functions we use without repeating the entire path?

josephholler commented 3 years ago

Three thoughts:

1) Answering your last question: save the R project into the root of your repository directory. The here package picks up default path names from wherever the project is saved.
2) set_rdhs_config is trying to write a new rdhs.json file, which may already exist on your computer. Try deleting that file?
3) If you have already run all the rdhs code, then you already have the datasets downloaded on your computer in the raw/private folder. The challenge is just to read in the village survey points and the household survey data. Emma Brown, Maddie, and a few others have done that with slightly different code, and their examples are in this thread or in the issues of the RP-Malcomb repository.
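
A quick way to check the first two points, sketched with the here package. The folder names match the ones used in this thread. In the error above, the path began at /Users/sanjanaroy/data/..., which suggests here was treating the home folder as the project root at that moment.

```r
library(here)

# Show the folder that here is currently treating as the project root.
here()

# Build paths relative to that root; no need to repeat the full path.
private_dir <- here("data", "raw", "private")
config_file <- here("data", "raw", "private", "rdhs.json")

# Confirm the folder and the existing config file are where the script expects.
dir.exists(private_dir)
file.exists(config_file)
```

If here() does not print the repository folder, closing RStudio and reopening the project file from the repository root should reset it.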

sanjana-roy commented 3 years ago

Just had to re-open RStudio and everything seemed to work fine! I think it might have gotten confused with the directories when running the Wednesday lab code.

josephholler commented 3 years ago

Summer 2021, we got this to work more reliably by updating all R packages, including rdhs.