ProjectMOSAIC / mosaic

Project MOSAIC R package
http://mosaic-web.org/

CIAdata() appears to be broken #441

Closed rpruim closed 9 years ago

rpruim commented 9 years ago
> gdpData <- CIAdata("GDP")      # load some world data
Error in names(table) <- c("country", name) : 
  'names' attribute [2] must be the same length as the vector [0]

A little debugging leads me to suspect that the file format has changed. We are trying to read files with

  table <- read.delim(textConnection(RCurl::getURL(url, ssl.verifypeer = FALSE)),
                      header = FALSE, stringsAsFactors = FALSE)

which suggests we are expecting tab-delimited files. But the files now appear to be in a fixed-width format, and I don't know how we would get the column widths.

It looks like using a double space as separator would work (at least for this file https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2001.txt), but the read.* functions appear to only allow one-character separators.

My current solution is to replace

  table <- read.delim(textConnection(RCurl::getURL(url, ssl.verifypeer = FALSE)),
                      header = FALSE, stringsAsFactors = FALSE)

with

  lines <- readLines(textConnection(RCurl::getURL(url, ssl.verifypeer = FALSE)))
  table <- as.data.frame(do.call(rbind, strsplit(lines, "  +")), stringsAsFactors = FALSE)

This seems to be working in limited testing:

> gdpData <- CIAdata("GDP")      # load some world data
Loading required namespace: RCurl
> head(gdpData)
         country      GDP
1          China 1.76e+13
2 European Union 1.76e+13
3  United States 1.75e+13
4          India 7.28e+12
5          Japan 4.81e+12
6        Germany 3.62e+12

But I don't know with certainty that it will work on all files.
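
To make the idea concrete, here is a minimal sketch of the double-space split on a few made-up lines. The sample lines, the three-column layout (rank, country, value), and the column names are assumptions for illustration only; they are not copied from the CIA file, and this is not the code inside CIAdata().

```r
# Hypothetical sample lines: columns separated by runs of two or more spaces.
# The values are illustrative, not real Factbook data.
sample_lines <- c(
  "1      China                 17600000000000",
  "2      European Union        17600000000000",
  "3      United States         17500000000000"
)

# Split on two or more spaces; the single space inside
# "European Union" is left intact.
fields <- strsplit(trimws(sample_lines), "  +")

# Bind the per-line character vectors into rows of a data frame.
parsed <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(parsed) <- c("rank", "country", "value")  # assumed column names
parsed$value <- as.numeric(parsed$value)

print(parsed)
```

The regular expression "  +" (two spaces followed by a plus) is what gets around the one-character limit on sep in the read.* functions. It relies on no field containing two consecutive spaces.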

rpruim commented 9 years ago

A little more testing:

> res <- Map(CIAdata, name=CIAdata()$Name)
Warning messages:
1: In (function (..., deparse.level = 1)  :
  number of columns of result is not a multiple of vector length (arg 1)
2: In (function (..., deparse.level = 1)  :
  number of columns of result is not a multiple of vector length (arg 2)
3: In (function (name = NULL)  : NAs introduced by coercion
> length(res)
[1] 75
> sapply(res, nrow) %>% favstats
 min  Q1 median  Q3 max     mean       sd  n missing
   3 181    211 219 257 193.0667 43.05412 75       0
> sapply(res, nrow) %>% sort %>% head
   inflation  abroadStock    waterways    homeStock     HIVdeath       shares 
           3          100          107          111          119          120 

Things seem good for all but inflation:

> CIAdata("inflation")
Retrieving data from https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2092.txt
                                                                                                                                                                 country
1                                                                                                                              (GEOCODE,CODEDESC,NUMFACT,TEXT,FIELDDESC)
2 Insert into FACTBOOK.RO_2092 (GEOCODE,CODEDESC,NUMFACT,TEXT,FIELDDESC) values ('BK','Bosnia and Herzegovina',' -0.80','2014 est.','Inflation rate (consumer prices)');
3               Insert into FACTBOOK.RO_2092 (GEOCODE,CODEDESC,NUMFACT,TEXT,FIELDDESC) values ('LS','Liechtenstein',' -0.70','2012','Inflation rate (consumer prices)');
  inflation
1        NA
2         2
3        NA
Warning messages:
1: In (function (..., deparse.level = 1)  :
  number of columns of result is not a multiple of vector length (arg 2)
2: In CIAdata("inflation") : NAs introduced by coercion

This seems to be messed up at CIA. The HTML page is bad too.

I'm going to leave inflation in our index of available data sets for now, in case the data return. Alternatively, we could just delete that row from CIA.rda.
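
The rbind warnings above ("number of columns of result is not a multiple of vector length") appear when lines split into different numbers of fields. A rough sketch of a check that could flag such files before binding follows. The helper name splits_cleanly and its arguments are hypothetical, not part of mosaic, and the "good" lines are made up; the "bad" line is the Liechtenstein row from the inflation output above.

```r
# Hypothetical helper: TRUE when every non-empty line splits into the same
# number of fields, and that number is at least `min_fields`.
splits_cleanly <- function(lines, sep = "  +", min_fields = 2) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]          # drop blank lines
  if (length(lines) == 0) return(FALSE)
  n <- lengths(strsplit(lines, sep))     # field count per line
  length(unique(n)) == 1 && n[1] >= min_fields
}

# Made-up lines in the expected layout.
good <- c("1      China           17600000000000",
          "2      India            7280000000000")

# A line shaped like the SQL text returned for inflation.
bad <- "Insert into FACTBOOK.RO_2092 (GEOCODE,CODEDESC,NUMFACT,TEXT,FIELDDESC) values ('LS','Liechtenstein',' -0.70','2012','Inflation rate (consumer prices)');"

splits_cleanly(good)
splits_cleanly(bad)
```

A check like this would let CIAdata() stop with a clear message for a malformed file instead of returning a three-row data frame with NAs.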