Some na.strings are probably missing

lgnbhl commented 6 years ago

Firstly, thank you for this very useful package!

I got an error when using pxR::read.px in order to read some PX files from the Swiss Federal Statistical Office (or BFS) online database (https://www.pxweb.bfs.admin.ch/).

I presume that the error comes from a missing na.strings from the pxR::read.px function: "....." (5 dots)

Would it be possible to fix this problem? Many thanks in advance!

Example

library(pxR)
url <- "https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-1604000000_104"
dataset <- pxR::read.px(url)

## Error in scan(tc, na.strings = na.strings, quote = NULL, quiet = TRUE) :
## scan() expected 'a real', got '"....."'

martinzbinden commented 6 years ago

I get the same error when trying to read this other file from Swiss Federal Statistical Office (or BFS): https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-0702000000_104

lgnbhl commented 6 years ago

Hello Martin Zbinden,

I made a fork of the pxR package in order to make it compatible with the Swiss Federal Statistical Office (or BFS). My fork is just the result of my Pull Request.

Just try this code:

library(devtools)
install_github("lgnbhl/pxR", force = TRUE) # fork making pxR compatible with BFS 

library(pxR)
url <- "https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-0702000000_104"
dataset <- pxR::read.px(url)

Let me know if it works :-)

statzg commented 6 years ago

I have the same problem with bfs.admin.ch files. In my case it's "......" (six dots) which creates the problem. This would be fixed with including "....." and "......" as na.strings. I've submitted a pull request.
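To illustrate why the extra entries matter, here is a minimal base-R sketch that does not use pxR at all. The scan() arguments mirror the call shown in the error message earlier in this thread; the sample values and the variable names are made up for illustration.

```r
# Made-up values in the style of a PX DATA section, with dot markers for missings.
values <- '12 "....." 7 "......" 3'

# NA markers up to four dots only (an assumed "too short" list for this demo).
short_na <- c('"."', '".."', '"..."', '"...."', '":"')
# The same list extended with the five- and six-dot markers.
long_na <- c(short_na, '"....."', '"......"')

# With the short list, scan() hits '"....."' while expecting a number and errors.
failed <- tryCatch({
  scan(text = values, na.strings = short_na, quote = NULL, quiet = TRUE)
  FALSE
}, error = function(e) TRUE)

# With the extended list, the dot markers are read as NA.
parsed <- scan(text = values, na.strings = long_na, quote = NULL, quiet = TRUE)
is.na(parsed)
```

Because quote = NULL disables quote handling, the quotation marks are part of each token, which is why the na.strings entries have to include them.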

jay-sf commented 3 years ago

Hi @lgnbhl, I just came across your fork but it still does not work with this BFS data:

pxR::read.px("https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-1503040100_101")
# Warning in scan(filename, what = "character", sep = "\n", quiet = TRUE,  :
#   invalid input found on input connection 'https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-1503040100_101'
# Error in pxR::read.px("https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-1503040100_101") : 
#   The input file is malformed: data and varnames length differ

Any clues why this is happening? Sorry to address you, I'm not sure how/where to file this.

Cheers

PS: Ref.: https://www.bfs.admin.ch/bfs/de/home/statistiken/bildung-wissenschaft/bildungsabschluesse/tertiaerstufe-hochschulen/universitaere.assetdetail.13147037.html in case I got the link wrong, but I also tried on the downloaded data with the same warning/error

lgnbhl commented 3 years ago

Hi @jay-sf,

My guess is that pxR::read.px() fails to read PX files from the BFS on Windows. The function sometimes works on Mac and Linux, but not always. I don't fully understand why, and I haven't found a quick fix for it. I will remove my old fork, as it doesn't solve this issue.

Note also that I have the same issue with pxR::read.px() in my R package, which helps automate the extraction of data from the BFS: https://github.com/lgnbhl/BFS/issues/3.

jay-sf commented 3 years ago

@lgnbhl Thanks for your fast reply! Really strange; perhaps I'll try it on my Linux machine later. Great, I didn't know there was a BFS package! Too bad about the issue with pxR::read.px().

statzg commented 3 years ago

Hi there, I've been successful reading in px-files in Windows from BFS if I prepare them a little before reading them in:

# Read in the file and convert the encoding
x <- iconv(readLines(paste(folder, file, sep="/"), encoding="CP1252"), from="CP1252", to="Latin1", sub="")

# Replace the long missing-value markers to work around a bug in pxR.
x <- gsub("\"......\"", "\"....\"", x, fixed = TRUE)
x <- gsub("\".....\"", "\"....\"", x, fixed = TRUE)

#Write the file with the changes
fileConn<-file(paste(folder, file, sep="/"))
writeLines(x, con=fileConn, useBytes = TRUE)
close(fileConn)

Depending on the size of the px-File this takes a while.

It seems that pxR has a problem with "......". Hope this helps.
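The same preparation can be wrapped in a small function that writes to a separate file instead of overwriting the download. This is only a sketch: prepare_px and its arguments are hypothetical names, not part of pxR, and the default source encoding is an assumption that may not hold for every BFS file.

```r
# Hypothetical helper (not part of pxR): convert the encoding, shorten the
# five- and six-dot markers to four dots, and save the result to a new file.
prepare_px <- function(infile, outfile = tempfile(fileext = ".px"), from = "CP1252") {
  x <- readLines(infile, warn = FALSE)
  # sub = "" drops any bytes that cannot be converted.
  x <- iconv(x, from = from, to = "latin1", sub = "")
  # fixed = TRUE matches the dots literally rather than as regex wildcards.
  x <- gsub('"......"', '"...."', x, fixed = TRUE)
  x <- gsub('"....."', '"...."', x, fixed = TRUE)
  writeLines(x, outfile, useBytes = TRUE)
  outfile
}
```

The returned path can then be passed to pxR::read.px(), and the original file stays untouched in case the conversion has to be repeated with a different `from` encoding.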

lgnbhl commented 3 years ago

Hi @statzg ,

Thank you very much for sharing your fix! I will implement it in my BFS package.

ValParCH commented 2 years ago

Hi @statzg, I have been using your trick and it worked well, but it stopped working when I tried it on some other data from the BFS, and after that it also failed with older code that used to work. I don't know the cause, but I got this message:

file<-"px-x-0702000000_102_copy.px"
x <- iconv(readLines(paste(pt, file, sep="/"), encoding="CP1252 "), from="CP1252 ", to="Latin1", sub="")
x <- gsub("\"......\"", "\"....\"", x, fixed = TRUE)
x <- gsub("\".....\"", "\"....\"", x, fixed = TRUE)
fileConn<-file(paste(pt, file, sep="/"))
writeLines(x, con=fileConn, useBytes = TRUE)
close(fileConn)
data = read.px(paste(pt,file,sep="/"), na.strings = c('"."','".."','"..."','"...."','"....."','"....."','":"'))

#Error in stri_length(string) : 
#invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()

I found that converting from UTF-8 to latin1 did the trick though, so if anyone experiences the same issue, here's what worked for me:

file<-"px-x-0702000000_102.px"

x <- iconv(readLines(paste(pt, file, sep="/"), encoding="UTF-8"), from="UTF-8", to="Latin1", sub="")
x <- gsub("\"......\"", "\"....\"", x, fixed = TRUE)
x <- gsub("\".....\"", "\"....\"", x, fixed = TRUE)

#Write the file with the changes
fileConn<-file(paste(pt, file, sep="/"))
writeLines(x, con=fileConn, useBytes = TRUE)
close(fileConn)
data = read.px(paste(pt,file,sep="/"), na.strings = c('"."','".."','"..."','"...."','"....."','"......"','":"'))

Thanks again! Best
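Since some files in this thread needed CP1252 and others UTF-8 as the source encoding, one possible way to choose is to check whether the raw lines are valid UTF-8 first. Below is a rough sketch using base R's validUTF8(); guess_px_encoding is a hypothetical helper, and falling back to CP1252 is an assumption based on the files discussed here, not something pxR or the BFS documents.

```r
# Hypothetical helper: pick a `from` encoding for iconv().
# Assumption: anything that is not valid UTF-8 is treated as CP1252.
guess_px_encoding <- function(lines) {
  if (all(validUTF8(lines))) "UTF-8" else "CP1252"
}

# "Zürich" encoded as UTF-8.
utf8_line <- "Z\u00fcrich"
# "Zürich" as single-byte CP1252/Latin-1 (0xfc is the u-umlaut byte).
cp1252_line <- rawToChar(as.raw(c(0x5a, 0xfc, 0x72, 0x69, 0x63, 0x68)))

guess_px_encoding(utf8_line)
guess_px_encoding(cp1252_line)
```

Pure ASCII input is reported as UTF-8, which is harmless because the conversion leaves ASCII unchanged. The result could be used as the `from` argument of iconv() in the snippets above.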

cjgb / pxR

Some na.strings are probably missing #1
