hi, i don't have access to most of dhs so it's difficult for me to debug. could you `debug(lodown:::lodown_dhs)` and see if there's an obvious fix?
Oh dear. I originally encountered this error on a Windows machine, which threw an error and exited (I don't have access to that machine now to give details). But on the ubuntu laptop I'm using now there is no error; all it does is unzip the file. It took me a while to figure out why this is...
But I think I've got it: the reason is that linux is case sensitive and Windows isn't. When I unzipped the SNxx7Qxx.zip file, this produced a second SNxx7Qxx.ZIP file, same name, different case, and this was presumably the tripping point.
On linux the fact that the extensions are in different case means there is no error, but also no extraction of the data files. So inside `load_fun()` the `unzipped_files` variable holds `c("SNxx7Qxx.ZIP", "SNxx6Rxx.ZIP")`. And then both checks for either `.dta` or `.sav` files return nothing, so nothing happens:

```r
if (any(st <- grepl("\\.sav$", tolower(unzipped_files)))) { ...
```

`st` is always `c(FALSE, FALSE)`.
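To illustrate (using the anonymized filenames from above, not real DHS filecodes):

```r
unzipped_files <- c("SNxx7Qxx.ZIP", "SNxx6Rxx.ZIP")

# the extracted "files" are themselves zip archives, so both extension
# checks come back all-FALSE and load_fun() silently does nothing
any(grepl("\\.dta$", tolower(unzipped_files)))  # FALSE
any(grepl("\\.sav$", tolower(unzipped_files)))  # FALSE
```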
So what you want is, immediately after the `unzipped_files <-` line, to first check whether there is a zipped file in `unzipped_files` (the same one as before) and, if so, unzip it again, then continue as before and check for `.sav` or `.dta`... But this only works because the top-level .zip extension is in lower case and the second one in upper case. And it would not work on a case-insensitive OS...
It is meant to be the only exception, these two Senegal files - I checked: https://userforum.dhsprogram.com/index.php?t=msg&th=6644&#msg_13768. Still, the reasoning is completely inexplicable to me - they are merging row-wise two surveys which you can download individually...
Anyway, that's all I've got the skills to uncover I'm afraid, but happy to test (on linux) ;)
thanks for looking at this so carefully! does the `ignore.case = TRUE` that i just added solve the issue?
No, that doesn't solve it I'm afraid. This would: after line 172, insert:
```r
if (any(st <- grepl(paste0(catalog[i, "filecode"], ".ZIP"), unzipped_files))) {
  unzipped_files <- unzip_warn_fail(paste0(catalog[i, "output_folder"], "/",
                                           catalog[i, "filecode"], ".ZIP"),
                                    exdir = catalog[i, "output_folder"])
}
```
This checks whether there is a ZIP file with a capitalised extension: this can only happen if one of the two offending Senegal files has just been unzipped, and not otherwise. If found, the second one is now unzipped and overwrites `unzipped_files`. Then continue as before.
This will still not work on Windows, where the first call to `unzip_warn_fail()` already exits.
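A more general approach - just a sketch on my part, not lodown's actual code, and assuming `unzipped_files` holds full paths - would be to keep extracting nested archives case-insensitively, into a subfolder so the inner SNxx7Qxx.ZIP can never collide with its parent SNxx7Qxx.zip on a case-insensitive filesystem:

```r
repeat {
  is_zip <- grepl("\\.zip$", unzipped_files, ignore.case = TRUE)
  if (!any(is_zip)) break
  unzipped_files <- c(
    unzipped_files[!is_zip],
    # extract each nested archive into its own subfolder; utils::unzip()
    # returns the paths of the files it extracted
    unlist(lapply(unzipped_files[is_zip], function(z)
      utils::unzip(z, exdir = file.path(dirname(z), "nested"))))
  )
}
```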
does this commit solve it? thanks
Close, but not quite: [See, I was only trying to unzip the one file, the one in the catalog - the second one is redundant anyway.]
But fair enough to extract everything! (Although then you have to rethink the whole catalog thing - the number of files will not match the number of rows.)
Anyway, there's still a problem: what happens then is that you have two `.dta` or `.sav` files, so the `st` variable has two TRUE instances. And that means you're passing two files to `haven::read_*`:

```r
x <- data.frame(haven::read_dta(unzipped_files[which(st)]))
```
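One way out - purely a sketch, assuming the catalog's `filecode` appears in the extracted filenames - would be to keep only the file that matches the current catalog row:

```r
# hypothetical selection logic, not lodown's: among the extracted data
# files, keep the one whose name contains this catalog row's filecode
data_files <- unzipped_files[st]
this_file <- data_files[grepl(catalog[i, "filecode"], basename(data_files),
                              ignore.case = TRUE)][1]
x <- data.frame(haven::read_dta(this_file))
```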
if https://github.com/ajdamico/lodown/commit/bca20c3087fe3bdd3ea96c8f0f0281eb81365aee doesn't solve it, could you submit a pull request that does? thanks
lots of edits in the past few days. i tried your account as well and everything worked till this point, which seemed like a file that ought to be skipped (my server has 700gb ram)
```r
x <- haven::read_dta("C:/Users/anthonyd/AppData/Local/Temp/15/Rtmpa4XjCw/Nepal/NPOD01FL.DTA")
Error in df_parse_dta_file(spec, encoding) :
  Failed to parse C:/Users/anthonyd/AppData/Local/Temp/15/Rtmpa4XjCw/Nepal/NPOD01FL.DTA: Unable to allocate memory.
```
weird, it works with `foreign::read.dta`, 5277 observations of 48 variables. i don't have access to stata here to check anything more now, will try tomorrow.
not sure how helpful this is: the NPOD01FL.DTA file was saved in stata version 6 [dta version 108], while according to their documentation `haven::read_dta` supports versions 8 and later.
this could explain what happened - the file opens with `foreign::read.dta`, and the other Nepal file from the same year is saved as stata v8|9 [dta 113], which is why that one works with `haven::read_dta`.
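(as an aside, you can check which version a .dta file claims to be without stata - a quick sketch on my part, relying on the fact that in pre-stata-13 files the first byte is the dataset format version; the helper name is mine:)

```r
# first byte of a pre-Stata-13 .dta file encodes the format version,
# e.g. 108 = Stata 6, 110/111 = Stata 7, 113 = Stata 8/9, 114 = Stata 10/11
dta_version <- function(path) as.integer(readBin(path, what = "raw", n = 1L))

dta_version("NPOD01FL.DTA")  # expected: 108
```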
still: so maybe 108 would not throw an error? (i don't understand any of this, just a guess.)
unfortunately i don't have access to any other stata 6 [108] version files to test. i even tried some older dhs files, and they all seem to be v8|9 or later. this may well be an issue to raise with haven, but for you it's probably easiest to hardcode this file to use `foreign::read.dta` instead (although mind you, the conversion to data.frames is not identical...)
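something like this, say - just a sketch, with `dta_file` standing in for whatever path lodown has at that point:

```r
# try haven first, fall back to foreign for old .dta files it cannot
# parse; note the two readers do not return identical data.frames
# (labelled columns vs. factors, among other differences)
x <- tryCatch(
  data.frame(haven::read_dta(dta_file)),
  error = function(e) foreign::read.dta(dta_file, convert.factors = FALSE)
)
```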
Hi Anthony! I've been using `lodown_dhs()` and it works beautifully, except... there seem to be two anomalous files: two Senegal files have an extra level of folders - one that we are expecting, and another 'merged' one, and I am not clear why it's there. Anyway, the offending files, with the subdirectories in square brackets, are: