ajdamico / lodown

locally download and prepare publicly-available microdata
GNU General Public License v3.0

(at least) two dhs files are anomalous #126

Closed majazaloznik closed 6 years ago

majazaloznik commented 6 years ago

Hi Anthony! I've been using lodown_dhs() and it works beautifully, except that two files seem to be anomalous: two Senegal files have an extra level of folders, the one we are expecting plus another 'merged' one whose purpose isn't clear to me. Anyway, the offending files, with their subdirectories in square brackets, are:

ajdamico commented 6 years ago

hi, i don't have access to most of dhs so it's difficult for me to debug. could you debug( lodown:::lodown_dhs ) and see if there's an obvious fix?

majazaloznik commented 6 years ago

Oh dear. I originally encountered this on a Windows machine, which threw an error and exited (I don't have access to that machine now to give details). But on the Ubuntu laptop I'm using now there is no error; all it does is unzip the file. It took me a while to figure out why...

But I think I've got it: the reason is that Linux is case sensitive and Windows isn't. When I unzipped the SNxx7Qxx.zip file, this produced a second SNxx7Qxx.ZIP file, same name, different case, and that was presumably the tripping point.

On Linux the fact that the extensions differ in case means there is no error, but also no extraction of the data files. Inside load_fun() the unzipped_files variable holds c("SNxx7Qxx.ZIP", "SNxx6Rxx.ZIP"), so the checks for .dta and .sav files both return nothing, and nothing happens:

if (any(st <- grepl("\\.sav$", tolower(unzipped_files)))) {...

st is always c(FALSE, FALSE).
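To make this concrete, here is a minimal reproduction of the failing check (the file names are the anonymised ones from above):

# both extension checks fail because the extracted "files" are the
# nested .ZIP archives themselves, not .sav or .dta data files
unzipped_files <- c("SNxx7Qxx.ZIP", "SNxx6Rxx.ZIP")
grepl("\\.sav$", tolower(unzipped_files))
#> [1] FALSE FALSE
grepl("\\.dta$", tolower(unzipped_files))
#> [1] FALSE FALSE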

So what you want is, immediately after the unzipped_files <- line, to first check whether unzipped_files contains a zip file with the same name as before, and if so, unzip it again and then continue as usual, checking for .sav or .dta. But this only works because the top-level .zip extension is lower case and the nested one is upper case; it would not work on a case-insensitive OS...

These two Senegal files are meant to be the only exception - I checked: https://userforum.dhsprogram.com/index.php?t=msg&th=6644&#msg_13768. Still, the reasoning is completely inexplicable to me: they are merging two surveys row-wise, both of which you can download individually..

Anyway, that's all I've got the skills to uncover, I'm afraid, but I'm happy to test (on Linux) ;)

ajdamico commented 6 years ago

thanks for looking at this so carefully! does the ignore.case=TRUE that i just added solve the issue?

majazaloznik commented 6 years ago

No, I'm afraid that doesn't solve it, but this would: after line 172, insert:

# if unzipping produced a nested, upper-case .ZIP archive (the two
# anomalous Senegal files), extract that one too before looking for
# the .sav / .dta data files
if (any(st <- grepl(paste0(catalog[i, "filecode"], ".ZIP"), unzipped_files))) {
  unzipped_files <- unzip_warn_fail(paste0(catalog[i, "output_folder"], "/",
                                           catalog[i, "filecode"], ".ZIP"),
                                    exdir = catalog[i, "output_folder"])
}

This checks whether there is a ZIP file with a capitalised extension, which can only happen if one of the two offending Senegal files has just been unzipped, and not otherwise. If found, the nested archive is unzipped and its contents overwrite unzipped_files; then everything continues as before.

This will still not work on Windows, where the first call to unzip_warn_fail() already exits.

ajdamico commented 6 years ago

does this commit solve it? thanks

majazaloznik commented 6 years ago

Close, but not quite. [See, I was only trying to unzip the one file, the one in the catalog; the second one is redundant anyway.]

But fair enough to extract everything! (Although then you have to rethink the whole catalog setup: the number of files will no longer match the number of rows.)

Anyway, there's still a problem: you now end up with two .dta or .sav files, so the st variable has two TRUE instances.

And that means you're passing two files to haven::read_*:

x <- data.frame(haven::read_dta(unzipped_files[which(st)]))
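One way around it, sketched under the assumption that the catalog's filecode appears in the matching data file's name (this_dta and the indexing are illustrative, not the actual source):

# among the extracted files, keep only the .dta whose name matches
# this catalog entry, ignoring case, so the merged file is skipped
st <- grepl("\\.dta$", tolower(unzipped_files))
this_dta <- unzipped_files[st][grepl(catalog[i, "filecode"],
                                     basename(unzipped_files[st]),
                                     ignore.case = TRUE)]
x <- data.frame(haven::read_dta(this_dta[1]))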

ajdamico commented 6 years ago

if https://github.com/ajdamico/lodown/commit/bca20c3087fe3bdd3ea96c8f0f0281eb81365aee doesn't solve it, could you submit a pull request that does? thanks


ajdamico commented 6 years ago

lots of edits in the past few days. i tried your account as well and everything worked until this point, which seemed like a file that ought to be skipped (my server has 700gb of ram):

x <- haven::read_dta("C:/Users/anthonyd/AppData/Local/Temp/15/Rtmpa4XjCw/Nepal/NPOD01FL.DTA")
Error in df_parse_dta_file(spec, encoding) : 
  Failed to parse C:/Users/anthonyd/AppData/Local/Temp/15/Rtmpa4XjCw/Nepal/NPOD01FL.DTA: Unable to allocate memory.
majazaloznik commented 6 years ago

weird, it works with foreign::read.dta, 5277 observations of 48 variables. i don't have access to stata here to check anything more now, will try tomorrow.
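for the record, this is roughly what i ran (file path assumed):

# haven fails on this file, but the foreign package reads it fine
x <- foreign::read.dta("NPOD01FL.DTA")
dim(x)
#> [1] 5277   48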


majazaloznik commented 6 years ago

not sure how helpful this is:

so the NPOD01FL.DTA file was saved in stata version 6 [dta version 108], while according to their documentation haven::read_dta only supports stata versions 8 and later.

this could explain what happened: the file opens with foreign::read.dta, and the other Nepal file from the same year is saved as stata v8|9 [dta 113], which is why that one works with haven::read_dta.
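if anyone wants to check a file without stata: for pre-117 .dta formats the first byte of the file is the dta format version, so a quick peek is enough (file path assumed, as above):

# the first byte of an old-format .dta file holds the dta version
# (108 = stata 6, 113 = stata 8/9, 114 = stata 10/11)
dta_version <- as.integer(readBin("NPOD01FL.DTA", what = "raw", n = 1L))
dta_version
#> [1] 108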

still: unfortunately i don't have access to any other stata 6 [dta 108] files to test, and even the older dhs files i tried all seem to be v8|9 or later. this may well be an issue to raise with haven, but for you it's probably easiest to hardcode this file to use foreign::read.dta instead (although mind you, the conversion to data.frames is not identical...)
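a hedged sketch of that hardcoded fallback (this_dta stands in for the extracted file path; note the two readers do not produce identical data.frames):

# try haven first; if it cannot parse the old stata 6 format,
# fall back to foreign::read.dta, which reads dta version 108
x <- tryCatch(
  data.frame(haven::read_dta(this_dta)),
  error = function(e) foreign::read.dta(this_dta, convert.factors = FALSE)
)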
