ajdamico / lodown

locally download and prepare publicly-available microdata
GNU General Public License v3.0

Error importing basic monthly CPS: "the_result[i, "start_position"] == the_result[i - 1, "end_position"] + .... is not TRUE" #138

Closed by ernietedeschi 6 years ago

ernietedeschi commented 6 years ago

Not doing anything fancy here:

library(lodown)
cpsbasic_cat <-
        get_catalog( "cpsbasic" ,
                     output_dir = file.path( path.expand( "~" ) , "CPSBASIC" ) )

cpsbasic_cat <- subset( cpsbasic_cat , year == 2017 & month == 4 )
cpsbasic_cat <- lodown( "cpsbasic" , cpsbasic_cat )

Here's the error:

Error: the_result[i, "start_position"] == the_result[i - 1, "end_position"] +  .... is not TRUE
   year month                                                                                    dd version                                                            full_url
13 2017     4 https://thedataweb.rm.census.gov/pub/cps/basic/201701-/January_2017_Record_Layout.txt 201501- https://thedataweb.rm.census.gov/pub/cps/basic/201501-/apr17pub.zip
                                       output_filename case_count
13 /Users/ernietedeschi/CPSBASIC/2017 04 cps basic.rds         NA
ajdamico commented 6 years ago

i'm not able to reproduce this, but i believe you :) could you make sure you have the latest version of lodown, and then tell me why this is breaking?

library(lodown)
cpsbasic_cat <-
        get_catalog( "cpsbasic" ,
                     output_dir = file.path( path.expand( "~" ) , "CPSBASIC" ) )

debug(lodown:::cps_dd_parser)
lodown:::cps_dd_parser( subset( cpsbasic_cat , year == 2017 & month == 4 )$dd )
ernietedeschi commented 6 years ago

Looks like it's getting hung up on three variables in the_result: PEDISREM, PEDISOUT, and further down, PECERT3. You can see below that for those three, the condition start_position[i] == end_position[i-1] + 1 doesn't hold: each one starts four columns later than expected (e.g. PEDISREM starts at 910, but PXCOHAB ends at 905).

varname | width | start_position | end_position | divisor
PXCOHAB | 2 | 904 | 905 | 1.00E+00
PEDISREM | 2 | 910 | 911 | 1.00E+00
PEDISOUT | 2 | 916 | 917 | 1.00E+00
PRDISFLG | 2 | 918 | 919 | 1.00E+00
...
...

PTNMEMP2 | 2 | 942 | 943 | 1.00E+00
PECERT3 | 2 | 948 | 949 | 1.00E+00
PXCERT1 | 2 | 950 | 951 | 1.00E+00
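
The contiguity condition from the error message can be checked directly on a layout table. The following is a minimal sketch, assuming a data frame shaped like the_result and containing only the first four rows of the table above; expected_start, gap_rows, and columns_skipped are illustrative names, not part of lodown.

```r
# first four rows copied from the table above
the_result <- data.frame(
    varname = c( "PXCOHAB" , "PEDISREM" , "PEDISOUT" , "PRDISFLG" ) ,
    width = c( 2 , 2 , 2 , 2 ) ,
    start_position = c( 904 , 910 , 916 , 918 ) ,
    end_position = c( 905 , 911 , 917 , 919 ) ,
    stringsAsFactors = FALSE
)

# each row should start one column after the previous row ends;
# the first row has no predecessor, so its expected start is NA
expected_start <- c( NA , head( the_result$end_position , -1 ) + 1 )

# rows whose start does not follow directly from the previous end
# (which() drops the NA comparison for the first row)
gap_rows <- which( the_result$start_position != expected_start )

# how many columns are unaccounted for before each row
the_result$columns_skipped <- the_result$start_position - expected_start

the_result[ gap_rows , c( "varname" , "start_position" , "columns_skipped" ) ]
```

On these four rows the check flags PEDISREM and PEDISOUT, each preceded by four unaccounted-for columns, which is consistent with two 2-wide variables missing before each of them.
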
ajdamico commented 6 years ago

great! i'm sure this has something to do with the three-dot special character (the ellipsis, …) in

PEDISEAR        2   IS…DEAF OR DOES…HAVE SERIOUS                    906 - 907

in the data dictionary. http://ceprdata.org/wp-content/cps/CPS_Basic_Data_Dictionary_2015.txt

could you figure out why the columns in the data dictionary are being wiped out on your machine, and what change we could make so they're maintained?
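
One possible reason a three-dot character behaves differently across machines is encoding. In Windows-1252 the ellipsis is the single byte 0x85, and the same byte read as Latin-1 becomes the control character U+0085. The sketch below only illustrates that difference; it assumes the data dictionary is stored as Windows-1252, which the thread does not confirm, and raw_line is a made-up fragment rather than a line read from the file.

```r
# bytes for "IS" + 0x85 + "DEAF", as they might appear in a Windows-1252 file
raw_line <- rawToChar( as.raw( c( 0x49 , 0x53 , 0x85 , 0x44 , 0x45 , 0x41 , 0x46 ) ) )

# interpret the same bytes under two different source encodings
as_cp1252 <- iconv( raw_line , from = "CP1252" , to = "UTF-8" )
as_latin1 <- iconv( raw_line , from = "latin1" , to = "UTF-8" )

# code point of the third character under each interpretation
cp1252_point <- utf8ToInt( as_cp1252 )[ 3 ]  # 8230, i.e. U+2026 (ellipsis)
latin1_point <- utf8ToInt( as_latin1 )[ 3 ]  # 133, i.e. U+0085 (control character)
```

The encoding names accepted by iconv() vary by platform, so "CP1252" may need to be spelled differently on some systems; iconvlist() shows what is available.
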

ernietedeschi commented 6 years ago

OK. I will dig in. In case it’s relevant, I’m running this on macOS High Sierra.

ernietedeschi commented 6 years ago

Looks like I have to eliminate two more special characters.

the_lines <- gsub("\u0085", "X", the_lines)
the_lines <- gsub("\\u0085", "X", the_lines)
the_lines <- gsub("\\\u0085", "X", the_lines)
the_lines <- gsub("\u0092", "X", the_lines)
the_lines <- gsub("\\u0092", "X", the_lines)
the_lines <- gsub("\\\u0092", "X", the_lines)

the_dd <- gsub("\u0085", "X", the_dd)
the_dd <- gsub("\\u0085", "X", the_dd)
the_dd <- gsub("\\\u0085", "X", the_dd)
the_dd <- gsub("\u0092", "X", the_dd)
the_dd <- gsub("\\u0092", "X", the_dd)
the_dd <- gsub("\\\u0092", "X", the_dd)
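
The twelve substitutions above could probably be collapsed into one character class per vector. This is a sketch under assumptions, not the code that went into the pull request: replace_c1_chars is a hypothetical helper, it assumes the strings are valid UTF-8 by the time they are cleaned, and the second sample line is invented for illustration.

```r
# hypothetical helper, not part of lodown:
# replace U+0085 and U+0092 with a single placeholder character
replace_c1_chars <- function( x , replacement = "X" ) {
    gsub( "[\u{0085}\u{0092}]" , replacement , x )
}

# first line echoes the dictionary entry quoted earlier; second line is made up
sample_lines <- c(
    "IS\u{0085}DEAF OR DOES\u{0085}HAVE SERIOUS" ,
    "PERSON\u{0092}S AGE"
)

cleaned_lines <- replace_c1_chars( sample_lines )

# a one-for-one replacement leaves every other character in the same column
widths_unchanged <- all( nchar( cleaned_lines ) == nchar( sample_lines ) )
```

Swapping each special character for exactly one ordinary character matters for a fixed-width layout, because the start and end positions later in the line are parsed by column.
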
ajdamico commented 6 years ago

nice! could you send a pull request?

ernietedeschi commented 6 years ago

See PR #140

ajdamico commented 6 years ago

thanks a lot