Closed alfredjmduncan closed 1 month ago
Update: the suggested changes above mostly work, with one issue. The new PSIDCodebook.xml file posted by PSID doesn't fully match their psid.xlsx file. The following entries in psid.xlsx have no corresponding .xml entry (though I believe they are in the datasets, and PSIDCodebook.xml does refer to these variables in other entries). Their absence results in an error when constructing datasets that use these variables.
| | Y1999 | Y2001 | Y2003 | Y2005 | Y2007 | Y2009 | Y2011 | Y2013 | Y2015 | Y2017 |
|---|---|---|---|---|---|---|---|---|---|---|
| Consumption | | | | | | | | | | ER71527C |
| Expenditure | ER16515D7 | ER20456D7 | ER24138D7 | ER28037E4 | ER41027E4 | ER46971E4 | ER52395E4 | ER58212E4 | ER65448B | ER71527B |
I'm not sure what the best way to deal with this is. For now, I've just removed these entries from my own copy of psid.xlsx, but that's not ideal. One alternative would be for PSID.jl to construct minimal codebook entries whenever a variable is in psid.xlsx and in the data files but not in PSIDCodebook.xml. Another option would be for PSID.jl to drop any unmatched variable codes from psid.xlsx and throw a warning rather than an error.
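To make the second option concrete, here is a minimal sketch of the "drop and warn" logic. It is written in Python purely as an illustration and is not PSID.jl's actual Julia code; the function name `split_by_codebook` and the idea that the codebook is available as a plain set of variable IDs are assumptions made for the example.

```python
import warnings


def split_by_codebook(codes, codebook_ids):
    """Separate variable codes into those with and without a codebook entry.

    `codes` is a list of variable codes read from the crosswalk (psid.xlsx);
    `codebook_ids` is a set of IDs that have an entry in the XML codebook.
    Both representations are hypothetical simplifications.
    """
    matched = [c for c in codes if c in codebook_ids]
    unmatched = [c for c in codes if c not in codebook_ids]
    if unmatched:
        # Warn once instead of raising, so dataset construction can continue.
        warnings.warn(
            "No codebook entry for: " + ", ".join(unmatched) + "; dropping them."
        )
    return matched, unmatched


if __name__ == "__main__":
    # The codebook set below is made up for illustration only.
    kept, dropped = split_by_codebook(
        ["ER16515D7", "ER71527B", "ER00000X"], {"ER00000X"}
    )
    print("kept:", kept)
    print("dropped:", dropped)
```

Returning the unmatched codes alongside the warning lets the caller log them or list them in a summary, so the user can see exactly which columns are missing from the output.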
Interesting. If you could submit some code on how it might construct a minimal codebook, I could try it. Otherwise dropping the unmatched variable codes would probably be simplest.
I'll try to look at this in the next week
The changes are here and seem to work fine: https://github.com/aaowens/PSID.jl/pull/48
Thanks, Andrew
I've tested this issue on the 2021 branch. I think you might be using older vintages of fam1999er.zip through fam2017er.zip. All my hashes line up with yours aside from those waves. Once those files are updated, including the following entries in extractdata.json will reproduce the error I'm getting:
    {
      "name_user": "expconsumption",
      "varID": "ER77588",
      "unit": "family"
    },
    {
      "name_user": "exptotal",
      "varID": "ER77587",
      "unit": "family"
    },
The problem seems to be that these total expenditure and total consumption variables were recently added to the 1999-2017 waves, but they don't have full entries in the .xml codebook. I agree that dropping unmatched variable codes (with a warning) is the simplest solution.
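For anyone wanting to check which local family files differ from the expected vintages, a comparison of file hashes can be sketched as follows. This is an illustration in Python, not the package's own check: the use of SHA-256, the function names, and the shape of the `expected` mapping are all assumptions, and the real expected hashes would have to come from the package source.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(directory, expected):
    """List file names that are missing or whose hash differs from `expected`.

    `expected` maps a file name (e.g. "fam1999er.zip") to the hex digest
    it should have; this mapping is a hypothetical stand-in.
    """
    stale = []
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.exists() or sha256_of(path) != want:
            stale.append(name)
    return sorted(stale)
```

Calling `mismatched_files(".", expected)` with the digests for fam1999er.zip through fam2017er.zip would then name the waves whose downloads are from a different vintage.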
I've got an example running here, where additional codebook entries are constructed from the entry in the JSON user input. For example, it can construct the missing codebook entry for ER16515D7 based on the existing codebook entry for ER77587, which it takes from the user input JSON file, and it takes the year from psid.xlsx. I'm not sure it's a better solution than just dropping those values and throwing a warning. It certainly isn't efficient. But it works pretty well for my use case.
I guess the biggest potential problem with this approach is that the coding of missing values could differ in the waves without a codebook entry compared with the waves with a codebook entry for a given variable.
Example user input JSON here: examples/missing_cb_entries.json
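Independently of that example file, the idea of building a missing codebook entry from an existing one can be sketched roughly as below. This is a Python illustration, not the code in the linked example: the dictionary layout and the field names `name`, `year`, `label`, `missing_codes`, and `cloned_from` are hypothetical stand-ins for whatever the parsed codebook actually contains.

```python
import copy


def clone_codebook_entry(template, var_id, year):
    """Return a copy of `template` relabelled for `var_id` in wave `year`.

    The missing-value coding is inherited unchanged from the template,
    which is exactly the caveat raised above: it may not match the
    coding actually used in the wave being filled in.
    """
    entry = copy.deepcopy(template)  # leave the real codebook entry untouched
    entry["name"] = var_id
    entry["year"] = year
    # Record provenance so a synthesised entry can be told apart later.
    entry["cloned_from"] = template["name"]
    return entry


# Placeholder template; the label and codes are not real codebook values.
template = {
    "name": "ER77587",
    "label": "placeholder label",
    "missing_codes": ["<codes from the real entry>"],
}

# ER16515D7 sits in the Y1999 column of psid.xlsx.
entry_1999 = clone_codebook_entry(template, "ER16515D7", 1999)
```

Keeping a `cloned_from` marker makes it possible to warn about, or later re-check, every variable whose metadata was inferred rather than read from PSIDCodebook.xml.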
Looking at this again, it looks like I need to
Hi Andrew, just chiming in here to say that an update to include the latest PSID data would be awesome and that I'd personally be fine without those problematic variables. Thank you very much for your great work!
Yes, I forgot about this. I made the changes on https://github.com/aaowens/PSID.jl/pull/48 and will merge it soon
Should be good now, I tagged a new version
There are a few small changes required to support the 2021 wave, and also to support some changes to aggregated quantities in other recent waves.
PSID have this week restored the link https://simba.isr.umich.edu/downloads/PSIDCodebook.zip (referred to in the readme) from which you can now again download the latest xml codebook.
I think there are just two lines in the code that need updating (in addition to the hashes for recent waves):
https://github.com/aaowens/PSID.jl/blob/95920cc2414c2943e7aa061e269c5dbf0c877eec/src/construct_alldata.jl#L22
https://github.com/aaowens/PSID.jl/blob/95920cc2414c2943e7aa061e269c5dbf0c877eec/src/unzip_data.jl#L76
I haven't tested these changes yet but was hoping to do so next week and submit a PR if that would be helpful.