aaowens / PSID.jl

Quickly assemble data from the Panel Study of Income Dynamics (PSID)
MIT License

Update for 2021 wave #47

Closed alfredjmduncan closed 1 month ago

alfredjmduncan commented 8 months ago

There are a few small changes required to support the 2021 wave, and also to support some changes to aggregated quantities in other recent waves.

PSID have this week restored the link https://simba.isr.umich.edu/downloads/PSIDCodebook.zip (referred to in the readme) from which you can now again download the latest xml codebook.

I think there are just two lines in the code that need updating (in addition to the hashes for recent waves):

https://github.com/aaowens/PSID.jl/blob/95920cc2414c2943e7aa061e269c5dbf0c877eec/src/construct_alldata.jl#L22

https://github.com/aaowens/PSID.jl/blob/95920cc2414c2943e7aa061e269c5dbf0c877eec/src/unzip_data.jl#L76

I haven't tested these changes yet but was hoping to do so next week and submit a PR if that would be helpful.

alfredjmduncan commented 8 months ago

Update: the suggested changes above mostly work, with one issue. The new PSIDCodebook.xml file posted by PSID doesn't fully match their psid.xlsx file. The following entries in psid.xlsx have no corresponding .xml entry (though I believe they are in the datasets, and PSIDCodebook.xml does refer to these variables in other entries). Their absence results in an error when constructing datasets that use these variables.

Wave    Expenditure   Consumption
Y1999   ER16515D7
Y2001   ER20456D7
Y2003   ER24138D7
Y2005   ER28037E4
Y2007   ER41027E4
Y2009   ER46971E4
Y2011   ER52395E4
Y2013   ER58212E4
Y2015   ER65448B
Y2017   ER71527B      ER71527C

I'm not sure what the best way to deal with this is. For now, I've just removed these entries from my own copy of psid.xlsx, but that's not ideal. An alternative would be for PSID.jl to construct minimal codebook entries for any variable that is in psid.xlsx and in the data files but not in PSIDCodebook.xml. Another option would be for PSID.jl to drop any unmatched variable codes from psid.xlsx and throw a warning rather than an error.
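To make the "drop and warn" option concrete, here is a minimal sketch in Python. It is not PSID.jl code: the function name `drop_unmatched` and the flat lists of codes are assumptions for illustration, and the two codes are only examples taken from this thread.

```python
import warnings


def drop_unmatched(crosswalk_codes, codebook_codes):
    """Keep only crosswalk codes that have a codebook entry; warn about the rest."""
    known = set(codebook_codes)
    matched = [code for code in crosswalk_codes if code in known]
    unmatched = [code for code in crosswalk_codes if code not in known]
    if unmatched:
        # Warn instead of raising, so dataset construction can continue.
        warnings.warn(
            "No codebook entry for: " + ", ".join(unmatched) + "; dropping them."
        )
    return matched, unmatched


# Illustrative inputs (assumed): one code with a codebook entry, one without.
crosswalk = ["ER77587", "ER16515D7"]
codebook = ["ER77587"]
matched, unmatched = drop_unmatched(crosswalk, codebook)
```

Returning the unmatched codes alongside the matched ones lets the caller report exactly which variables were left out of the final dataset.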

aaowens commented 8 months ago

Interesting. If you could submit some code on how it might construct a minimal codebook, I could try it. Otherwise dropping the unmatched variable codes would probably be simplest.

I'll try to look at this in the next week

aaowens commented 8 months ago

The changes are here and seem to work fine: https://github.com/aaowens/PSID.jl/pull/48

alfredjmduncan commented 8 months ago

Thanks, Andrew. I've tested this issue on the 2021 branch. I think you might be using older vintages of fam1999er.zip through fam2017er.zip. Once those are updated, including the following in extractdata.json will reproduce the error I'm getting. All my hashes line up with yours apart from those waves.

{
    "name_user": "expconsumption",
    "varID": "ER77588",
    "unit": "family"
},
{
    "name_user": "exptotal",
    "varID": "ER77587",
    "unit": "family"
},

The problem seems to be that these total spend and total consumption variables were recently added to those 99-17 waves. But these variables don't have full entries in the .xml codebook. I agree dropping unmatched variable codes (with a warning) is the simplest solution.
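For anyone comparing vintages of the fam zip files mentioned above, the sketch below hashes local files and lists those whose digest differs from a recorded value. It is written in Python and assumes SHA-256; which hash function PSID.jl actually uses is not stated in this thread, and the helper names `file_sha256` and `changed_files` are hypothetical.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(paths, expected):
    """List the paths whose current digest differs from the recorded one.

    `expected` maps each path to its recorded hex digest; a path with no
    recorded digest is reported as changed.
    """
    return [path for path in paths if file_sha256(path) != expected.get(path)]
```

Calling `changed_files` with the local zip paths and a dict of recorded digests returns the files that would need their hashes updated.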

alfredjmduncan commented 8 months ago

I've got an example running here, where additional codebook entries are constructed from the entry in the JSON user input. For example, it can construct the missing codebook entry for ER16515D7 based on the existing codebook entry for ER77587, which it takes from the user input JSON file, and it takes the year from psid.xlsx. I'm not sure it's a better solution than just dropping those values and throwing a warning. It certainly isn't efficient, but it works pretty well for my use case.

I guess the biggest potential problem with this approach is that the coding of missing values could differ in the waves without a codebook entry compared with the waves with a codebook entry for a given variable.

Example user input JSON here: examples/missing_cb_entries.json
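The idea of cloning an existing codebook entry can be sketched as follows. This is a Python illustration, not the code from the linked example: the dict layout, the field names (`code`, `year`, `label`, `constructed`) and the function name `fill_missing_entry` are all assumptions, and the label text is a placeholder.

```python
import copy


def fill_missing_entry(codebook, reference_code, missing_code, year):
    """Clone the reference variable's entry for a code that has none."""
    if missing_code in codebook:
        return codebook[missing_code]
    # Deep copy so later edits do not alter the reference entry.
    entry = copy.deepcopy(codebook[reference_code])
    entry["code"] = missing_code
    entry["year"] = year
    # Flag the entry: its missing-value coding is inherited, not verified.
    entry["constructed"] = True
    codebook[missing_code] = entry
    return entry


# Hypothetical codebook holding only the reference variable.
codebook = {"ER77587": {"code": "ER77587", "label": "placeholder label"}}

# 1999 is the wave listed for ER16515D7 earlier in this thread.
new_entry = fill_missing_entry(codebook, "ER77587", "ER16515D7", 1999)
```

The `constructed` flag is one way to keep track of entries whose missing-value codes may differ from the reference wave, which is the main risk of this approach.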

aaowens commented 4 months ago

Looking at this again, it looks like I need to

  1. Download the updated famXXXX.zip files and update the hashes in the code
  2. Think about adding your code to reconstruct the missing codebook entries

economoser commented 1 month ago

Hi Andrew, just chiming in here to say that an update to include the latest PSID data would be awesome and that I'd personally be fine without those problematic variables. Thank you very much for your great work!

aaowens commented 1 month ago

Yes, I forgot about this. I made the changes on https://github.com/aaowens/PSID.jl/pull/48 and will merge it soon

aaowens commented 1 month ago

Should be good now, I tagged a new version