OlivierBinette closed this 8 months ago
I pushed another tweak commit, can you take a look at that and see if it's acceptable?
@NickCrews Looks good! Thanks for catching and fixing the last issues.
@NickCrews Actually there are two potential issues:
I think the "gender" label is over the birth year data.
Oops, yes will fix.
For MSR.csv, how did you deal with birth dates that had "--" in them, like in "1861----"?
See the new docstrings
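For readers without the docstrings at hand, here is one possible way to handle values like "1861----". This is only a sketch, not necessarily what the PR does. It assumes the field is an 8-character YYYYMMDD string where unknown parts are filled with dashes. The names `PartialDate` and `parse_partial_date` are hypothetical.

```python
from typing import NamedTuple, Optional


class PartialDate(NamedTuple):
    """A date whose month and/or day may be unknown (hypothetical helper type)."""

    year: Optional[int]
    month: Optional[int]
    day: Optional[int]


def parse_partial_date(raw: str) -> PartialDate:
    """Split a YYYYMMDD string, mapping dash-filled parts to None.

    Assumption: missing components are written as dashes, as in "1861----".
    """
    raw = raw.strip()
    if len(raw) != 8:
        raise ValueError(f"expected 8 characters, got {raw!r}")

    def to_int(part: str) -> Optional[int]:
        # Anything that is not purely decimal digits is treated as missing.
        return int(part) if part.isdecimal() else None

    return PartialDate(to_int(raw[0:4]), to_int(raw[4:6]), to_int(raw[6:8]))
```

Keeping year, month, and day as separate nullable fields avoids inventing a day or month that the source record never had.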
@OlivierBinette I was looking at the NBER website for the raw data sources. I'm curious: the CEN data you originally gave only has birth year and month, but the NBER codebook says it should also have birth day. Did you derive these CSVs directly from the raw CSVs they provide for download, or were these left over from some other research analysis you did? I would love to make this data reproducible right from the raw data sources, and am wondering if it would be possible to have a notebook that has the conversion code in it. Users wouldn't ever see it, but it would be there for traceability.
@OlivierBinette take a look at the fixups I did from the original CSVs you had in your first commit (in the included notebook).
@NickCrews I can rework things and provide a processing script.
Do we have a limit on dataset size, or should the dataset be put elsewhere if it ends up being too big?
That would be awesome if we had something reproducible! Thanks!
I'm mostly concerned with the git history size. I think below 10 MB we can just commit it. Larger than that and store separately? Idk, maybe with Git LFS? Never used it though.
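As a rough illustration of that rule of thumb, a check like the following could flag files that are too big to commit directly. It is a sketch only: the 10 MB cutoff is the tentative number from this thread, and `storage_recommendation` is a hypothetical helper, not something in the repo.

```python
from pathlib import Path

# Tentative threshold from the discussion; treated here as 10 * 1024 * 1024 bytes.
DEFAULT_LIMIT_BYTES = 10 * 1024 * 1024


def storage_recommendation(path: Path, limit_bytes: int = DEFAULT_LIMIT_BYTES) -> str:
    """Return "commit" for files under the limit, else "store separately"."""
    size = Path(path).stat().st_size  # file size in bytes
    return "commit" if size < limit_bytes else "store separately"
```

Something like this could run as a pre-commit or CI check over the data directory.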
I'm also fine with storing the data as Parquet to get better compression. I don't think it needs to be human-inspectable without tools. Thoughts there?
Created the separate PR #33 for RLData. I will reopen a PR for the Union Army dataset later.
Accidentally closed #24 ...
Addressed all comments except for this open one: https://github.com/NickCrews/mismo/pull/24#discussion_r1452646631