NickCrews / mismo

The SQL/Ibis powered sklearn of record linkage
https://nickcrews.github.io/mismo/
GNU Lesser General Public License v3.0

Add RLdata and Union Army datasets #25

Closed: OlivierBinette closed this 8 months ago

OlivierBinette commented 10 months ago

Accidentally closed #24 ...

Addressed all comments except for this open one: https://github.com/NickCrews/mismo/pull/24#discussion_r1452646631

NickCrews commented 10 months ago

I pushed another tweak commit, can you take a look at that and see if it's acceptable?

OlivierBinette commented 10 months ago

@NickCrews Looks good! Thanks for catching and fixing the last issues.

OlivierBinette commented 10 months ago

@NickCrews Actually there are two potential issues:

  1. Could you check the column names in CEN.csv? I think the "gender" label is over the birth year data.
  2. For MSR.csv, how did you deal with birth dates that had "--" in them, like in "1861----"?

NickCrews commented 10 months ago

> I think the "gender" label is over the birth year data.

Oops, yes will fix.

> For MSR.csv, how did you deal with birth dates that had "--" in them, like in "1861----"?

See the new docstrings.
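(For illustration only: the PR's actual handling is documented in the new docstrings. A minimal sketch of how a partially-missing `YYYYMMDD` string like `"1861----"` could be split into its known components; the helper name is hypothetical, not from the PR.)

```python
import re


def parse_partial_date(raw: str) -> dict:
    """Split an 8-character YYYYMMDD string whose missing parts are
    filled with '-' (e.g. '1861----') into year/month/day, using None
    for the unknown components."""
    m = re.fullmatch(r"(\d{4}|-{4})(\d{2}|-{2})(\d{2}|-{2})", raw)
    if m is None:
        raise ValueError(f"unrecognized date string: {raw!r}")
    year, month, day = m.groups()
    return {
        "year": int(year) if "-" not in year else None,
        "month": int(month) if "-" not in month else None,
        "day": int(day) if "-" not in day else None,
    }
```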

@OlivierBinette I was looking at the NBER website for the raw data sources. I'm curious: the CEN data you originally gave only has birth year and month, but the NBER codebook says it should also have birth day? Did you derive these CSVs directly from the raw CSVs they provide for download, or were these left over from some other research analysis you did? I would love to make this data reproducible straight from the raw data sources, and am wondering if it would be possible to have a notebook with the conversion code in it. Users wouldn't ever see it, but it would be there for traceability.

NickCrews commented 10 months ago

@OlivierBinette take a look at the fixups I did from the original CSVs you had in your first commit (in the included notebook).

OlivierBinette commented 10 months ago

@NickCrews I can rework things and provide a processing script.

Do we have a limit on dataset size, or should the dataset be put elsewhere if it ends up being too big?

NickCrews commented 10 months ago

That would be awesome if we had something reproducible! Thanks!

I am mostly concerned with the git history size. I think below 10 MB we can just commit it. Larger than that, and we store it separately? I don't know, maybe with Git LFS? I've never used it though.

I am also fine with storing the data as Parquet to get better compression; I don't think it needs to be human-inspectable without tools. Thoughts there?

OlivierBinette commented 8 months ago

Created the separate PR #33 for RLdata. I will open a new PR for the Union Army dataset later.