codeforboston / clean-slate-data

MIT License

Decide which Suffolk dataset to use for version 2 analysis #205

Closed agathaalmunir closed 2 years ago

agathaalmunir commented 2 years ago

Clarification between merged_x files in the processed folder vs cleaned_x files in the cleaned folder from Laura:

The merged_X files take the data from the 3 DA offices and merge it with the expungeability information we had. These have been minimally processed/cleaned -- we did things like stripping out non-alphanumeric characters and creating new variables (such as splitting out chapter/section/description into 3 columns from 1 original, or creating an indicator for whether the offense relates to CMR or not). They don't make judgement calls -- no corrections for impossible ages (<0) or implausible ages (<10), and no decisions about whether to drop rows that are missing one of the key dates, how to code up "guilty", etc.
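For anyone new to the thread, here is a rough sketch of what that "minimal processing" could look like. It is not the project's actual code: the function names, the `/` separator, and the example charge string are all assumptions made for illustration, and the real DA files may be laid out differently.

```python
import re


def strip_non_alphanumeric(text):
    """Drop every character that is not a letter, digit, or space."""
    return re.sub(r"[^A-Za-z0-9 ]", "", text).strip()


def split_charge(raw):
    """Split one combined charge string into three columns.

    ASSUMPTION: the original column looks like 'chapter/section/description'.
    The separator and ordering are hypothetical, not taken from the DA data.
    """
    parts = [p.strip() for p in raw.split("/", 2)]
    parts += [""] * (3 - len(parts))  # pad when a piece is missing
    return {
        "chapter": parts[0],
        "section": parts[1],
        "description": strip_non_alphanumeric(parts[2]),
    }


def is_cmr(raw):
    """Indicator for whether the charge text mentions CMR (assumed rule)."""
    return "CMR" in raw.upper()
```

Note that nothing here drops rows or recodes values. That matches the description above: the merged_X step only reshapes and tidies, and leaves the judgement calls for later.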

Those judgement calls are instead made in the analysis files -- things like the notebook "How many expungable Middlesex". They are also made in the Clean_datasets_for_visualizations notebook, which is what generates the clean_X data files. These clean_X files were created specifically to be an easier source for the 'visualizers' that Joel made, or a quicker source for new analysis. The "How many expungeable" series should make most (??) of the same judgement calls as the "Clean datasets for visualizations" notebook, but I'm not entirely sure. They built on each other, and it looks like not much has been updated in a while.

This issue of clean vs processed was an ongoing one -- when I was working on the project we'd inherited some definitions and folders and files, but they hadn't really been adhered to even at that point. And the issue of where to make these 'judgement calls' and then what to call the resulting dataset was never really resolved. I think the bigger issue was getting a smooth process from the "raw" files we got from the DAs into anything that could be used for analysis.

Anyways. You would most likely be safe using the clean_X files. You'd want to check that you still agree with all the assumptions made and variable definitions (particularly 'guilty', and maybe whether an incident is expungeable or not, for example). To force yourself to check those things, you could start with the merged_X files and build from there.
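If you do start from merged_X, a quick audit along these lines might help surface the judgement calls before you commit to any of them. This is only a sketch: the column names (`age`, `offense_date`, `disposition_date`, `disposition`) are hypothetical placeholders, and the age thresholds are simply the ones mentioned earlier in this comment.

```python
from collections import Counter

# Thresholds from the discussion above: <0 is impossible, <10 is implausible.
MIN_PLAUSIBLE_AGE = 10


def audit_rows(rows, date_fields=("offense_date", "disposition_date")):
    """Count rows that would need a judgement call; nothing is dropped or fixed.

    ASSUMPTION: each row is a dict and the field names are hypothetical.
    """
    report = Counter()
    for row in rows:
        age = row.get("age")
        if age is None:
            report["age_missing"] += 1
        elif age < 0:
            report["age_impossible"] += 1
        elif age < MIN_PLAUSIBLE_AGE:
            report["age_implausible"] += 1
        # A row is flagged once if any key date is empty or absent.
        if any(not row.get(field) for field in date_fields):
            report["missing_key_date"] += 1
    return report


def disposition_counts(rows):
    """Tally raw disposition values so the 'guilty' coding can be reviewed."""
    return Counter((row.get("disposition") or "").strip().upper() for row in rows)
```

The idea is to look at the counts first, then decide what to drop or recode, and only then compare against what clean_X already did.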

I think the decision is mainly whether you want to do something ASAP (in which case, clean_X is probably good enough), or try to detangle the data pipeline first, or want to just build off of the existing "How many expungeable" notebooks, which use the merged_X as the key input file.

knod commented 2 years ago

~I'm not sure this document is finished, @seraph776 . Do you feel differently? Otherwise, if you no longer want to be assigned to this issue, you can remove yourself from it as opposed to closing it.~ Sorry, meant this for #201. I'm not sure whether this is completed or not.