ayyubibrahimi / us-post-data


handling multiple rows for one officer #6

Closed ayyubibrahimi closed 1 month ago

ayyubibrahimi commented 3 months ago
tarakc02 commented 2 months ago

Do we have a preference for how to handle cases where a person's name changed during an otherwise contiguous stint? Here is an example from California:

    person_nbr      full_name                     agcy_name  start_date    end_date
23      230014  JOAN C ARTHUR  025: MULE CREEK STATE PRISON  2005-02-13  2006-04-30
24      230014   JOAN C SMITH  025: MULE CREEK STATE PRISON  2006-05-01         NaN

I feel like we should keep these separate in order to facilitate lookups based on either name. In that case we flatten for each combo of person_nbr + full_name + agcy_name when stints are contiguous, rather than just person_nbr + agcy_name. Will implement that; keeping this comment here in case we want to revisit that decision.
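A minimal pandas sketch of that flattening rule, assuming the column names shown above (the sample rows are invented to mirror the California example, with an extra contiguous ARTHUR stint so the merge is visible):

```python
import pandas as pd

# Invented sample data mirroring the example above: two contiguous
# ARTHUR stints (should collapse to one row) followed by a SMITH
# stint under the same person_nbr (should stay a separate row).
df = pd.DataFrame({
    "person_nbr": ["230014", "230014", "230014"],
    "full_name": ["JOAN C ARTHUR", "JOAN C ARTHUR", "JOAN C SMITH"],
    "agcy_name": ["025: MULE CREEK STATE PRISON"] * 3,
    "start_date": pd.to_datetime(["2005-02-13", "2006-05-01", "2007-01-01"]),
    "end_date": pd.to_datetime(["2006-04-30", "2006-12-31", None]),
})

keys = ["person_nbr", "full_name", "agcy_name"]
df = df.sort_values(keys + ["start_date"]).reset_index(drop=True)

# A new stint begins when the key combo changes (shift yields NaT at
# each group's first row) or when there is a gap of more than one day
# after the previous row's end_date.
prev_end = df.groupby(keys)["end_date"].shift()
gap = (df["start_date"] - prev_end).dt.days > 1
df["stint"] = (prev_end.isna() | gap).cumsum()

flat = df.groupby(keys + ["stint"], as_index=False).agg(
    start_date=("start_date", "min"),
    end_date=("end_date", "max"),
).drop(columns="stint")
```

An open NaT end_date survives the `max` aggregation as NaT when it is the only value in its group, so still-active stints keep their missing end date.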

tarakc02 commented 1 month ago

Hi @ayyubibrahimi, saw you were still waiting for this, so wanted to flag that I closed this issue 2 weeks ago, in this commit. However, I see that code has disappeared from the main branch, maybe swallowed up by some merge commit or reorg? Not sure, but I'll reopen for now. Next question: which agencies do we apply this code to? I think we can just apply it to every agency that gets processed; it should work as long as the table has the key fields person_nbr, full_name, agcy_name, start_date, and end_date.
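That "as long as the table has the key fields" assumption could be guarded with a small check before flattening each agency table; this is just a sketch, and the `check_key_fields` helper name is invented:

```python
import pandas as pd

# The five fields the flattening code relies on, per the discussion above.
REQUIRED = {"person_nbr", "full_name", "agcy_name", "start_date", "end_date"}

def check_key_fields(df: pd.DataFrame, agency: str) -> None:
    """Raise with a clear message if a table lacks any required field."""
    missing = REQUIRED - set(df.columns)
    if missing:
        raise ValueError(f"{agency}: missing key fields {sorted(missing)}")

# A table with all five fields passes silently.
ok = pd.DataFrame(columns=sorted(REQUIRED))
check_key_fields(ok, "example-agency")
```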

ayyubibrahimi commented 1 month ago

Ah, weird, and sorry for missing this!

Yes, I think we should apply this to every preprocessing script. There is currently a directory preprocess/preprocess/[state]/data/[output] that contains a table for every state that I've downloaded from the BLN repo in order to append columns like race.

I've added links to each of these preprocessed tables in the normalize directory. If you cd into normalize/download, there is a Makefile that downloads each of these preprocessed CSVs. Could you add the code for this task to that normalize directory?

Lmk if any of this doesn't make sense. Thank you :)

tarakc02 commented 1 month ago

Can you take a look and see if ^ that looks right to you? I was able to run it locally and all files built. There's a bit of acrobatics dealing with unexpected dates, and I implemented the logic you described to take the most recent value for the other fields. Do we filter anons upstream (#15)? Or should that happen during normalize?
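The "most recent value for the other fields" logic could be sketched like this, assuming rows are ordered by start_date; the `rank` column is invented for illustration:

```python
import pandas as pd

# Invented example: one officer with two stints and a non-key field
# (rank) that changed between them.
df = pd.DataFrame({
    "person_nbr": ["230014", "230014"],
    "rank": ["OFFICER", "SERGEANT"],
    "start_date": pd.to_datetime(["2005-02-13", "2006-05-01"]),
})

# Sort by start_date so "last" means "most recent", then keep the
# latest value of every non-key column per person.
latest = (
    df.sort_values("start_date")
      .groupby("person_nbr", as_index=False)
      .last()
)
```

Note that `GroupBy.last()` skips missing values, so a NaN in the most recent row falls back to the previous non-null value; if that's not wanted, `.tail(1)` per group keeps the literal last row instead.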

ayyubibrahimi commented 1 month ago

This looks right to me! I say we filter anons during the normalize stage.

tarakc02 commented 1 month ago

Great thanks! Closing with an added reference to #15, will take a look at that one now.