Closed trevorcampbell closed 3 years ago
Completely agree about `history.csv`: since it is a dataset arising from merging two of the raw datasets, we are free to choose the most informative name possible.
Not sure I agree about `P0-46360_main.csv`, since it is derived from a single FOIA request and I think the FOIA request number is the least ambiguous way to indicate this (and we have a table in the README describing what each FOIA request contains). Alternatively, we can try to come up with a more informative internal label for each FOIA request and use it consistently (at each step of the process) instead of the FOIA request number. If this is the case, we should add the "translation" between these two identifiers to the table in the README. I am slightly worried about having two systems of unique identifiers for the same thing though...
The way I see the code in this repo is that it performs the transformation
[ uniquely identified FOIA requests but with overlapping / disorganized information] -> [mutually exclusive, semantically meaningful files with clear and useful linkage]
For example: if I'm a user of this dataset, I really don't care which FOIA request something came from. What I do care about is "what does this data file contain?" and "how can I relate the objects in this data file to this other one?". So I would expect us to produce files, one for each unique kind of "object" (officer, TRR, complaint, unit membership, etc) and have them named so that it's obvious what contains what.
[And so far it seems the code does that -- we have one file for our roster, one for complaints, one for unit membership records, one for TRRs, and each with unique IDs that help relate them all. But perhaps I've misunderstood the output files!]
I slightly disagree. To me there is a very important intermediate step, explained in the LaTeX documentation, which consists solely of cleaning up the raw data and turning it into reasonable csv files. Everything we do after this involves some form of subjective judgement, but this step is quite uncontroversial and it can be useful on its own.
So if I try to summarize both your point of view and mine, it seems that a good solution could be:
- keep the FOIA request numbers as filenames in the `parsed` folder (but maybe calling it `clean` would be better);
- use semantic filenames in `linked`, since it contains our own "semantic" interpretation of the data.

I agree it was probably confusing of me to keep using the FOIA request numbers in the `linked` folder. It was tempting because there is a pretty clear correspondence at the moment between FOIA requests and output files, but we could imagine this becoming more blurry in the future, if we start combining with other sources of data. So sticking to semantic filenames seems better in the long run.
I like this solution!
We need to rename some of these output files (`history.csv` should be `unit_membership.csv`, `P0-46360_main.csv` should be `tactical_response_reports.csv`, etc) to be more informative.
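
For illustration only, here is a minimal sketch of how the renaming could be driven from a single mapping, which would also double as the "translation" table between the old and new names. Only the two filename pairs come from this issue; the `RENAMES` dict, the `rename_outputs` helper, and its `dry_run` flag are hypothetical and not part of the repo.

```python
from pathlib import Path

# Hypothetical mapping from current output filenames to semantic ones.
# Only these two pairs are taken from the issue; others would be added here.
RENAMES = {
    "history.csv": "unit_membership.csv",
    "P0-46360_main.csv": "tactical_response_reports.csv",
}


def rename_outputs(folder, renames=RENAMES, dry_run=True):
    """Plan (and optionally perform) the renames inside `folder`.

    Returns the list of (source, destination) paths for files that exist.
    """
    folder = Path(folder)
    planned = []
    for old_name, new_name in renames.items():
        src = folder / old_name
        dst = folder / new_name
        if not src.exists():
            continue  # this output file is not present, nothing to rename
        if dst.exists():
            raise FileExistsError(f"{dst} already exists; refusing to overwrite")
        planned.append((src, dst))
    if not dry_run:
        for src, dst in planned:
            src.rename(dst)
    return planned
```

Calling `rename_outputs("linked")` would only report what would change; passing `dry_run=False` performs the renames. Keeping the mapping in one place means the README table could be generated from it rather than maintained by hand.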