Filenames - Githubissues

trevorcampbell commented 3 years ago

We need to rename some of these output files (history.csv should be unit_membership.csv, P0-46360_main.csv should be tactical_response_reports.csv, etc.) to be more informative.

Thibauth commented 3 years ago

Completely agree about history.csv, and since it is a dataset arising from merging two of the raw datasets, we are free to choose the most informative name possible.

Not sure I agree about P0-46360_main.csv since it is derived from a single FOIA request and I think the FOIA request number is the least ambiguous way to indicate this (and we have a table in the README describing what each FOIA request contains). Alternatively, we can try to come up with a more informative internal label for each FOIA request and use it consistently (at each step of the process) instead of the FOIA request number. If this is the case, we should add the "translation" between these two identifiers to the table in the README. I am slightly worried about having two systems of unique identifiers for the same thing though...
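To make the "two systems of identifiers" worry concrete, here is a minimal Python sketch of keeping the translation in one place and checking that it stays one-to-one. It is only an illustration: `FOIA_TO_LABEL`, `label_for`, and `foia_for` are hypothetical names, not code from this repository, and it assumes the FOIA request number is the part of the filename before `_main`. The single entry reuses the label proposed above.

```python
# Hypothetical single source of truth for the FOIA number <-> label translation.
# Only this pair comes from the thread; the rest would mirror the README table.
FOIA_TO_LABEL = {
    "P0-46360": "tactical_response_reports",
}

# Build the reverse mapping and fail loudly if two FOIA numbers share a label,
# so the two identifier systems can never silently disagree.
LABEL_TO_FOIA = {label: foia for foia, label in FOIA_TO_LABEL.items()}
if len(LABEL_TO_FOIA) != len(FOIA_TO_LABEL):
    raise ValueError("internal labels must be unique across FOIA requests")


def label_for(foia_number: str) -> str:
    """Translate a FOIA request number into its internal label."""
    return FOIA_TO_LABEL[foia_number]


def foia_for(label: str) -> str:
    """Translate an internal label back into its FOIA request number."""
    return LABEL_TO_FOIA[label]
```

With a table like this, the README "translation" column could be generated from the same dictionary rather than maintained by hand.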

trevorcampbell commented 3 years ago

The way I see the code in this repo is that it performs the transformation

[ uniquely identified FOIA requests but with overlapping / disorganized information] -> [mutually exclusive, semantically meaningful files with clear and useful linkage]

For example: if I'm a user of this dataset, I really don't care which FOIA request something came from. What I do care about is "what does this data file contain?" and "how can I relate the objects in this data file to this other one?". So I would expect us to produce files, one for each unique kind of "object" (officer, TRR, complaint, unit membership, etc) and have them named so that it's obvious what contains what.

[And so far it seems the code does that -- we have one file for our roster, one for complaints, one for unit membership records, one for TRRs, and each with unique IDs that help relate them all. But perhaps I've misunderstood the output files!]

Thibauth commented 3 years ago

I slightly disagree. To me there is a very important intermediate step, explained in the LaTeX documentation, which consists solely of cleaning up the raw data and turning it into reasonable csv files. Everything we do after this involves some form of subjective judgement, but this step is quite uncontroversial and it can be useful on its own.

So if I try to summarize both your point of view and mine, it seems that a good solution could be

- keep using the FOIA request numbers for filenames at this intermediate cleaning step. At the moment, the output of this step is placed in the parsed folder, but maybe calling it clean would be better.
- start using "semantic" names for everything after this, that is, for the folder currently called linked, since it contains our own "semantic" interpretation of the data.
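The two-step naming scheme above could be sketched roughly as follows in Python. This is an assumption-laden illustration, not the repository's code: `SEMANTIC_NAMES` and `output_path` are hypothetical, the folder names `parsed` and `linked` are the ones mentioned in this comment, and the only filename pair is the one given earlier in the thread.

```python
from pathlib import Path

# Hypothetical mapping from FOIA-based file stems (cleaning step) to semantic
# stems (linking step). Only this pair is taken from the thread.
SEMANTIC_NAMES = {
    "P0-46360_main": "tactical_response_reports",
}


def output_path(root: Path, stage: str, foia_stem: str) -> Path:
    """Return where a file should be written for the given pipeline stage."""
    if stage == "parsed":
        # Cleaning step: keep the FOIA request number in the filename.
        return root / "parsed" / f"{foia_stem}.csv"
    if stage == "linked":
        # Linking step: switch to the semantic name.
        return root / "linked" / f"{SEMANTIC_NAMES[foia_stem]}.csv"
    raise ValueError(f"unknown stage: {stage!r}")
```

Files that merge several raw datasets, such as the proposed unit_membership.csv, have no single FOIA stem, so in a scheme like this they would be named directly in the linked folder.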

I agree it was probably confusing of me to keep using the FOIA request numbers in the linked folder. It was tempting because there is a pretty clear correspondence at the moment between FOIA requests and output files, but we could imagine this becoming more blurry in the future, if we start combining with other sources of data. So sticking to semantic filenames seems better in the long run.

trevorcampbell commented 3 years ago

I like this solution!

chicago-police-violence / data

Filenames #4