standardize organization of crime data

srcole commented 7 years ago

Currently, the location and filenames of each city's crime data are different. I agree with @MSilb7 in his pull request #56 that a good format would be: [city folder name]/data/[city]-[year]*-crime.csv
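A minimal sketch of how a script could locate files under that layout. The helper name `find_crime_csvs` and its arguments are hypothetical, and the `*` is treated as a plain glob wildcard, since the thread does not say what goes there.

```python
from pathlib import Path


def find_crime_csvs(repo_root, city_folder, city, year):
    """Return sorted paths matching [city folder name]/data/[city]-[year]*-crime.csv.

    Hypothetical helper: the '*' in the proposed pattern is passed straight
    to glob, so anything between the year and '-crime.csv' is accepted.
    """
    data_dir = Path(repo_root) / city_folder / "data"
    return sorted(data_dir.glob(f"{city}-{year}*-crime.csv"))
```

For example, `find_crime_csvs(".", "seattle", "seattle", 2016)` would pick up every 2016 Seattle crime file in `seattle/data/`, whatever suffix follows the year.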

There are also some differences in the contents of the files (e.g. Seattle has 'crime-type' column name instead of 'category' like the others). Once again, drawing from the suggestion in pull request #56, I think we should go with: [Year, Month, Day, Category, Count].
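Here is a rough sketch of what converting a city's raw rows into [Year, Month, Day, Category, Count] could look like. It assumes one input row per incident, an ISO-formatted date column, and that Count means the number of incidents per day and category; none of that is settled in this thread, and the function and column names are placeholders.

```python
import csv
import io
from collections import Counter
from datetime import datetime

STANDARD_COLUMNS = ["Year", "Month", "Day", "Category", "Count"]


def standardize_rows(rows, date_column="date", category_column="category",
                     date_format="%Y-%m-%d"):
    """Aggregate per-incident rows into [Year, Month, Day, Category, Count] rows.

    Assumption: each input row is one incident, so Count is the number of
    rows sharing the same day and category.
    """
    counts = Counter()
    for row in rows:
        day = datetime.strptime(row[date_column], date_format)
        counts[(day.year, day.month, day.day, row[category_column])] += 1
    # Sort by date, then category, so the output order is deterministic.
    return [list(key) + [count] for key, count in sorted(counts.items())]


def write_standard_csv(rows, out_file):
    """Write the standard header followed by the aggregated rows."""
    writer = csv.writer(out_file)
    writer.writerow(STANDARD_COLUMNS)
    writer.writerows(rows)
```

A city like Seattle would only need a different `category_column`, e.g. `standardize_rows(raw_rows, category_column="crime-type")`.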

devinaconley commented 7 years ago

Agreed, I think I've actually been using 'metric'.

On the csv files, it's worth noting that those are only committed to show some sample output. The actual data will be written to a postgres database. Either way, putting them in a data subdirectory will keep things cleaner.

srcole commented 7 years ago

Ah, ok. I wasn't aware that those csv files were just examples and not where we would actually pull the data from. I'm so amateur. I'm not familiar with postgres databases, but now that I've looked up what that is and looked back at the repo more closely, I kinda see what you guys are up to now :).

So my current impression of the plan (please correct any of my mistakes) is that we will run the script run_scrapers.py in scrapers/, which will execute the scraper for each metric (e.g. run_crime_scrapers.py), which in that case will in turn execute the crime scraper for each city (e.g. run_socrata_scrapers.py and run_sandiego_crime_scraper.py). And each of these scrapers will run scrapers.src.PostgresUtils.UpdateTable to update the appropriate table. Is that right?
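To make that call chain concrete, here is a toy sketch of it. Every function below is a hypothetical stand-in named after the scripts mentioned above; the real signature of UpdateTable, the table name "crime", and the sample rows are assumptions, and nothing here touches a database.

```python
def update_table(table, rows, log):
    """Stand-in for scrapers.src.PostgresUtils.UpdateTable.

    The real signature is not shown in this thread; this version only
    records which table would be updated and with how many rows.
    """
    log.append((table, len(rows)))


def run_socrata_scrapers(log):
    # Placeholder row in [Year, Month, Day, Category, Count] order.
    update_table("crime", [(2016, 1, 1, "Theft", 3)], log)


def run_sandiego_crime_scraper(log):
    update_table("crime", [(2016, 1, 1, "Assault", 1)], log)


def run_crime_scrapers(log):
    # One entry per city-level crime scraper.
    for scraper in (run_socrata_scrapers, run_sandiego_crime_scraper):
        scraper(log)


def run_scrapers(log):
    # One entry per metric; only crime is sketched here.
    for metric_runner in (run_crime_scrapers,):
        metric_runner(log)
```

Calling `run_scrapers(log)` with an empty list fills it with one entry per city scraper, which mirrors the top-level script fanning out to metrics and then to cities.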

I was hoping to contribute to organizing this crime data, but I'm not at all familiar with postgres or database management in general. Would it be useful if I made a script for each (non-Socrata) city that creates the csv in the standard format I mentioned above? And then, I hope (?), it would be straightforward for you or someone with knowledge of postgres to morph that script into one that updates the postgres database. Or would that be more trouble than it's worth?

In the script to make those .csvs, I am thinking of creating a standardized system of crime categories (something like the function categorize_crime here), so that each city will have the same crime categories (as I mentioned on Slack).
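A rough sketch of the idea, not the linked function itself: map each city's free-text crime description onto a small shared set of categories by keyword. The category names and keywords below are made up for illustration and would need to be agreed on.

```python
# Hypothetical keyword-to-category mapping; the real list would need to be
# agreed on and checked against each city's actual labels.
CATEGORY_KEYWORDS = {
    "Theft": ("theft", "larceny", "burglary", "robbery"),
    "Assault": ("assault", "battery"),
    "Vandalism": ("vandalism", "graffiti"),
}


def categorize_crime(description, default="Other"):
    """Return the first standard category whose keyword appears in the description."""
    text = description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return default
```

Keeping the mapping in one dictionary means every city's script would share the same categories, and anything unmatched falls into a visible "Other" bucket instead of being silently dropped.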

I also worry that I'm possibly just in way over my head here, so if that's the case, I apologize, and please let me know :)

devinaconley commented 7 years ago

Yep, that's about right. It will be easier to standardize this by changing the actual scraper though.

Will message you in Slack.

Data4Democracy / usa-dashboard

standardize organization of crime data #66