Open srcole opened 7 years ago
Agreed, I think I've actually been using 'metric'.
On the csv files, it's worth noting that those are only committed to show some sample output. The actual data will be written to a Postgres database. Either way, putting them in a data subdirectory will keep things cleaner.
Ah, OK. I wasn't aware that those csv files were just examples and not where we would actually pull the data from. I'm such an amateur. I'm not familiar with Postgres databases, but now that I've looked up what that is and looked back at the repo more closely, I kind of see what you guys are up to now :).
So my current impression of the plan (please correct any of my mistakes) is that we will run the script run_scrapers.py in scrapers/, which will execute the scraper for each metric (e.g. run_crime_scrapers.py), which in that case will in turn execute the crime scraper for each city (e.g. run_socrata_scrapers.py and run_sandiego_crime_scraper.py). And each of these scrapers will run scrapers.src.PostgresUtils.UpdateTable to update the appropriate table. Is that right?
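To make that chain concrete, here is a rough sketch of the dispatch order described above. Everything in it is a hypothetical stand-in: the registry, the scraper functions, and the `update_table` callback are assumptions for illustration only, and the real signature of scrapers.src.PostgresUtils.UpdateTable is not shown in this thread.

```python
def run_scrapers(scrapers_by_metric, update_table):
    """Run every city scraper for every metric, then hand the rows to a table updater.

    scrapers_by_metric: dict mapping a metric name (e.g. "crime") to a list of
        zero-argument callables, one per city or data source (hypothetical).
    update_table: callable(metric, rows); a placeholder for whatever actually
        writes to the database (assumed interface, not the repo's real one).
    """
    for metric, city_scrapers in scrapers_by_metric.items():
        for scraper in city_scrapers:
            rows = scraper()             # each scraper returns a list of rows
            update_table(metric, rows)   # one update per scraper run


# Hypothetical scrapers that return hard-coded rows instead of hitting the network.
def fake_socrata_scraper():
    return [(2017, 1, 1, "Theft", 3)]


def fake_sandiego_scraper():
    return [(2017, 1, 1, "Assault", 1)]


# In-memory stand-in for the database so the sketch runs without Postgres.
tables = {}


def fake_update_table(metric, rows):
    tables.setdefault(metric, []).extend(rows)


run_scrapers({"crime": [fake_socrata_scraper, fake_sandiego_scraper]}, fake_update_table)
```

Passing the updater in as a callback is only one way to arrange this; it keeps the sketch runnable without a database connection.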
I was hoping to contribute to organizing this crime data, but I'm not at all familiar with postgres or database management in general. Would it be useful if I made a script for each (non-Socrata) city that creates the csv to the standard I mentioned above? And then, I hope (?) that it would be straightforward for you or someone with knowledge of postgres to morph that script into one that will update the postgres database. Or would that be more trouble than it's worth?
In the script to make those .csvs, I am thinking of creating a standardized system of crime categories (something like the function categorize_crime here), so that each city will have the same crime categories (as I mentioned on Slack).
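As a rough illustration of that kind of standardization, the sketch below maps each city's raw crime label onto a shared category by keyword. The category names and keywords are made-up assumptions, and this is not the linked categorize_crime function, only one possible shape for it.

```python
# Hypothetical keyword map: the categories and keywords are illustrative only.
# More specific categories come first, because "auto theft" also contains "theft".
CATEGORY_KEYWORDS = {
    "Vehicle Theft": ("vehicle theft", "auto theft", "stolen vehicle"),
    "Burglary": ("burglary", "breaking and entering"),
    "Assault": ("assault", "battery"),
    "Theft": ("theft", "larceny", "shoplift"),
}


def categorize_crime(raw_label):
    """Return a standardized category for a city's raw crime label."""
    label = raw_label.strip().lower()
    # Dicts preserve insertion order (Python 3.7+), so the first match wins.
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in label for keyword in keywords):
            return category
    return "Other"  # anything unrecognized falls into a catch-all bucket
```

For example, `categorize_crime("AUTO THEFT")` and `categorize_crime("Stolen Vehicle")` would both land in the same bucket, whatever a given city happens to call it.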
I also worry that I'm possibly just in way over my head here, so if that's the case, I apologize, and please let me know :)
Yep, that's about right. It will be easier to standardize this by changing the actual scraper though.
Will message you in Slack.
Currently, the locations and filenames of each city's crime data are different. I agree with @MSilb7 in his pull request #56 that a good format would be: [city folder name]/data/[city]-[year]*-crime.csv
There are also some differences in the contents of the files (e.g. Seattle has 'crime-type' column name instead of 'category' like the others). Once again, drawing from the suggestion in pull request #56, I think we should go with: [Year, Month, Day, Category, Count].
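A minimal sketch of producing that layout, assuming each city's raw data can be reduced to (date, category) pairs. The function names are hypothetical, and the filename helper covers only the simplest case of the pattern (one year, nothing in place of the `*`).

```python
import csv
from collections import Counter

# Column order proposed in this thread.
COLUMNS = ["Year", "Month", "Day", "Category", "Count"]


def standard_filename(city, year):
    """Simplest reading of [city]-[year]*-crime.csv (assumes nothing fills the '*')."""
    return f"{city}-{year}-crime.csv"


def aggregate(incidents):
    """Count incidents per (year, month, day, category).

    incidents: iterable of (datetime.date, category) pairs.
    Returns rows sorted by date and then by category.
    """
    counts = Counter((d.year, d.month, d.day, category) for d, category in incidents)
    return [[y, m, day, category, n] for (y, m, day, category), n in sorted(counts.items())]


def write_standard_csv(rows, handle):
    """Write the header and rows to an open text file handle."""
    writer = csv.writer(handle)
    writer.writerow(COLUMNS)
    writer.writerows(rows)
```

A city whose source file uses a different column name (such as Seattle's 'crime-type') would only need its own step for building the (date, category) pairs; the aggregation and writing would stay the same.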