dat-ecosystem-archive / datproject-discussions

a repo for discussions and other non-code organizing stuff [ DEPRECATED - More info on active projects and modules at https://dat-ecosystem.org/ ]

pilot dataset/use cases #41

Open joehand opened 8 years ago

joehand commented 8 years ago

From @maxogden on July 16, 2013 18:46

describe and link to datasets that I can use as a way to pilot/test out dat!

via @loleg:

others:

Copied from original issue: maxogden/dat#12

joehand commented 8 years ago

From @schmod on July 24, 2013 22:38

The @unitedstates set of repositories seems to be attempting to accomplish many of dat's goals using only git and a selection of scraper scripts. The congress-legislators repository is a particularly good example, because it contains a list of scrapers/parsers that contribute to a YAML dataset which can be further refined by hand (or by other scripts) before being committed.

I'm not a huge YAML evangelist, but it works exceptionally well in this case, because it's line-delimited, and produces readable diffs.

joehand commented 8 years ago

From @shawnbot on July 25, 2013 1:25

I'm super interested in the github one. There could even be separate tools for extracting data from the GitHub JSON API (paginating results and transforming them into more tabular structures), like: $ ditty commits shawnbot/aight | dat update (P.S. I call dibs on the name "ditty").
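A rough sketch of what such a "ditty"-style extractor could look like ("ditty" is imaginary, as the comment says; the script below simply walks the public GitHub commits API, follows the Link-header pagination, and prints one flattened JSON record per commit so the output could be piped into something like dat):

#!/usr/bin/env node
// Hypothetical "ditty commits owner/repo" sketch: paginate the GitHub commits
// API and print one flattened, line-delimited JSON record per commit.
const [owner, repo] = (process.argv[2] || '').split('/');

async function main() {
  let url = `https://api.github.com/repos/${owner}/${repo}/commits?per_page=100`;
  while (url) {
    const res = await fetch(url, { headers: { 'User-Agent': 'ditty-sketch' } });
    if (!res.ok) throw new Error(`GitHub API returned ${res.status}`);
    for (const c of await res.json()) {
      // flatten the nested API response into a tabular-ish record
      process.stdout.write(JSON.stringify({
        sha: c.sha,
        author: c.commit.author.name,
        date: c.commit.author.date,
        message: c.commit.message.split('\n')[0]
      }) + '\n');
    }
    // follow the rel="next" pagination link, if any
    const link = res.headers.get('link') || '';
    const next = link.match(/<([^>]+)>;\s*rel="next"/);
    url = next ? next[1] : null;
  }
}

main().catch(err => { console.error(err); process.exit(1); });

Run as, say, node ditty-sketch.js shawnbot/aight and pipe the line-delimited output wherever it needs to go.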

The use case that I'm most interested in, though, is Crimespotting. There are two slightly different processes for Oakland and San Francisco, both of which run daily because that's as often as the data changes:

Oakland

  1. Read a single sheet from an Excel file at a URL
  2. Fix broken dates and discard unfixable ones
  3. Map Oakland's report types to Crimespotting's (FBI-designated) types
  4. Geocode addresses, caching lookups as you go (see the sketch after this list)
  5. Update the database
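A minimal sketch of the caching in step 4, assuming a generic geocoding HTTP endpoint and response shape (both placeholders, not a specific real service), with lookups persisted to a JSON file between daily runs:

// Hypothetical sketch: geocode street addresses, caching lookups on disk so a
// re-run only hits the geocoder for addresses it has not seen before.
const fs = require('fs');

const CACHE_FILE = 'geocode-cache.json';
const cache = fs.existsSync(CACHE_FILE)
  ? JSON.parse(fs.readFileSync(CACHE_FILE, 'utf8'))
  : {};

async function geocode(address) {
  if (cache[address]) return cache[address];            // cache hit: no network call
  const res = await fetch(
    'https://geocoder.example.com/lookup?q=' + encodeURIComponent(address)  // placeholder URL
  );
  const { lat, lon } = await res.json();                // assumed response shape
  cache[address] = { lat, lon };
  fs.writeFileSync(CACHE_FILE, JSON.stringify(cache));  // persist between runs
  return cache[address];
}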

San Francisco

  1. Read in one of many available report listings from a URL:
    • rolling 90-day Shapefile
    • 1-day delta in GeoJSON
    • all reports since January 1, 2008 packaged as a zip of yearly CSVs
  2. Map SF's report types to Crimespotting's (FBI-designated) types
  3. Update the database

Updating the database is the trickiest part right now, for both cities. When @migurski originally built Oakland Crimespotting, the process was to bundle up reports by day and replace per-day chunks in the database (MySQL). We ended up using the same system for San Francisco, but it doesn't scale well to backfilling the database with reports from 2008 to the present day, which requires roughly 2,000 POST requests.

My wish is that dat can figure out the diff when you send it updates, and generate a database-agnostic "patch" (which I also mentioned here) that notes which records need to be inserted or updated. These could easily be translated into INSERT and UPDATE queries and piped directly to the database, or collections to be POSTed and PUT to API endpoints.
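A minimal sketch of that translation step, assuming the diff is a JSON object of the shape { "inserts": [...], "updates": [...] } with flat row objects keyed by an "id" column (that shape and the table name are assumptions, not an actual dat format):

// Hypothetical translation of a diff file into INSERT and UPDATE statements.
// Assumes diff.json looks like { inserts: [row, ...], updates: [row, ...] }
// where each row is a flat object and "id" is the primary key.
const fs = require('fs');
const diff = JSON.parse(fs.readFileSync('diff.json', 'utf8'));

// tiny SQL quoting helper for the sketch; real code should use
// parameterized queries instead of string building
const quote = v => (typeof v === 'number' ? v : `'${String(v).replace(/'/g, "''")}'`);

for (const row of diff.inserts) {
  const cols = Object.keys(row);
  console.log(`INSERT INTO reports (${cols.join(', ')}) VALUES (${cols.map(c => quote(row[c])).join(', ')});`);
}

for (const row of diff.updates) {
  const sets = Object.keys(row)
    .filter(c => c !== 'id')
    .map(c => `${c} = ${quote(row[c])}`);
  console.log(`UPDATE reports SET ${sets.join(', ')} WHERE id = ${quote(row.id)};`);
}

The same diff object is what the curl/jq pipeline below POSTs and PUTs to the API endpoints.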

Here's how I would love to be able to update San Francisco:

# San Francisco: download the 90-day rolling reports shapefile
$ curl http://apps.sfgov.org/.../CrimeIncidentShape_90-All.zip > reports.shp.zip
# an imaginary `shp2csv` utility would convert a shapefile to CSV,
# `update-report-types` would convert the report types into the ones we know about,
# then `dat update` would read CSV on stdin and produce a diff as JSON
$ shp2csv reports.shp.zip \
  | update-report-types \
  | dat update --input-format csv --diff diff.json
# if our diff is an object with an array of "updates" and "inserts",
# we can grab both of those using jq and send them to the server
# (ideally with at least HTTP Basic auth)
$ upload() { curl --upload-file - -H "Content-Type: application/json" "$@"; }
$ jq -M ".updates" diff.json | upload --request POST "http://sf.crime.org/reports"
$ jq -M ".inserts" diff.json | upload --request PUT "http://sf.crime.org/reports"

:trollface:

joehand commented 8 years ago

From @msenateatplos on August 1, 2013 20:29

There are lots of academic research data repositories out there. One open access service with quite a bit of data is http://www.datadryad.org/, which holds approximately 10,579 data files.

There are more, especially lots of small repositories. Here are some lists:

  • general: http://databib.org/
  • general: http://oad.simmons.edu/oadwiki/Data_repositories
  • health: http://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html
  • various other (social site): http://figshare.com/

I'll start a separate issue about WikiData...

joehand commented 8 years ago

From @sballesteros on August 8, 2013 22:40

We are going to start a significant data gathering process very soon for communicable diseases circulating in the USA (first starting with NYC). Code will live here.

Unfortunately, the data is currently very dispersed and mostly lives in PDFs and scans. Health agencies and the CDC typically report communicable diseases on a weekly or monthly basis. After each update, a lot of analyses have to be re-run, so a tool like dat would help.

Our plan is to convert this dispersed data into a database. We are going to have to implement some transformation modules along the way, so it would be great to share our effort with dat. We will work with node.js and mongoDB.

In essence we will have a primary collection containing atomic JSON documents (at the row level) for each data entry; and we will implement SLEEP as another history collection tracking the transaction log of changes.
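A minimal sketch of that layout with the Node.js MongoDB driver, assuming a primary collection called reports and a history collection called changes, with a simple incrementing sequence number standing in for the SLEEP-style transaction log (collection and field names are illustrative, not a spec):

// Hypothetical sketch: row-level JSON documents in a primary collection plus a
// "changes" collection acting as a SLEEP-like transaction log.
const { MongoClient } = require('mongodb');

async function recordEntry(db, doc) {
  // write the row-level document to the primary collection
  const { insertedId } = await db.collection('reports').insertOne(doc);

  // append an entry to the history collection describing the change
  const [last] = await db.collection('changes')
    .find().sort({ seq: -1 }).limit(1).toArray();
  await db.collection('changes').insertOne({
    seq: last ? last.seq + 1 : 1,   // monotonically increasing change number
    subject: insertedId,            // which document changed
    kind: 'insert',
    at: new Date()
  });
}

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('diseases');
  await recordEntry(db, { disease: 'measles', region: 'NYC', week: '2013-W32', cases: 3 });
  await client.close();
}

main().catch(console.error);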

joehand commented 8 years ago

From @jkriss on August 13, 2013 15:55

There's a lot of data published and/or indexed by JPL that could benefit from dat. For instance, there's a portal for CO2 data. In this case, the data is coming in from multiple sources, and the end result of a search is just a download link.

There's also a really interesting visual browser with maps and scatterplots, but you can't currently download the subset of data you find with that tool.

I may be working with this group to create the next iteration of the data portal, so I'll probably be able to learn more about it and suggest dat or dat-like approaches.

joehand commented 8 years ago

From @IronBridge on August 21, 2013 12:34

Traditional ETL Replacement

For several years, we have been using traditional ETL tools like Pentaho to take large (4-10 GB) delimited data sets and transform them into other formats: JSON, database inserts, RESTful web service calls, and imports into big-data infrastructures like Hadoop.

Our use case is unique because we may take a large CSV file, transform the data, and then load it into multiple repositories. For example, this could be a direct database insert on one end and an HTTP POST on the other. It will be important for us to have a mechanism to determine whether any records in a batch have failed, and which ones.

Node.js streams, pipes, and the fact that it's JavaScript make it a very intriguing replacement for a traditional ETL tool.
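As a rough sketch of that shape with plain Node.js streams and async iteration (the line-by-line CSV handling and both sinks are simplified stand-ins; real data would need a proper CSV parser and batched writes), including a way to report which records in a batch failed:

// Hypothetical stream-based ETL pass: read a large delimited file line by
// line, transform each record, fan it out to more than one sink, and keep
// track of which records in the batch failed.
const fs = require('fs');
const readline = require('readline');

async function run(path) {
  const failed = [];
  let header = null;
  let lineNo = 0;

  const rl = readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity
  });

  for await (const line of rl) {
    lineNo++;
    const cells = line.split(',');           // naive split; real CSV needs a parser
    if (!header) { header = cells; continue; }

    const record = Object.fromEntries(header.map((h, i) => [h, cells[i]]));
    try {
      // fan out to multiple repositories; both sinks below are placeholders
      await Promise.all([writeToDatabase(record), postToApi(record)]);
    } catch (err) {
      failed.push({ line: lineNo, record, error: err.message });
    }
  }
  return failed;                              // which records in the batch failed, and why
}

// placeholder sinks -- assumptions for the sketch, not real endpoints
async function writeToDatabase(record) { /* e.g. a parameterized INSERT */ }
async function postToApi(record) {
  await fetch('https://example.org/records', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(record)
  });
}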