georgelu / directory

sachacks food resources project

Investigate Data Cleaning Requirements #8

Open georgelu opened 8 years ago

georgelu commented 8 years ago

I'll take a look at the data and sketch out some key fields by tomorrow evening.

For instance: structure recurring events, handling of one-off/special events, other requirements.

ashander commented 8 years ago

OK, I think this references #4

I understand the broad issue, but not the details. Cleaning the data in a one-off way makes sense for this prototype.

For longer-term sustainability of this idea, it'd be good to think about how folks from the non-profit community (or similar folks on the ground) will be able to maintain/alter the data that feeds into the app.

Perhaps if they provided a little more structure in their data on operating hours, we could provide a reusable solution to pipe data from CSV to the schema?
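A minimal sketch of what that CSV-to-schema pipe could look like in Python. The column names (`name`, `address`, `hours`) and output fields are assumptions for illustration, not the project's actual schema:

```python
import csv
import io
import json

# Hypothetical sample export; real column names would come from the
# spreadsheet the non-profits maintain.
CSV_DATA = """name,address,hours
Food Bank A,123 Main St,Mon-Fri 9-5
Pantry B,456 Oak Ave,Sat 10-2
"""

def csv_to_schema(text):
    """Convert rows from a CSV export into a list of JSON-ready records."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            "name": row["name"].strip(),
            "address": row["address"].strip(),
            # Keep the raw hours string for now; structured parsing of
            # recurring hours would be a later cleaning step.
            "hours_raw": row["hours"].strip(),
        })
    return records

print(json.dumps(csv_to_schema(CSV_DATA), indent=2))
```

If the source data had a consistent structure, this script could be rerun whenever the spreadsheet changes, rather than cleaning by hand each time.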

ashander commented 8 years ago

It seems like this also involves a discussion of overall design. @georgelu maybe this issue could serve as a stub to have that discussion?

(I had just created another issue #23 for discussing the data flow, but I think it's better to just have that discussion here.)

Main goals:

ashander commented 8 years ago

As an alternative to cleaning and preprocessing, it might be possible to do everything in-browser from a CSV.

For example, using http://papaparse.com/, but this would require writing JavaScript to do all the steps outlined in the data cleaning section of the README.

georgelu commented 8 years ago

Typical cases:

Atypical cases:

Hypothetical cases: (which are likely not worth supporting right now)

Desirable Data: (all should be both human and machine readable, and one field may require two JSON rows)

Optional data:
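On the point that one field may require two JSON rows, here's a hypothetical sketch of one logical field ("hours") stored in two forms: a machine-readable rule that the app can evaluate, and a human-readable string for display. The field names (`hours_rule`, `hours_display`) are invented, not the project's schema:

```python
from datetime import datetime

# Hypothetical record: the same logical field stored twice.
site = {
    "name": "Food Bank A",
    "hours_rule": {"weekdays": [0, 1, 2, 3, 4],  # Mon-Fri
                   "open": "09:00", "close": "17:00"},
    "hours_display": "Monday-Friday, 9 AM to 5 PM",
}

def is_open(site, when):
    """Evaluate the machine-readable rule; the display string is for the UI."""
    rule = site["hours_rule"]
    if when.weekday() not in rule["weekdays"]:
        return False
    # Zero-padded HH:MM strings compare correctly as plain strings.
    return rule["open"] <= when.strftime("%H:%M") < rule["close"]

print(is_open(site, datetime(2017, 3, 6, 10, 0)))  # a Monday morning
```

The duplication costs a little space but means neither the UI nor the query logic has to parse the other side's format.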

georgelu commented 8 years ago

To clarify, by pre-processing, do you mean manual cleaning/processing?

One thing to keep in mind is that data must be both human- and machine-readable. For instance, we want to cleanly display a site's address, but the Google Maps API may work best with coordinates or other non-trivial conversions. Another example: dates need to be easily comparable and sortable, while sometimes-complex recurrence logic needs to be clearly explained to users.
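On the comparable/sortable point, one option is to store dates as ISO-8601 strings, which sort chronologically as plain strings, alongside a separate human-facing note. The field names here are illustrative, not the project's schema:

```python
# Hypothetical event records: an ISO-8601 date for sorting/comparison
# plus a human-readable note explaining the recurrence.
events = [
    {"date": "2017-03-18", "note": "Third Saturday distribution"},
    {"date": "2017-03-04", "note": "First Saturday distribution"},
]

# ISO-8601 strings ("YYYY-MM-DD") sort chronologically with no parsing.
events.sort(key=lambda e: e["date"])
```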

Based on my relative skill with JS/Python, I'd favor adding more fields to the JSON output rather than doing further processing on the browser side. I'm not sure about the precise technical tradeoffs and can happily try to work either way.

ashander commented 8 years ago

Great stuff. Yes, I meant avoiding manual processing. Overall, I agree it makes sense to do the processing with Python to clean the JSON. Pushing forward on that will help us see if there's some additional structure on the human-readable side (be that wiki, Google Doc, or spreadsheet) that could make our task of using it programmatically easier. More tomorrow, - Jaime

ashander commented 8 years ago

The wxDateTime classes have some pretty powerful parsers capable of what we need. E.g., http://docs.wxwidgets.org/trunk/classwx_date_time.html#a4687372ebe55a6aded83de6a639cde95
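wxDateTime is a C++ class from wxWidgets, so it wouldn't drop straight into a Python cleaning script. As a rough stand-in for fixed-format dates, the stdlib's `datetime.strptime` works; a free-form parser like `wxDateTime::ParseDateTime` would need a third-party library such as dateutil. A sketch with an assumed input format:

```python
from datetime import datetime

# Assumes the cleaned data uses a fixed "YYYY-MM-DD HH:MM" format;
# free-form strings like "first Saturday" would need custom handling.
parsed = datetime.strptime("2017-03-04 10:00", "%Y-%m-%d %H:%M")

# Once parsed, dates compare and sort natively, and weekday() gives
# the day of week (Monday == 0) for recurrence checks.
print(parsed.weekday())
```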