markboots closed this issue 6 years ago
Decision: Level 2, plus define a set of transformation options to Level 3
Rationale: It's possible to derive higher levels from lower levels: given Level 1, an algorithm can extract Level 2; likewise, Level 3 can be extracted from Level 2. However, it's not possible to go the other direction due to lossiness, so we should pick the level that has enough data to support all target use-cases. Level 1 is more complex, and may have specifics tied to the implementation of each flow platform; it would require more coordination and changes to flow platforms. Level 3 might mean many different things and have many different variants depending on what a user wants; what works for some users might not work for others, and because Level 3 is lossy, some variants might not support the use-cases. Therefore we'll build the spec on Level 2 for the purpose of transferring between software tools, and define a set of transformation options that can be used to derive the Level 3 view a user wants from Level 2.
Flatten row results to columns (e.g. one column per "question"). Needs to provide options on how repeated questions for the same contact are handled.
Anonymize personal identifying info.
Preserve or remove timestamps.
Provide table data as CSV (instead of the Level 2 format).
Ordering of columns, if flattening to columns.
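As an illustration of the first transformation option, here is a minimal sketch (not part of the spec) of flattening Level 2 result rows to one value per question per contact; the `keep` parameter is a hypothetical name for the option controlling how repeated questions are handled:

```python
from collections import defaultdict

def flatten_rows(rows, keep="last"):
    """Flatten Level 2 result rows into one dict per contact, keyed by question.
    `keep` is "first" or "last", deciding which repeated answer survives.
    Each row: [timestamp, row_id, contact_id, question_id, result, metadata]."""
    contacts = defaultdict(dict)
    for _ts, _row_id, contact_id, question_id, result, _meta in rows:
        answers = contacts[contact_id]
        if keep == "first" and question_id in answers:
            continue  # keep the first answer, drop repeats
        answers[question_id] = result  # "last" overwrites earlier repeats
    return dict(contacts)

rows = [
    ["2017-05-23T12:35:37+00:00", 1, "c1", "UUID1", "male", None],
    ["2017-05-23T12:36:02+00:00", 2, "c1", "UUID1", "female", None],
]
print(flatten_rows(rows, keep="last"))   # {'c1': {'UUID1': 'female'}}
print(flatten_rows(rows, keep="first"))  # {'c1': {'UUID1': 'male'}}
```

Other keep strategies (e.g. collecting all repeats into a list) would slot into the same structure.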
Decision: Use Frictionless Data's Data Package specification
Rationale: Use an existing standard for data exchange between platforms that provides sufficient flexibility for the control we need.
Decision: Use JSON row array format, instead of CSV:
[
[ "Col A", "Col B", "Col C" ],
[ 1, 2, 3 ],
[ 4, 5, 6 ]
]
Rationale: Allows for a compact syntax (a JSON array of arrays) while keeping the unambiguous JSON format, and avoids embedding JSON blobs inside CSV (with the accompanying CSV escaping headaches). YAML's advantages don't apply here (we don't need manual editability of small documents). Using the row array format instead of the object format avoids repeating key bytes in each row on large datasets.
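Consuming the row array format is straightforward with any standard JSON parser; a small sketch, zipping a header row with the data rows to recover keyed records:

```python
import json

payload = """[
  [ "Col A", "Col B", "Col C" ],
  [ 1, 2, 3 ],
  [ 4, 5, 6 ]
]"""

rows = json.loads(payload)
header, data = rows[0], rows[1:]
# Rebuild keyed records from the compact row arrays.
records = [dict(zip(header, row)) for row in data]
print(records)
# [{'Col A': 1, 'Col B': 2, 'Col C': 3}, {'Col A': 4, 'Col B': 5, 'Col C': 6}]
```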
Decision: Use separate files or URLs for the schema definition and the data. Don't allow inline data, even though the top-level Data Package spec allows it.
Rationale:
Proposal A: Support the following data types:
Open question types require each row to provide the media type of that row, which could be:
Proposal B: As above, but the Open type could define a media type at the schema level for all rows, to avoid having to define it on a per-row basis. Drawback: it creates redundancy/synonyms between Open-Text and Text, or Open-Audio and Audio. Benefit: it allows defining the semantic nature of the question asked of the user separately from the byte data type of the result.
Decision: Incremental pagination or etags: "Give me the next 500 results starting after this key" (etags, or bookmarks like twilio)
Rationale: Don't allow absolute page-based pagination that shares the total number of pages, as this becomes a performance liability on large-scale results.
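A client consuming such a key-based API might look like the following sketch; the callback shape and parameter names are illustrative, not part of the spec:

```python
def fetch_all(fetch_page, page_size=500):
    """Drain a key-paginated endpoint. fetch_page(after_key, limit) must
    return (rows, next_key), with next_key None on the last page."""
    after = None
    while True:
        rows, after = fetch_page(after, page_size)
        yield from rows
        if after is None:
            break

# In-memory stand-in for the API, paging by row_id-like keys:
DATA = list(range(10))

def fake_fetch(after, limit):
    start = 0 if after is None else DATA.index(after) + 1
    page = DATA[start:start + limit]
    next_key = page[-1] if start + limit < len(DATA) else None
    return page, next_key

assert list(fetch_all(fake_fetch, page_size=3)) == DATA
```

Because the cursor is a key rather than a page number, the server never has to count total results, which is what avoids the performance liability on large datasets.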
Decision: use a row_id column. This column must be unique within the dataset.
Rationale:
Proposal A:
Rationale:
Proposal B:
Rationale:
In addition to the required fields of the Data Packages specification, the schema would also contain a definition for the nature of each "question":
Schema {
  metadata: {
    specification_version: "0.3.1",
    result_variables: [
      UUID1: {
        data_type: "multiple_choice",
        label: "Are you male or female?",
        data_type_options: {
          choices: ["male", "female", "not identified"]
        }
      },
      UUID2: {
        data_type: "multiple_choice",
        label: "Favorite ice cream flavor?",
        data_type_options: {
          choices: ["chocolate", "vanilla", "not identified"]
        }
      },
      UUID3: {
        data_type: "numeric",
        label: "How heavy are you?",
        data_type_options: {
          range: [1, 500]
        }
      },
      UUID4: {
        data_type: "open",
        label: "How are you feeling today?",
        data_type_options: {
          media_type: "text" // depending on choice for open representation, this might only be in rows
        }
      },
      UUID5: {
        data_type: "open",
        label: "How are you feeling today?",
        data_type_options: {
          media_type: "dynamic"
        }
      }
    ],
    row_count: 3832393
  }
}
The columns in the data set will be:
This would be represented in JSON row array format, e.g.:
[
  [ "timestamp", "row_id", "contact_id", "result_variable_id", "result", "result_metadata" ],
  [ "2017-05-23 12:35:37", 20394823948, 923842093, "UUID1", "male", {"option_order": "male,female"} ],
  [ "2017-05-23 12:35:47", 20394823950, 923842093, "UUID2", "chocolate", null ]
]
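A small sketch (not part of the spec) of how a consumer could join these data rows back to the schema to recover human-readable question labels; the variable names are illustrative:

```python
# Minimal slice of the schema's result_variables, keyed by variable UUID.
schema_vars = {
    "UUID1": {"data_type": "multiple_choice", "label": "Are you male or female?"},
    "UUID2": {"data_type": "multiple_choice", "label": "Favorite ice cream flavor?"},
}

# Rows in the fixed column order:
# timestamp, row_id, contact_id, result_variable_id, result, result_metadata
rows = [
    ["2017-05-23 12:35:37", 20394823948, 923842093, "UUID1", "male",
     {"option_order": "male,female"}],
    ["2017-05-23 12:35:47", 20394823950, 923842093, "UUID2", "chocolate", None],
]

labelled = [
    (contact_id, schema_vars[var_id]["label"], result)
    for _ts, _row_id, contact_id, var_id, result, _meta in rows
]
print(labelled[0])  # (923842093, 'Are you male or female?', 'male')
```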
I was pushing for the above on the call, and obviously there's value there, but after reconsidering I think it's out of scope.
@edjez , @ewheeler , @pld , any comments or input before we shape this into a draft?
Small question: for the data in JSON array format, we know the columns would be fixed and always the same columns. Should we retain the header line, including repeating it for every page on paginated API results?
[
  [ "timestamp", "row_id", "contact_id", "result_variable_id", "result", "result_metadata" ],
  [ "2017-05-23 12:35:37", 20394823948, 923842093, "UUID1", "male", {"option_order": "male,female"} ],
  [ "2017-05-23 12:35:47", 20394823950, 923842093, "UUID2", "chocolate", null ]
]
Or simply have this presumed:
[
  [ "2017-05-23 12:35:37", 20394823948, 923842093, "UUID1", "male", {"option_order": "male,female"} ],
  [ "2017-05-23 12:35:47", 20394823950, 923842093, "UUID2", "chocolate", null ]
]
Just a quick butt-in here, but please include the timezone and use something like ISO 8601 dates. :) I.e., something like "2017-06-19T14:17:33+00:00"
Thanks Nic, really appreciate your review and input!
We had discussed having all times/dates in here in UTC. (Logic: presentation into local timezone would be a responsibility at the app/view level. At the data exchange level between systems, the spec would be UTC.) Thoughts?
I would caution against that. The simplest example being birth dates which contain a time: those have very specific meanings in particular timezones, which are not maintained if normalized to UTC, and are potentially in different timezones per record.
If this is meant to be an interchange format then you can't throw that away without losing important information that is needed to act correctly downstream. (Any kind of scheduling based on result dates would be wrong for example across DST boundaries or if using relative offsets)
I think just saying dates need to be in iso format with tz offsets gets you all that. Implementations that don't care can output UTC but clients should all be able to deal with reading tz aware dates if sent along. That way the spec isn't forcing you to throw that data away.
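To illustrate the point: offset-aware ISO 8601 timestamps still compare correctly on the absolute instant while preserving the local context. A quick sketch using Python's standard library:

```python
from datetime import datetime

# Same instant, expressed with two different offsets.
a = datetime.fromisoformat("2017-06-19T14:17:33+00:00")
b = datetime.fromisoformat("2017-06-19T19:47:33+05:30")

assert a == b                            # equal as absolute instants
assert a.utcoffset() != b.utcoffset()    # but local context is preserved
```

A consumer that only cares about UTC can normalize on read; one that needs the local offset (e.g. for scheduling) still has it.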
Including time-zones like that makes sense. About the header question, I don't have a strong opinion: on the pro side, including the header makes it easier to process; on the con side, it's more data.
Well, while I'm at it, including headers in the same array seems super weird to me, we are returning JSON and will have other metadata such as paging information so seems like having that outside the actual "results" makes more sense. Excel jams that into the first row because they don't have the luxury of structure.
Would suggest one issue and thread per open topic/decision, so they can be commented independently.
Re: timezones, I vote for ISO 8601 with offset. The formatters/parsers for the time string deal with the actual day, so it should be transparent. It would be lossy to remove the timezone metadata, and this is not (should not be!) a performance issue.
Performance wise, the overhead of processing the timezone is completely irrelevant. Since in ISO8601 an offset is given instead of the timezone, it's just an arithmetic operation. No timezone database lookups are necessary.
Vote for keeping the format JSON-compatible, e.g. no extra layer of newlines or other additions.
Vote for simplifying the data type specs to have the primitives and just keep "dynamic".
Recap of decisions on the call today:
Timezones and time/date formats:
Rationale: 2017-06-19T14:17:33+05:30 keeps extra data, over requiring only UTC (potentially: the offset of the collector.) Note from Mark: I agree now there's no significant performance issue, as it's just an arithmetic offset (not a timezone like America/Regina that needs a DST lookup).
Decision: ISO 8601 with timezone offset
Example:
2017-06-19T00:02:00-04:00
2017-06-19T00:02:00-05:00
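These two example timestamps share the same wall-clock time but differ by one hour as absolute instants (e.g. across a DST boundary), which is exactly the information a UTC-only format would lose. A quick check with the standard library:

```python
from datetime import datetime

t1 = datetime.fromisoformat("2017-06-19T00:02:00-04:00")
t2 = datetime.fromisoformat("2017-06-19T00:02:00-05:00")

# Same local wall time, but one hour apart in absolute terms.
assert (t2 - t1).total_seconds() == 3600
```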
Header rows:
Rationale: it’s an incongruous structure to put into the "header" of the array. It’s also not necessary to include the headers, as they will be documented in the Schema file, and will always be the same 6 in the same order.
Decision: don’t include headers in the data row arrays.
Question: will we enforce that each row will be its own line?
A: Enforcing newline after each row?
B: Requiring escaping of newlines within rows
Rationale: stay within standard JSON, so we can use standard JSON parsers and streamers
Decision: Just JSON, no additional constraints
Contact ID format:
There is a promising need for a standard format, but it's not in the scope of this spec.
Decision: contact IDs are free-form
Open question type:
Decision: Proposal A: Open question type needs to specify data_type in row-by-row result_metadata
Rationale: simplicity of only one code path; no synonyms between Open-Picture and Picture.
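Under Proposal A, each "open" result row would carry its media type in its row-level metadata. A hypothetical sketch (the `media_type` field name is illustrative, not fixed by the spec):

```python
# Rows in the fixed column order:
# timestamp, row_id, contact_id, result_variable_id, result, result_metadata
rows = [
    ["2017-05-23T12:36:00+00:00", 3, "c1", "UUID4", "Feeling great!",
     {"media_type": "text"}],
    ["2017-05-23T12:36:30+00:00", 4, "c2", "UUID4", "https://example.org/clip.ogg",
     {"media_type": "audio"}],
]

# A consumer branches on the per-row media type; one code path per type.
media_types = [meta["media_type"] for *_, meta in rows]
print(media_types)  # ['text', 'audio']
```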
Can we have at least milliseconds on datetimes?
Good question @nicpottier on milliseconds for timestamps. I don't have this in the initial spec draft but it would be a quick change. Can you share a bit the rationale and advantage you're thinking about? Should they be required, or optional?
Just second precision is losing an awful lot of data there on possible ordering of results etc.. We actually do microsecond these days, but millis would at least be a start. I think if we are doing ISO8601 then I think fractional seconds are actually part of that (and can be of arbitrary precision) so maybe just having the example represent at least millis would drive that home.
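Fractional seconds compose cleanly with the ISO 8601 decision; a short sketch of emitting and round-tripping a millisecond-precision timestamp with Python's standard library:

```python
from datetime import datetime, timezone

# A timestamp with 123 ms of sub-second precision, in UTC.
now = datetime(2017, 6, 19, 14, 17, 33, 123000, tzinfo=timezone.utc)

# ISO 8601 with milliseconds and offset.
stamp = now.isoformat(timespec="milliseconds")
print(stamp)  # 2017-06-19T14:17:33.123+00:00

# Fractional seconds round-trip through a standard parser.
assert datetime.fromisoformat(stamp) == now
```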
Goal:
Create an open specification for exchange of "Flow Results".
Use-cases to be supported:
UNICEF U-Report: Dashboard of public polling results. Right now consumes directly from RapidPro using a one-to-one API. It would be valuable for U-Report to be able to consume data from any compliant system that publishes Flow Results.
Ona is being used for household surveys. Similar data is being collected via mobile channels (e.g. using InSTEDD Surveda), and we want to aggregate that data together. Alternatively, there is an existing bespoke system that wants to consume data from more sources; we want to allow the user of that bespoke system to have flexibility in which tools are used for collection into that system.
2.1 Already have a “data consumer” system that is established and has institutional support
2.2 Other example: some countries want to have their own personalized internal system, since they don't want the data collection tool to be the data repository system. They want to "own" the data repository, but have limited engineering resources to implement connectors to data collection tools. Now they only need to implement one interface for Flow Results, rather than one for each collection tool.
Ministries of health find more simplicity in hosting discussions if automated data transfer is easy. E.g.: Don’t want data sitting permanently in the cloud, but are OK with collecting in the cloud and storing temporarily if they
Integrating collected data with historical data. Somalia project: historical data about demographic distribution, plus live data on cash transfers in drought-handling support. Integrating historical and live data is currently being done in an ad-hoc way; we want unification and joining across geographic boundaries.
4.1 Deduplication might be useful in other contexts. (West Africa health information systems BAA meetings… Two separate data sets for the same area with the same overlapping clinic info. Facility management is surveyed over and over again by different partners; a standard schema could help with dedup processing.)
Formats
The specification should be useful for representing flow results contained in files, but also for results data served over HTTP by platforms via APIs.
Scope / What will this cover?
We can imagine three levels of detail in Flow Results:
Flow Log / Level 1: This provides for full-power analysis of “how” a contact ran through a flow. It contains everything that happened during a flow: entry points and exit points of nodes, timestamps of these, and the raw inputs provided during these nodes.
Flow Results / Level 2: This provides a complete, non-lossy description of all the semantically meaningful results that were generated by a contact during a flow, with basic timestamps. It includes repeats of results generated by a contact going through the same node more than once. It allows analysis of semantically-meaningful "answers" provided by contacts, with the ability to visualize or filter these over time. Not all "Results" necessarily need to be "answers to questions"; in some cases they may be computed results.
User Export Results / Level 3: This is a potentially lossy reduction or simplification of Level 2 results, optimized for human understanding and ease of import into spreadsheets and other user-facing data processing tools. It could contain a multitude of simplifications or reductions of the data, depending on what an individual user wants to look at.
Within the decisions below, we argue for the level that this project should tackle.
Decisions
A summary of decisions by the teams over our last 3 design calls: