markboots closed this issue 6 years ago
Decision: Level 2, plus define a set of transformation options to Level 3
Rationale: It's possible to derive higher levels from lower levels: given Level 1, an algorithm can extract Level 2; likewise, Level 3 can be extracted from Level 2. However, it's not possible to go the other direction due to lossiness, so we should pick the level that has enough data to support all target use-cases. Level 1 is more complex, and may have specifics tied to the implementation of each flow platform; it would require more coordination and changes to flow platforms. Level 3 might mean many different things and have many different variants depending on what a user wants; what works for some users might not work for others, and because Level 3 is lossy, some variants might not support the use-cases. Therefore we'll build the spec on Level 2 for the purpose of transferring between software tools, and define a set of transformation options that can be used to derive the Level 3 view a user wants from Level 2.
Flatten row results to columns (e.g. one column per "question"). Needs to provide options on how repeated questions for the same contact are handled.
Anonymize personal identifying info.
Preserve or remove timestamps.
Provide table data as CSV (instead of the Level 2 format).
Ordering of columns, if flattening to columns.
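As an illustration of the first transformation option, here is a minimal sketch (not part of the spec) of flattening Level 2 result rows to one value per question per contact; the `keep` parameter is a hypothetical name for the option controlling how repeated questions are handled:

```python
from collections import defaultdict

def flatten_rows(rows, keep="last"):
    """Flatten Level 2 result rows into one dict per contact, keyed by question.
    `keep` is "first" or "last", deciding which repeated answer survives.
    Each row: [timestamp, row_id, contact_id, question_id, result, metadata]."""
    contacts = defaultdict(dict)
    for _ts, _row_id, contact_id, question_id, result, _meta in rows:
        answers = contacts[contact_id]
        if keep == "first" and question_id in answers:
            continue  # keep the first answer, drop repeats
        answers[question_id] = result  # "last" overwrites earlier repeats
    return dict(contacts)

rows = [
    ["2017-05-23T12:35:37+00:00", 1, "c1", "UUID1", "male", None],
    ["2017-05-23T12:36:02+00:00", 2, "c1", "UUID1", "female", None],
]
print(flatten_rows(rows, keep="last"))   # {'c1': {'UUID1': 'female'}}
print(flatten_rows(rows, keep="first"))  # {'c1': {'UUID1': 'male'}}
```

Other keep strategies (e.g. collecting all repeats into a list) would slot into the same structure.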
Decision: Use Frictionless Data's Data Package specification
Rationale: Use an existing standard for data exchange between platforms that provides sufficient flexibility for the control we need.
Decision: Use JSON row array format, instead of CSV:
[
[ "Col A", "Col B", "Col C" ],
[ 1, 2, 3 ],
[ 4, 5, 6 ]
]
Rationale: Allows for a compact syntax (a JSON array of arrays) while keeping the unambiguous JSON format, and avoids embedding JSON blobs inside CSV (with the accompanying CSV escaping headaches). YAML's advantages don't apply here (we don't need manual editability of small documents). Using the row array format instead of the object format avoids repeating key bytes in each row on large datasets.
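Consuming the row array format is straightforward with any standard JSON parser; a small sketch, zipping a header row with the data rows to recover keyed records:

```python
import json

payload = """[
  [ "Col A", "Col B", "Col C" ],
  [ 1, 2, 3 ],
  [ 4, 5, 6 ]
]"""

rows = json.loads(payload)
header, data = rows[0], rows[1:]
# Rebuild keyed records from the compact row arrays.
records = [dict(zip(header, row)) for row in data]
print(records)
# [{'Col A': 1, 'Col B': 2, 'Col C': 3}, {'Col A': 4, 'Col B': 5, 'Col C': 6}]
```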
Decision: Use separate files or URLs for the schema definition and the data. Don't allow inline data, even though the top-level Data Package spec allows it.
Rationale:
Proposal A: Support the following data types:
Open question types require each row to provide the media type of that row, which could be:
Proposal B: As above, but the Open type could define a media type at the schema level for all rows, to avoid having to define it on a per-row basis. Drawback: it creates redundancy/synonyms between Open-Text and Text, or Open-Audio and Audio. Benefit: it allows defining the semantic nature of the question asked of the user separately from the byte data type of the result.
Decision: Incremental pagination or etags: "Give me the next 500 results starting after this key" (etags, or bookmarks like twilio)
Rationale: Don't allow absolute page-based pagination that shares the total number of pages, as this becomes a performance liability on large-scale results.
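A client consuming such a key-based API might look like the following sketch; the callback shape and parameter names are illustrative, not part of the spec:

```python
def fetch_all(fetch_page, page_size=500):
    """Drain a key-paginated endpoint. fetch_page(after_key, limit) must
    return (rows, next_key), with next_key None on the last page."""
    after = None
    while True:
        rows, after = fetch_page(after, page_size)
        yield from rows
        if after is None:
            break

# In-memory stand-in for the API, paging by row_id-like keys:
DATA = list(range(10))

def fake_fetch(after, limit):
    start = 0 if after is None else DATA.index(after) + 1
    page = DATA[start:start + limit]
    next_key = page[-1] if start + limit < len(DATA) else None
    return page, next_key

assert list(fetch_all(fake_fetch, page_size=3)) == DATA
```

Because the cursor is a key rather than a page number, the server never has to count total results, which is what avoids the performance liability on large datasets.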
Decision: use a row_id column. This column must be unique within the dataset.
Rationale:
Proposal A:
Rationale:
Proposal B:
Rationale:
In addition to the required fields of the Data Packages specification, the schema would also contain a definition for the nature of each "question":
Schema {
  metadata: {
    specification_version: "0.3.1",
    result_variables: [
      UUID1: {
        data_type: "multiple_choice",
        label: "Are you male or female?",
        data_type_options: {
          choices: ["male", "female", "not identified"]
        }
      },
      UUID2: {
        data_type: "multiple_choice",
        label: "Favorite ice cream flavor?",
        data_type_options: {
          choices: ["chocolate", "vanilla", "not identified"]
        }
      },
      UUID3: {
        data_type: "numeric",
        label: "How heavy are you?",
        data_type_options: {
          range: [1, 500]
        }
      },
      UUID4: {
        data_type: "open",
        label: "How are you feeling today?",
        data_type_options: {
          media_type: "text" // depending on choice for open representation, this might only be in rows
        }
      },
      UUID5: {
        data_type: "open",
        label: "How are you feeling today?",
        data_type_options: {
          media_type: "dynamic"
        }
      }
    ],
    row_count: 3832393
  }
}
The columns in the data set will be:
This would be represented in JSON row array format, e.g.:
[
  [ "timestamp", "row_id", "contact_id", "result_variable_id", "result", "result_metadata" ],
  [ "2017-05-23 12:35:37", 20394823948, 923842093, "UUID1", "male", {"option_order": "male,female"} ],
  [ "2017-05-23 12:35:47", 20394823950, 923842093, "UUID2", "chocolate", null ]
]
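A small sketch (not part of the spec) of how a consumer could join these data rows back to the schema to recover human-readable question labels; the variable names are illustrative:

```python
# Minimal slice of the schema's result_variables, keyed by variable UUID.
schema_vars = {
    "UUID1": {"data_type": "multiple_choice", "label": "Are you male or female?"},
    "UUID2": {"data_type": "multiple_choice", "label": "Favorite ice cream flavor?"},
}

# Rows in the fixed column order:
# timestamp, row_id, contact_id, result_variable_id, result, result_metadata
rows = [
    ["2017-05-23 12:35:37", 20394823948, 923842093, "UUID1", "male",
     {"option_order": "male,female"}],
    ["2017-05-23 12:35:47", 20394823950, 923842093, "UUID2", "chocolate", None],
]

labelled = [
    (contact_id, schema_vars[var_id]["label"], result)
    for _ts, _row_id, contact_id, var_id, result, _meta in rows
]
print(labelled[0])  # (923842093, 'Are you male or female?', 'male')
```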
I was pushing for the above on the call, and obviously there's value there, but after reconsidering I think it's out of scope.
@edjez , @ewheeler , @pld , any comments or input before we shape this into a draft?
Small question: for the data in JSON array format, we know the columns would be fixed and always the same columns. Should we retain the header line, including repeating it for every page on paginated API results?
[
  [ "timestamp", "row_id", "contact_id", "result_variable_id", "result", "result_metadata" ],
  [ "2017-05-23 12:35:37", 20394823948, 923842093, "UUID1", "male", {"option_order": "male,female"} ],
  [ "2017-05-23 12:35:47", 20394823950, 923842093, "UUID2", "chocolate", null ]
]
Or simply have this presumed:
[
  [ "2017-05-23 12:35:37", 20394823948, 923842093, "UUID1", "male", {"option_order": "male,female"} ],
  [ "2017-05-23 12:35:47", 20394823950, 923842093, "UUID2", "chocolate", null ]
]
Just a quick butt-in here, but please include the timezone and use something like ISO 8601 dates. :) I.e., something like "2017-06-19T14:17:33+00:00"
Thanks Nic, really appreciate your review and input!
We had discussed having all times/dates in here in UTC. (Logic: presentation into local timezone would be a responsibility at the app/view level. At the data exchange level between systems, the spec would be UTC.) Thoughts?
I would caution against that. The simplest example being birth dates which contain a time: those have very specific meanings in particular timezones, which are not maintained if normalized to UTC, and are potentially in different timezones per record.
If this is meant to be an interchange format then you can't throw that away without losing important information that is needed to act correctly downstream. (Any kind of scheduling based on result dates would be wrong for example across DST boundaries or if using relative offsets)
I think just saying dates need to be in iso format with tz offsets gets you all that. Implementations that don't care can output UTC but clients should all be able to deal with reading tz aware dates if sent along. That way the spec isn't forcing you to throw that data away.
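To illustrate the point: offset-aware ISO 8601 timestamps still compare correctly on the absolute instant while preserving the local context. A quick sketch using Python's standard library:

```python
from datetime import datetime

# Same instant, expressed with two different offsets.
a = datetime.fromisoformat("2017-06-19T14:17:33+00:00")
b = datetime.fromisoformat("2017-06-19T19:47:33+05:30")

assert a == b                            # equal as absolute instants
assert a.utcoffset() != b.utcoffset()    # but local context is preserved
```

A consumer that only cares about UTC can normalize on read; one that needs the local offset (e.g. for scheduling) still has it.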
Including time-zones like that makes sense. About the header question, I don't have a strong opinion: on the pro side, including the header makes it easier to process; on the con side, it's more data.
Well, while I'm at it, including headers in the same array seems super weird to me, we are returning JSON and will have other metadata such as paging information so seems like having that outside the actual "results" makes more sense. Excel jams that into the first row because they don't have the luxury of structure.
Would suggest one issue and thread per open topic/decision, so they can be commented independently.
Re: timezones, I vote for ISO 8601 with offset. The formatters/parsers for the time string deal with the actual day, so it should be transparent. It would be lossy to remove the timezone metadata, and this is not (should not be!) a performance issue.
Performance wise, the overhead of processing the timezone is completely irrelevant. Since in ISO8601 an offset is given instead of the timezone, it's just an arithmetic operation. No timezone database lookups are necessary.
Vote for keeping the format JSON-compatible, e.g. no extra layer of newlines or other additions.
Vote for simplifying the data type specs to have the primitives and just keep "dynamic".
Recap of decisions on the call today:
Timezones and time/date formats:
Rationale: 2017-06-19T14:17:33+05:30 keeps extra data, over requiring only UTC (potentially: the offset of the collector.) Note from Mark: I agree now there's no significant performance issue, as it's just an arithmetic offset (not a timezone like America/Regina that needs a DST lookup).
Decision: ISO 8601 with timezone offset
Example:
2017-06-19T00:02:00-04:00
2017-06-19T00:02:00-05:00
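These two example timestamps share the same wall-clock time but differ by one hour as absolute instants (e.g. across a DST boundary), which is exactly the information a UTC-only format would lose. A quick check with the standard library:

```python
from datetime import datetime

t1 = datetime.fromisoformat("2017-06-19T00:02:00-04:00")
t2 = datetime.fromisoformat("2017-06-19T00:02:00-05:00")

# Same local wall time, but one hour apart in absolute terms.
assert (t2 - t1).total_seconds() == 3600
```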
Header rows:
Rationale: it’s an incongruous structure to put into the "header" of the array. It’s also not necessary to include the headers, as they will be documented in the Schema file, and will always be the same 6 in the same order.
Decision: don’t include headers in the data row arrays.
Question: will we enforce that each row will be its own line?
A: Enforcing newline after each row?
B: Requiring escaping of newlines within rows
Rationale: stay within standard JSON, so we can use standard JSON parsers and streamers
Decision: Just JSON, no additional constraints
Contact ID format:
There is a promising need for a standard format, but it's not in the scope of this spec.
Decision: contact IDs are free-form
Open question type:
Decision: Proposal A: Open question type needs to specify data_type in row-by-row result_metadata
Rationale: simplicity of only one code path; no synonyms between Open-Picture and Picture.
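Under Proposal A, each "open" result row would carry its media type in its row-level metadata. A hypothetical sketch (the `media_type` field name is illustrative, not fixed by the spec):

```python
# Rows in the fixed column order:
# timestamp, row_id, contact_id, result_variable_id, result, result_metadata
rows = [
    ["2017-05-23T12:36:00+00:00", 3, "c1", "UUID4", "Feeling great!",
     {"media_type": "text"}],
    ["2017-05-23T12:36:30+00:00", 4, "c2", "UUID4", "https://example.org/clip.ogg",
     {"media_type": "audio"}],
]

# A consumer branches on the per-row media type; one code path per type.
media_types = [meta["media_type"] for *_, meta in rows]
print(media_types)  # ['text', 'audio']
```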
Can we have at least milliseconds on datetimes?
Good question @nicpottier on milliseconds for timestamps. I don't have this in the initial spec draft but it would be a quick change. Can you share a bit the rationale and advantage you're thinking about? Should they be required, or optional?
Just second precision is losing an awful lot of data there on possible ordering of results etc.. We actually do microsecond these days, but millis would at least be a start. I think if we are doing ISO8601 then I think fractional seconds are actually part of that (and can be of arbitrary precision) so maybe just having the example represent at least millis would drive that home.
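Fractional seconds compose cleanly with the ISO 8601 decision; a short sketch of emitting and round-tripping a millisecond-precision timestamp with Python's standard library:

```python
from datetime import datetime, timezone

# A timestamp with 123 ms of sub-second precision, in UTC.
now = datetime(2017, 6, 19, 14, 17, 33, 123000, tzinfo=timezone.utc)

# ISO 8601 with milliseconds and offset.
stamp = now.isoformat(timespec="milliseconds")
print(stamp)  # 2017-06-19T14:17:33.123+00:00

# Fractional seconds round-trip through a standard parser.
assert datetime.fromisoformat(stamp) == now
```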
Goal:
Create an open specification for exchange of "Flow Results".
Use-cases to be supported:
UNICEF U-Report: Dashboard of public polling results. Right now consumes directly from RapidPro using a one-to-one API. It would be valuable for U-Report to be able to consume data from any compliant system that publishes Flow Results.
Ona is being used for household surveys. Similar data is being collected via mobile channels (e.g. using InSTEDD Surveda), and we want to aggregate that data together. Alternatively, there is an existing bespoke system that wants to consume data from more sources; we want to allow the user of that bespoke system to have flexibility in which tools are used for collection into that system.
2.1 Already have a “data consumer” system that is established and has institutional support
2.2 Other example: some countries want to have their own personalized internal system, since they don't want the data collection tool to be the data repository system. They want to "own" the data repository, but have limited engineering resources to implement connectors to data collection tools. Now they only need to implement one interface for Flow Results, rather than one for each collection tool.
Ministries of health find more simplicity in hosting discussions if automated data transfer is easy. E.g.: Don’t want data sitting permanently in the cloud, but are OK with collecting in the cloud and storing temporarily if they
Integrating collected data with historical data. Somalia project: historical data about demographic distribution, plus live data on cash transfers in drought-handling support. Integrating historical and live data is currently being done in an ad-hoc way; we want unification and joining across geographic boundaries.
4.1 Deduplication might be useful in other contexts. (West Africa health information systems BAA meetings… Two separate data sets for the same area with the same overlapping clinic info. Facility management is surveyed over and over again by different partners; a standard schema could help with dedup processing.)
Formats
The specification should be useful for representing flow results contained in files, but also for results data served over HTTP by platforms via APIs.
Scope / What will this cover?
We can imagine three levels of detail in Flow Results:
Flow Log / Level 1: This provides for full-power analysis of “how” a contact ran through a flow. It contains everything that happened during a flow: entry points and exit points of nodes, timestamps of these, and the raw inputs provided during these nodes.
Flow Results / Level 2: This provides a complete, non-lossy description of all the semantically meaningful results that were generated by a contact during a flow, with basic timestamps. It includes repeats of results generated by a contact going through the same node more than once. It allows analysis of semantically-meaningful "answers" provided by contacts, with the ability to visualize or filter these over time. Not all "Results" necessarily need to be "answers to questions"; in some cases they may be computed results.
User Export Results / Level 3: This is a potentially lossy reduction or simplification of Level 2 results, optimized for human understanding and ease of import into spreadsheets and other user-facing data processing tools. It could contain a multitude of simplifications or reductions of the data, depending on what an individual user wants to look at.
Within the decisions below, we argue for the level that this project should tackle.
Decisions
A summary of decisions by the teams over our last 3 design calls: