Flow Results and RapidPro version management

FLOIP / flow-results

Open specification for the exchange of "Results" data generated by mobile platforms using the "Flow" paradigm

6 stars 2 forks source link

Flow Results and RapidPro version management #34

Open markboots opened 5 years ago

markboots commented 5 years ago

To determine how RapidPro can support Flow Results, particularly including how to manage results against multiple versions of flows.

Reference on the Flow Interop side: two options proposed for how to manage versions:

https://floip.gitbooks.io/flow-results-specification/content/specification.html#results-versioning

@nicpottier , can you add the background on RapidPro result versioning, and explain the challenge?

Thanks!

nicpottier commented 5 years ago

A thousand million apologies for the delay on this.

Quick glossary, in RapidPro we call different versions of a flow "revisions". Version is reserved for actual schema changes to the flow specification as opposed to changes to a particular flow. So when I say revision below think FLOIP version.

So RapidPro has a slightly different view of results than what is captured here. For us, results are basically a key/value map. Flows can have hundreds of revisions and a contact can interact with multiple revisions of a flow (this is actually quite common), so we don't have a concept really of the results "for a revision".

Perhaps we could use option 1 of result versioning, by essentially saying that the package id is the combination of the flow UUID and revision number, but that seems really limiting for most use cases, especially given that flow revisions change constantly. (moving a node creates a new revision for example, so even changes which don't affect result formats create new revisions)

Option 2 really doesn't map to the reality of RapidPro results since results span multiple flow revisions.

RapidPro could basically ignore the parameters for Option 2 and include all results we have, but I'm worried about the caller dealing with question IDs that don't exist in questions. (because new questions have been added for example) I guess that's one question, is does the contract allow the API to return EXTRA responses to questions not available in the resource data? To me that would match the lessons learned by self descriptive formats such as JSON and XML that allowing extra data in API responses is desirable.

So maybe there needs to be an Option 3 which says, you get all results that exist and it is up to you as the caller to interpret those results based on the version of the schema you have? Would that be a possibility?

markboots commented 5 years ago

Thanks a lot Nic! Sorry now for my delayed response. Still thinking about what the most useful solution would be for FR users.

I agree that Option 1 is really limiting. Presenting results across many RapidPro revisions is the useful thing.

Option 3 might be possible, although right now the caller doesn't request Option 1 or Option 2 -- it's the server that decides what to implement. From a caller perspective, it would be nice to count on there being a question in the schema for each row returned... Ensuring you can interpret all the data, or don't need to filter out data you can't interpret.

Wondering if:

1 - In the RapidPro results map, do you associate results with the revision that generated them? Do you have the ability to filter results by revision?

2 - When rendering the FR schema for Flow Revision D, could you include all questions that might have been possible in Revision A, B, C, and D? (Hits the goal of including all results, and caller can interpret the meaning of all results.) This might work if you render the schema from the union of all results in your result map to-date, or to-latest-version (assuming you have a performant way of getting all results keys).

3 - Less ideal, could you filter out (on your server side) results that don't match a question in the current schema?

From the current spec:

In this case, the response data includes responses collected under multiple versions. API access may implement the filter parameters min-version and max-version to allow clients to selectively retrieve responses from specific versions. (If a client has cached a version of the schema from a Package descriptor, it is recommended to supply the Package's modified descriptor as the max-version when querying the API for responses, to ensure it does not receive responses from newer versions without a corresponding question in the cached schema.)

The "may" means a server implementation doesn't have to implement version filtering if that's not possible. However, then we need to figure out how a caller deals with result rows it doesn't recognize, from its version of the schema.

In the Viamo implementation of Flow Results, we include in the schema for Version/Revision D all of the questions that might have appeared in Revision A - D. This way, we guarantee that questions are never removed in a later FR schema version (Option 2 requirement), and we can provide all results ever gathered (unless limited by the min-version and max-version filters).

nicpottier commented 5 years ago

No, results are not tied to revisions in any way. If you collect a field named age then that's just a field that exists and it is up to you to interpret it.
Not really, no. We do keep revisions around, but there are thousands of them on some flows and pruning these is on the roadmap.
This we could do fairly easily, only spitting out rows that map to questions in the current schema. But I don't this this addresses callers with an older schema and having it encounter answers that it doesn't know about.

Seems like just having the contract that the API CAN return extra data and it is up to the caller to interpret / ignore those if it wants (especially in light of the may when it comes to revision filtering) would make this fairly straightforward.

markboots commented 5 years ago

OK, thanks for that info. Because RapidPro isn't associating results with revision numbers, and doesn't plan to retain all revisions (and assuming you aren't considering adding result->revision tracking?):

Possibilities available:

We should clarify in the spec what to do if a server doesn't/can't implement min-version and max-version filters (e.g. return 501 Not Implemented if it gets a request with these filter params)
What about this idea?
1. If you get a max-version filter, use that revision of the flow to generate the schema. (If you don't have that revision because its been pruned, return some appropriate error.) Then filter out rows that don't have a question in that schema? This allows a client to ask for results that it can interpret based on the schema it has.

More substantial changes to the spec would be:

An option for clients to ask for all result rows available, even if they don't match the schema.

One of the main goals of Flow Results is to carry the ability to interpret results along with those results. To avoid breaking this, I'd prefer that clients need to explicitly ask for non-interpretable result rows, rather than get them by default. A client can always ask for the latest schema first, and then query results with this schema as max-version, to get all result rows. (*Not quite: with your implementation, it wouldn't get result rows for "deleted" questions (e.g. RP result keys that don't exist in the current RP flow).

@pld @nditada , do you have thoughts here on what you'd prefer as an API client?

nicpottier commented 5 years ago

I guess I don't understand the resistance to having clients deal with extra data gracefully and making that part of the spec. Seems like that's the big learning from JSON/XML APIs vs binary formats, that having extra data and being able to ignore it is a fundamental property to making non-fragile interoperability happen.

What am I missing there?

markboots commented 5 years ago

My rationale is that it comes from the purpose of Flow Results: the problem that its solving is carrying the context needed to interpret results along with those results. That's where FR goes beyond (e.g.) a CSV of text results that all of our apps can export now.

I assume that a typical Flow Results client will be displaying results in a dashboard, or doing some kind of analysis on them, that depends on the semantic context in the schema. FR aims to make this easy, in a general way, without needing to parse a full flow description. The common case for clients would be to want to ignore rows that you can't analyze. (Clients with a purpose of archiving/data-warehousing would be the exception.)

Therefore I prefer a default where clients have to ask for non-interpretable rows, rather than requiring all clients to potentially deal with rows that "break" the schema. That said, eager to hear from other perspectives -- what do others think, and what would be your expected use-case to optimize for?

ewheeler commented 5 years ago

I think that a Flow Results endpoint needs to return all of the relevant flow results by default, especially the non-interpretable rows. In a lot of situations, the overall summary statistics will be meaningful regardless of the specific contents of any particular flow result row. If the default serves only a subset of a flow's results, there is a huge risk of people misinterpreting these results since the various revisions don't necessarily convey any meaningful differences.

We have to assume that a flow's semantic meaning remains unchanged across all flow revisions. If authors/editors are introducing major semantic changes within their flows, they should configure separately as a new flow or make a custom client to interpret results across all their revisions. If users are making small not-so-meaningful changes to their flows, they should get all of the results of all revisions together. Only the flow author and editor can make this determination of semantic change--and unless all FR-supporting tools add functionality for explicit, semantic flow versioning performed by users, then we need to strongly reinforce the fact that preserving the ability to reliably interpret flow results is solely the flow author/editor's responsibility.

If flow authors have not thought about their analysis needs while revising flows, then we need a model that can encourage better design & data practices instead of discarding/ignoring data that could greatly distort interpretations of flow results. The flexibility and adaptability of the flow paradigm is a feature--not a bug--that enables iterative design approaches and adaptive management of programmes with very little friction. Naturally, a flow's results will embody these patterns. So why would we treat some subset of a flow's results as 'correct' and make others invisible?

If one of the main goals of Flow Results is to include the ability to interpret results along with those results, then we must include all of these results. Some of our users will have to think critically and improve their flow design and flow management practices before they are able to meaningfully interpret their results, but we have to give them a full view in all of its messy and incompatible splendor in order to drive improved and responsible use of these tools.

nicpottier commented 5 years ago

Tend to agree with Evan there.

On another related note, I think it is probably really dangerous if we are seeking any kind of real interop to have optional features that the server can or cannot satisfy. That creates not one spec but n specs where n is the n! of the number of optional features because invariably clients will build to the use case they want against the single server endpoint they care about at that point in time but by doing so not create something that will work universally.

So whatever direction we go here I would very strongly prefer that any optionality in the server behavior be removed. Rather clients should be able to forgiving / all encompassing if there are some servers that return things and others that don't.

markboots commented 5 years ago

Thanks Evan for weighing in; as users that have thought a lot about dashboarding and archiving situations, it's helpful to have that perspective in here.

There's something I don't understand in your response, and I think a quick call with you and Nic could help. I'll send an invitation for next week; would be great if @pld or Matt could also join, in their perspective as RP / FR data consumers.

What I didn't understand is your point that retaining the semantic meaning across versions should be a planning responsibility of the person authoring/changing the flow:

If flow authors have not thought about their analysis needs while revising flows, then we need a model that can encourage better design & data practices instead of discarding/ignoring data that could greatly distort interpretations of flow results.

I understood we were tackling a limitation in RapidPro here, rather than a lack of planning by the person making a new revision. Here's a practical example:

Publish a revision of a flow with 10 "questions" and collect significant data on it.
Realize that one of these questions is no longer a priority, and remove Question 7 in a new revision. Collect more data on the new revision.
Client requests a FR schema and data to display these results in a dashboard.

In a FR implementation that is compliant with the current spec, the most recent schema still includes all 10 questions (unless requesting revision/version limits). Because RapidPro doesn't guarantee keeping old flow revisions around, and because it doesn't associate results with revisions, it can't handle this, and therefore the latest schema served by RapidPro would only have 9 questions. This loses the semantic meaning of Question 7. Is there a way this could be avoided through better planning/thoughtfulness from the person revising the flow?

Back to the start: if a FR server can't provide semantic meaning for a result row, should it provide the result row at all? If so, how should that result row be formatted? The goal of FR was that it allows interpreting the meaning for all data, in a self-contained way.

I thought that @ewheeler had a compelling point that the total number of interactions is valuable data. We could get there, and keep the ability to interpret all of them, if RapidPro could generate a 'union' schema across all revisions.

In the other implementations of Flow Results this isn't an issue. E.g. in the Viamo implementation, because we store past revisions, we include all "questions" in the schema that correspond to the revision/version range requested. Then all rows in the result set are interpretable.

On another related note, I think it is probably really dangerous if we are seeking any kind of real interop to have optional features that the server can or cannot satisfy.

Based on this, would you be up for finding a way to providing interpretation for all the rows across all revisions? This avoids the first deviation from the current spec, and gets rid of optionality. For maximizing value to FR consumers, it's also the most helpful: you get all results, can get accurate totals, and interpret them all. This hits Evan's suggestion of "then we need a model that can encourage better design & data practices"

nicpottier commented 5 years ago

So yes, we could add revisions to results, though can't necessarily guarantee we would keep those around forever, our revisions are also very very verbose, so I don't think having every revision would actually be something most people would want.

More importantly, we have 500M results that DON'T have revision information around them, so we would still want a way to have those results available via FR, so any solution that alienates those is a failure in my view.

So had a though this morning that I'd like some serious consideration of, which I think addresses most of the concerns here clearly and easily while also simplifying the proposed API.

Part of the sticking point here is this idea of versioning of results, and getting results for only specific versions etc.. Seeing how there is disagreement on the importance and handling of this, this feels a lot like something which does not belong as a piece of the FR API. While it MAY be important to some users, I don't think it is a fundamental aspect of the value of a standardized API for getting results. More importantly, I don't think we have to lose the ability of having that kind of specificity for backends that chose to expose it.

So what if we instead said:

FR has no concept of versions, just flow "ids"
FR already has a concept of getting a listing of flows available for an authorized user. Backends may chose to expose flow versions within this endpoint. IE, if a backend feels that versioning is useful to its users and wants to differentiate results that way, it could return a list of flows (why are these called packages btw?) Note the -v suffix here for the second and third result.

{
            "type": "packages",
            "id": "0c364ee1-0305-42ad-9fc9-2ec5a80c55fa",
            "attributes": {
                "title": "Standard Test Survey (All Versions)",
                "name": "standard_test_survey",
                "created": "2015-11-26 02:59:24+00:00",
                "modified": "2017-12-04 15:54:44+00:00"
            }
},
{
            "type": "packages",
            "id": "0c364ee1-0305-42ad-9fc9-2ec5a80c55fa-v13",
            "attributes": {
                "title": "Standard Test Survey (v13)",
                "name": "standard_test_survey_v13",
                "created": "2015-11-26 02:59:24+00:00",
                "modified": "2017-12-04 15:54:44+00:00"
            }
},
{
            "type": "packages",
            "id": "0c364ee1-0305-42ad-9fc9-2ec5a80c55fa-v1-13",
            "attributes": {
                "title": "Standard Test Survey (v1-13)",
                "name": "standard_test_survey_v1_13",
                "created": "2015-11-26 02:59:24+00:00",
                "modified": "2017-12-04 15:54:44+00:00"
            }
}
}

Clients must be written to be deal with getting results that are not in the schema. (and either throwing them out, or representing them generically, taking guesses, whatever, that's up to the client)

This has a few big advantages in my opinion.

It removes the concept of versioning as a fundamental aspect that needs agreement on across providers. This is not something that should be in the spec anyways in my opinion, the big gain here is just having a standardized way of getting results, not versioning.
It allows backends to make decisions about what versions to expose. Revisions in RapidPro for example are lightweight and made automatically as someone edits a flow, they may represent something meaningful, such as adding or removing a question, or they may represent something benign such as moving a node around. (or it could be something in between, such as tweaking the wording on a question) In any case, using either modified_on or revision is too coarse a way of determining whether there is a real change. If we decided in the future that we wanted to expose versioned results in the future, we could add only those revisions that represent meaningful boundaries in the flow versions to the flow list endpoint.
It makes the simple / common path of the API super easy and clear, with graceful and powerful semantics available to backends if they want to expose versioning in some way. No optional features for backends to implement, no optional features for clients to consider.
It makes the job of clients easier. There is then only one API, no optional parameters, no need to think about versioning at all really. If you imagine the API being turned into a UI of a drop down that the user is using, then it is similarly easy to imagine how with the exact same API and implementation, different backends could easily expose different options for the user (or not), but the client really doesn't need to know about this.

Thoughts?

@ewheeler @rowanseymour

rowanseymour commented 5 years ago

Nothing much to add other than I'd concur that it's been our experience that users consider a flow to be a fairly fixed set of questions, and when that set of questions needs to change, so does the flow. Now that might be in part because they don't much choice in RapidPro but we've not had complaints that I can recall about this. It's certainly a nice simplification elsewhere to say that a flow has a single schema.

markboots commented 5 years ago

Hi @nicpottier , thanks for the call last week. I've cleaned up my notes from the call, including a summary of the solution/spec changes I think we arrived at. What do you think of this?

Implementations are free to determine how changes to flows are represented.
1. For example, new versions of flows might be published as new distinct package IDs, or by publishing results from different versions of flows under the same package ID, or a combination of these approaches. The Flow Results specification leaves decisions on version handling up to implementations, within the following constraints:
Schemas for a single package ID may evolve over time, but they must enable interpreting all results within that package.
1. For example, this typically means that changes to a schema must be "expansive-only": an updated schema must still include questions that were deleted in the latest version, if results have been collected on those questions.
Clients must be prepared to receive results from an updated schema, and request the latest schema if they need to interpret these.
1. For example:
  1. a client has cached the schema for a package,
  2. a question is added in a newer version of the Flow, and more results are collected.
  3. The client requests the latest results. It could find result rows with question IDs that do not exist in its cached version of the schema. It must be prepared to re-fetch the updated schema for the package ID to interpret these unknown question IDs.
Implementations may (should?) provide the "deleted" attribute on questions, to indicate that a question in the schema has been removed from the current flow, but is retained in the schema to interpret older results.
When an implementation choses to group different versions of flows under a single package ID, it may expose the 'version' attribute on packages, in the package listing. 'version' can be any string, but versions within the same package ID must to be sortable and monotonically increasing.
1. (Nic, this is meant to take your idea in the response above, but moving the "-v..." from the end of the ID to a separate attribute. This avoids breaking the UUIDv4 format for IDs.)
When an implementation choses to group different versions/revisions of flows under a single package ID, and supports filtering results by version, it must use the min-version and max-version keys as the filter parameters in https://floip.gitbooks.io/flow-results-specification/content/api-specification.html#get-responses-for-a-package. (Implementations are not required to support version filtering; if version filtering is not supported, these filters parameters are ignored.)

With these changes, we would remove the 'modified' attribute from packages, and introduce the 'deleted' boolean attribute on questions in the schema. There would be no more "Option 1" and "Option 2" in results versioning.

Does this capture where we arrived at together on last week's call, and does it seem feasible from the RapidPro side? Hope this gets close to meeting all of our needs!

markboots commented 5 years ago

@pld , @nditada , any thoughts on this, as far as changes from your existing implementations? I'll write up more clear spec changes if we get a 👍 from @nicpottier .

If you implemented Option 1 for version handling (no version handling), then I don't think there's any changes needed on your side.
If you implemented Option 2, the only change needed to keep up with the spec would be to replace the 'modified' attribute on packages with the 'version' attribute.

This relaxes the rules for implementations around version handling, but existing implementations of either the previous Option 1 or Option 2 still work, with that minor change.

pld commented 5 years ago

Hi @markboots, I don't see a problem with supporting this, in fact it relates to some challenges we're addressing in ingesting OnaData into Canopy.

About the "expansive-only" schemas mentioned in (2.), does this apply to changes in schema type as well? For example, if the edit of a flow is to change a text column to a numeric column, would we represent this as marking the current text column as deleted (keeping its name the same) and creating a new numeric column with the same title as the now deleted text column but a different name? Reasoning being someone might write visualizations to interpret the numeric only column that would break if they operated on the historical text data, or reversing the change from existing numeric to text, an existing visualization that works only with numeric data would now break with new data.

NB @ukanga who knows more about the OnaData implementation of FLOIP

markboots commented 5 years ago

Hey @pld , thanks for this!

About the "expansive-only" schemas mentioned in (2.), does this apply to changes in schema type as well? For example, if the edit of a flow is to change a text column to a numeric column, would we represent this as marking the current text column as deleted (keeping its name the same) and creating a new numeric column with the same title as the now deleted text column but a different name? Reasoning being someone might write visualizations to interpret the numeric only column that would break if they operated on the historical text data, or reversing the change from existing numeric to text, an existing visualization that works only with numeric data would now break with new data.

I'm in agreement with that. I think the most concise definition is, "the schema must allow interpreting all the results in the package". Implementations could have some freedom as long as that is maintained. So in your example,

If the switch is from numeric to text: any old (numeric) answers would still valid under the new schema. You could keep a single question ID, update the schema, and keep all results under the original question ID.
If the switch is from text to numeric: some old answers would not be valid under the new question type. As you said, you could introduce a new question ID with the numeric type, post new results under that question ID, and have an old question ID (deleted) for the original results.
Any other approach that leaves the schema able to fully+consistently interpret the results is also OK.

markboots commented 5 years ago

If the switch is from numeric to text: any old (numeric) answers would still valid under the new schema. You could keep a single question ID, update the schema, and keep all results under the original question ID.

Maybe I missed something from your example, thinking from the client side. It might be a problem for a client that built a dashboard expecting numeric responses, to have that question type change to text. The client's bucket for those responses wouldn't be prepared for text.

Might be safer for clients if a new question ID was created, and the old question ID was marked as deleted, and contained all the old results? Thoughts?

pld commented 5 years ago

If the switch is from numeric to text: any old (numeric) answers would still valid under the new schema. You could keep a single question ID, update the schema, and keep all results under the original question ID.

Maybe I missed something from your example, thinking from the client side. It might be a problem for a client that built a dashboard expecting numeric responses, to have that question type change to text. The client's bucket for those responses wouldn't be prepared for text.

Might be safer for clients if a new question ID was created, and the old question ID was marked as deleted, and contained all the old results? Thoughts?

Nothing missing, I agree with this.

nicpottier commented 5 years ago

Mostly looks good to me. Still don't think versioning belongs in any way in the spec. I would instead change the package listing itself to define the result URL, that would let implementations imbed versions as they see fit, ie:

{
            "type": "packages",
            "id": "0c364ee1-0305-42ad-9fc9-2ec5a80c55fa",
            "result_url": "https://foo.bar/results.api?versions=all",
            "attributes": {
                "title": "Standard Test Survey (All Versions)",
                "name": "standard_test_survey",
                "created": "2015-11-26 02:59:24+00:00",
                "modified": "2017-12-04 15:54:44+00:00"
            }
},
{
            "type": "packages",
            "id": "0c364ee1-0305-42ad-9fc9-2ec5a80c55fa",
            "result_url": "https://foo.bar/results.api?version_start=1&version_end=3",
            "attributes": {
                "title": "Standard Test Survey (v1-v3)",
                "name": "standard_test_survey",
                "created": "2015-11-26 02:59:24+00:00",
                "modified": "2017-12-04 15:54:44+00:00"
            }
},

But I will stop beating that dead horse, up to you guys.

I don't think we can support new keys for different types. Maybe typing is a hint? For no other reason that invalid data is still data we would want to return in the results API, at least for warehousing purposes. IE, if you are asking for age and somebody return "old", sure that isn't numeric but it is data regardless and you would want that warehoused.

Since results are set in multiple places in a flow, the key around a result is essentially a slug in RapidPro. We can always generalize to the "broadest" type, that is to a string or something, but not being able to give any type of hint seems wrong.

I fear all these discussions are a bit moot without better use cases. Especially in light of that I really feel we should be trying to define the least possible, not the most possible. I also see it as a failure if implementations like RapidPro need to change in order to expose this API. We are pretty mature, we have lots of data and have lots of people using our API endpoints, it is unlikely that our view of the world is that far off.

That said, if someone creates a super compelling thing that reads FLOIP endpoints and does magic with it then surely we will be motivated to fall in line, so perhaps that is the missing piece.

nicpottier commented 4 years ago

So I promised Mark I would take another look at this in the context of creating a Data Studio Connector for RapidPro that would hopefully leverage FLOIP results. I've done that now and turns out that isn't a whole lot of work using just the plain RapidPro results endpoint but it has given me some more perspective as to this spec and what we think the right approach is.

After rereading the above, I think we feel very strongly that:

the goal here should be the smallest possible spec to bring utility, that eases adoption and if done right provides the right kind of flexibility
versioning should not be part of this spec in any way, that can be handled by the providers exposing separate packages if they care about versioning
packages will still describe the schema of a result with typing information per result id (as much as the provider wishes / can expose). That description does not need to be constant, it can be point in time.
the main need of versioning was to make the terse result endpoint possible, in that it returns a CSV like set of values, so we needed a contract that it always returned the same results as described in the schema. This is an overdue burden and really unnecessary, the result payload can just return a header that allows mapping of the index to the result id in the returned results.
clients should be written so that they are tolerant to the results endpoint returning either more fields or fewer fields than described in the schema, it is up to the clients how they want to deal with that.

With all the above different providers can be as strict or as loose as they want to be with regards to schemas and versioning but all clients can work the same way. That's the definition of a useful spec IMO.

Additionally there shouldn't be any "options" in that API, the above should be the entirety of the FLOIP results spec. Additions can be made as "extensions" if there is a need, but all implementations should do the above or else we will just have lots of incompatible implementations.

For RapidPro to add a FLOIP result endpoint we would need all the above, otherwise it just doesn't make sense to us. It is your spec in the end but that's our stand on it for inclusion in RapidPro. Let me know what you decide and we can go from there.

markboots commented 4 years ago

Hi Nic, thanks for your continued thinking about this. Sorry I missed this post earlier; I missed the Github notification while on holidays in November.

We're really keen to figure out a solution that allows RapidPro to be part of the Flow Results ecosystem. I don't think I fully understood your proposal and what it would mean for changes to the spec and current servers/clients.

the main need of versioning was to make the terse result endpoint possible, in that it returns a CSV like set of values, so we needed a contract that it always returned the same results as described in the schema. This is an overdue burden and really unnecessary, the result payload can just return a header that allows mapping of the index to the result id in the returned results.

What do you mean by the "terse result endpoint" here? Can you elaborate on the header idea ("the result payload can just return a header that allows mapping of the index to the result id in the returned results")?

Thanks!

nicpottier commented 4 years ago

What do you mean by the "terse result endpoint" here? Can you elaborate on the header idea ("the result payload can just return a header that allows mapping of the index to the result id in the returned results")?

I'm talking about the results looking like this:

[
                    "2015-11-26 04:33:26",
                    "11393115",
                    "10825354",
                    "47029339",
                    "1448506769745_42",
                    "Man",
                    {}
],

{
 "timestamp": "2015-11-26 04:33:26",
 "row_id": "11393115",
 "contact_id": "10825354",
 "sessoin_id": "47029339",
 "question_id": "1448506769745_42",
 "gender": "Man",
  "video": {}
}

The former is forcing a consistent schema across every result and is kind of what is getting us into this pickle of not being able to put all results and letting clients deal with extra data however they like. There really isn't any argument towards that terser format, pretty much everything in the world is gzipped on the wire and the keys in the JSON dictionary version are going to be compressed to nothing anyways.

The terser format is what is forcing us to have all this versioning stuff, because the array representation requires a set schema that matches every result perfectly. I don't think that's a realistic goal nor something we really see in the real world anyways. The schema can help interpret the results, yes, but results are going to be dirty sometimes, and contain data that may not have a good description anymore but still be valuable from a data archival perspective for example. (could it be even more valuable with a schema? sure, but we are going for lowest common denominator here)

A few updates to note as I just reread all this stuff:

things have changed a bit on the RapidPro side and we no longer have every revision of every flow going back to eternity. We've trimmed these to daily diffs. (we had some 150M rows of revisions)
we do not have revision mapping to results, IE, we can't tell you which revision of a flow a revision came from (results may have been spread across many revisions anyways)
I still feel very strongly about optional features, this should be the simplest possible thing
I still feel very strongly that there's no need for versioning to be part of the spec, server implementations that care to have strict versioning of results (IE, that decide they always want to return results of a specific version, not across all versions can do so via the ID they expose to fetch results)
I still feel very strongly that clients should be tolerant to extra result data (see above JSON representation) and can just ignore extra keys that exist that aren't in the schema they are given. That is still useful thing for some. Servers can choose never to do this, but clients must be tolerant to it.

So my proposal would be remove the idea of versioning completely from the spec. Have endpoints to get schemas, that can stay as it is. And have results endpoints return JSON dicts which clients MUST accept extra keys for. That gives servers the ability to either be super strict if they like (they can expose results endpoints that are per version if they like) or be super loose if they like, with no impact or different behavior needed by the clients.

markboots commented 4 years ago

Hi Nic, thanks for the clarification. As mentioned, I'm really keen to figure out how we can make Flow Results feasible for RapidPro, and I can appreciate the constraints here (not keeping all revisions, not having a mapping of results to revisions; results that are not strongly typed).

I've been thinking about your proposal, how it can work for the existing clients & servers, and what it means for future use-cases. Although maybe not my preference, I can get behind your proposals here:

Omitting versioning explicitly from the FR spec
Allowing extra results that are not found in the schema, and having clients be tolerant of that (recognizing the value for data archiving applications)

I think one design goal we should have is to optimize for how easy it is to build (useful) clients, given that down the road, I'd expect more client than server implementations. (e.g. maybe a dozen server platforms, possibly hundreds of apps that use the FR spec to grab results into dashboards, analytics, or for 3rd party data integrations). On considering tradeoffs, I'd go for choices that make it a bit harder to write servers, to make it easier to write clients.

Another design goal behind FR is that we should it design it to be feasibly scalable to flows with millions of result rows.

I tried to think through what this proposal would mean for a few likely use-cases of FR consumers:

1) A caching dashboard (like U-Report): Dashboards that poll an FR server regularly for recent data, and store/aggregate it on the dashboard side.

2) Data archivers Tools that poll regularly for recent data, and aim to archive it for central storage, or index it for analysis. These might store results in typed columns in a database (like the Nifi import module that Ona developed).

The one challenge I foresee with your proposal is the idea that the schema would changeable and "point in time". For clients this creates two specific cases that are challenging and would require significant extra handling:

Disappearance of questions from the schema when they re-request it for a Flow ID they already have partially ingested.
Changes to the data type / question type of a question ID they have already seen & stored.

@pld mentioned above that these situations would be problematic from Ona's perspective as well.

To try to figure out what might be feasible for RapidPro, I did a thought experiment: "If you built FR-like endpoints into RapidPro according to your proposal (with a mutable schema)... Could someone else build a 'proxying' endpoint in front that would serve Flow Results without those two issues?"

Assuming clients only accessed via the proxying endpoint... I think the answer is yes. If I was doing it, I would: 1) Cache the list of questions in the schema the first time I was asked for that Flow ID's schema 2) The next time I'm asked for the schema, get it from your live schema endpoint, and compare it with my cached questions. Re-add any deleted questions to the schema I return (marked now as "deleted"). If you are re-using question IDs that have a now-changed data type: generate a new question ID for the 'recent' version of this question, and keep a replacement mapping from your-question-id (and date of change) to replacement-question-id. 3) When serving the results endpoint, rewrite any question IDs found in the replacement mapping, after the change date.

What's feasible about this implementation is that it doesn't need access to a full history of revisions; it just needs to remember what it has previously shared with FR clients, and what has changed since clients started asking for those results. It also doesn't need to parse an entire (possibly large) result set to figure this out. If a RapidPro user makes thousands of revisions to a flow before any FR clients request that schema, then FR doesn't care about those changes.

There might be some things I'm missing here... e.g. What if there are (e.g. open-ended text) results logged in the result set from the start, that don't match within the first version of the schema shared (e.g. multiple choice with 3 discrete options). In this case, I guess, my proxying endpoint would null out their question ID, and treat them as 'unknown' data, using the way you suggested of being able to pass on results that don't match the schema.

If this goal can be achieved by a proxying endpoint in front of a "mutable schema" FR implementation, it seems possible (ideally) to write it directly inside a RapidPro implementation. It should be possible even without access to all revisions, or with a mapping from results to revisions.

With that, I think we can leave out the concept of versions entirely out of the spec, and just put in a couple rules on the ways that a schema for a Flow ID can/can't change after it is shared with FR clients.

I think providing a guarantee to clients of "no disappearing questions, no changing question types" is quite important to scalability. Without that guarantee, a client would need to check for schema changes before each FR pull... and if it saw schema changes, would need to dump and re-request the entire data set (millions of rows). This doesn't allow "sipping" only the most recent data. It creates a lot of edge cases that each client implementation would have to handle, especially those that produce dynamic dashboards depending on the question type, or those that archive data in strongly-typed DB columns. There would be some complex code to do diffs of the same schema requested over time, and figure out what to do based on what changes happened.

With the guarantee in place, clients are much easier to write: they can pull only the most recent results rows. If they ever see a new question ID they don't recognize, they can request the most recent version of the schema. None of their previously-ingested data ever "goes bad" due to breaking changes in more recent schemas of the same Flow ID.

Any other ideas for how to handle schema evolution nicely for clients, based on the 'proxy endpoint' thought experiment?

Would you be up for Flow Results that

doesn't deal with versions in the spec,
can pass extra data row data not found in the schema, but
guarantees that schemas previously served for a Flow ID never remove a question-ID or change a question-ID's type in more recent accesses of that schema?

===========

PS: A couple responses on why having the schema endpoint + "terse result" endpoint combo is useful, rather than a single endpoint that adds the metadata in each result row:

1) It keeps us compatible with the upstream Data Packages spec we chose to specialize, allowing existing tools written for Data Packages to work with Flow Results. Flow Results is a valid subset of Data Packages (just wrapped inside JSONApi if you're accessing it through an API).

2) The schema allows data collectors (e.g. M&E systems, data organization systems, data portals) to quickly answer: "What questions/topics does this data set provide?", without having to parse possibly millions of rows to find out what's available in that dataset. If I'm trying to determine: "Does this dataset have any answers for which waterpoint locations have been reported broken?", I can answer that quickly from the schema alone.

nicpottier commented 4 years ago

I think providing a guarantee to clients of "no disappearing questions, no changing question types" is quite important to scalability. Without that guarantee, a client would need to check for schema changes before each FR pull... and if it saw schema changes, would need to dump and re-request the entire data set (millions of rows).

I'm not following this. Why does a a client need to check for schema changes constantly? Seems like worst case it would need to check when it sees something it doesn't understand, and even that only if it even cares about that. (IE, maybe there's a new field but it has already been configured to show the existing fields)

Also don't understand why it would need to refetch all results, can you elaborate?

markboots commented 4 years ago

The short answer is in the discussion from @pld and me above on June 18/19:

https://github.com/FLOIP/flow-results/issues/34#issuecomment-503349788

Let me know if it would be helpful to elaborate or discuss. Thanks!

Regarding needing to refetch all results: it wouldn't necessarily need to refetch all results from the server, but it could need to reprocess/re-store/re-index all results based on the new (changed) type of the question.

nicpottier commented 4 years ago

So to be clear, RapidPro does not have strict typing of results. At best we have "hints" of what type a result may be, but we store results regardless of whether they are of that type or not.

I think to your example of what a proxy service could do, sure, given large changes and bending over backwards, RapidPro could try to expose something that satisfies this API. I don't think that is something we are interested in taking on though.

If you guys want to explore that work in a PR, we are happy to comment on it and help guide it to see if it might fit in, but at this time I think what you are requiring and what we do is too far apart given other priorities for the RapidPro core team.

markboots commented 4 years ago

To try to find a practical way forward: would you be willing to take on implementing Flow Results endpoints in release ~5.4 like we discussed previously, and ignoring the "steady-schema" requirement for now? (Effectively, it would be going ahead with your full proposal for now, but leaving the option open to improve it down the road if a way can be found.)

After that point, we or someone else might be able to attempt a PR that gets to the "steady schema" functionality that I believe is more helpful for clients. That work could be a lot more efficient if it was building on (or could reference) some existing Flow Results endpoints.

nicpottier commented 4 years ago

We may be open to that, note that unless I'm not understanding you correctly, that would also require us to return results in a dictionary key/value format as opposed to the current array format, as the array format only works with a static schema. Is that acceptable as well?

markboots commented 4 years ago

Could you provide an endpoint that provides the current array format, that is on a point-in-time basis consistent with another endpoint that provides the schema? As a starting-point goal, this has the value of achieving something that is at least point-in-time compatible with Flow Results, and works with Flow Results tools.

If you have results in the array format that don't work with any question-ids in the schema, then you could put the extra details into the Response Metadata column (array column 7), while setting Question Id (array column 5) to null.

No opposition to providing an additional endpoint in the dictionary format, if you think that's helpful.

nicpottier commented 4 years ago

I don't think an array result that could be totally different than the actual schema is useful, that seems overly fragile and isn't going to be much use to anybody, so that kind of fails the test of utility for us choosing to do that work in RapidPro.

To bring this back around, we have a result format in RapidPro already and it accomplishes what we need. There aren't currently any great benefits for us to expose a FLOIP endpoint as it doesn't introduce any new tools or capabilities that our customers can use.

We aren't going to add an API endpoint that we know is mostly broken for our use case, such as a fragile array response format. Anything we add may get built upon and then we have to guarantee compatibility for that thing moving forward, so we don't take adding new API endpoints lightly.

So while we will definitely keep track of progress on this front and be on the lookout for some big wins that we would get by adding a FLOIP endpoint, at this time I just don't think this makes sense for us.

markboots commented 4 years ago

Hi Nic,

Supporting the FLOIP endpoints as I described would allow RapidPro to be used with this software:

The libraries and tools supporting the Data Packages upstream spec: https://frictionlessdata.io/software/
Ona's Nifi module for streaming into Nifi-supported backends, or their Canopy system: https://github.com/onaio/floip-canopy
More software for dashboards and visualization that comes up as we try to build a community around this.
Ideally, the Flow Results -> Google Data Studio connector we hoped you would build as part of this project.

On that community-building process, it would be really helpful if RapidPro was a leader rather than a laggard in this data standardization process.

As mentioned before, the Flow Results schema format provides a lot of value in data organization by being able to answer: "What questions does this data set provide the answers to?", without having to load and parse an entire large dataset.

Concerns around fragility come from your request above to do a point-in-time schema, rather than the ideal "no disappearing questions, no changing question types" guarantee. The DataPackages-compatible array endpoint could become non-fragile by changing that decision :)

I know we've gone back and forth on many of these points now. I'd be eager to get on a phone call to try to get a better understanding of what parts are challenging on this.

nicpottier commented 4 years ago

We'll keep track of this going forward. At this point this isn't a feature that we are seeing any demand for (either from any of our customers or any of the other people running RapidPro) so we don't think it makes sense to add.

If the API was a bit closer to what we wanted (specifically had dict results so wasn't so fragile) then the effort for addition would be reasonable, but at this point the pros do not outweigh the cons.

We will keep track of the ecosystem though! Once we see exciting things that are useful to our customers we will definitely revisit.