Cascading question type

mtwestra commented 10 years ago

Overview

The aim is to introduce a question type that can handle hierarchical options. The main purpose of this question type is to enable the enumerator to specify the administrative boundaries of a specific location, such as the Region, District, sub-district, and community. This is usually called 'Cascading questions'

When accurate shapefiles are available for the administrative boundaries, GPS could in theory be used to determine this. However, shapefile are often inaccurate (mainly because of regular new devisions of administrative regions), and often don't go down to the required level (eg. community.)

The functionality should include:

A way for a user to specify a hierarchy of options. This needs to be done separate from the question, as the hierarchy might be used again. It is a resource for the instance.
A way for a user to create a hierarchy question and select the hierarchy created.
A way for the enumerator to interact with the hierarchy question and select the required values
A way for the resulting data to be stored and meaningfully used and exported in reports

Linked to https://github.com/akvo/akvo-flow/issues/189

Links to documents

https://github.com/akvo/akvo-product-design/blob/master/FLOW/Features/33-HierarchicalQuestionType/FunctionalDesign.md https://github.com/akvo/akvo-product-design/blob/master/FLOW/Features/33-HierarchicalQuestionType/VisualDesign.md https://github.com/akvo/akvo-product-design/blob/master/FLOW/Features/33-HierarchicalQuestionType/TechnicalDesign.md

mtwestra commented 10 years ago

Some thoughts about implementation: One option could be to use Xpath to do this. This would mean using xml as storage method for the hierarchy, and Xpath logic in Android to find items.

The reason why this is interesting is that in other issues, such as the formula-based display and validation logic, Xpath could also be a good solution, because Android has build in support for that. This means that a lot of functionality, such as date-time functions, str-number conversions etc, are available out of the box.

ichinaski commented 10 years ago

Relevant documents (via @iperdomo):

Trees in SQL: Nested Sets and Materialized Path: https://communities.bmc.com/docs/DOC-9902
Materialized path SO question: https://stackoverflow.com/questions/7300896/select-all-rows-that-have-id-in-materialized-path

ichinaski commented 10 years ago

Comments on the technical specifications provided so far:

As for tree-like data serialization in JSON, http://www.graphjson.org/ provides a standardized way to represent graph information in JSON. This can be used in multiple points in the system we are designing: a) as a storage format in the backend, b) as a serialization format between the backend and the client[s], c) potentially even the client[s] could use this format to store the data if a database turns out not to be suitable.
To store the serialized data (be it JSON, or whatever, let's just think of it as an abstract blob), we should consider the usage of S3. Both the backend and the app are fully capable of dealing with S3-stored files. Having a single link in the survey definition (XML) will keep the file small, allowing the device to fetch the cascading question data asynchronously. This is in fact the same design large question attachments (image and video tooltips) have at the moment.
Although the materialized-path querying approach described in the document is an interesting solution for this problem, and probably the most efficient one when it comes to huge datasets, I wouldn't rule out the in-memory graph-like data structure yet. Given that both the backend and the app are built in Java, we could leverage this property and deploy a common solution for them. This data structure could even be populated with the data blob, hugely simplifying the implementation of the whole feature. Note that I'm not suggesting the usage of this approach over the SQL-based one, I'm just trying to give another point of view. The performance, and even the viability of the in-memory approach is subject to the dataset size, and sets larger than 5-10 MB are obviously something we should not allocate in the RAM.

mtwestra commented 10 years ago

We can use this format for uploading admin units: https://taroworks.zendesk.com/hc/en-us/articles/200167369-Creating-and-Uploading-an-Options-List-for-Cascading-Selects-Questions

What we (@ichinaski, @iperdomo, @muloem )agreed: 1) @ichinaski will put the French admin boundaries (level 5) in a memory structure in Android to see how it behaves. 2) We then decide on the serialisation formation in s3 3) We will try to put the cascade node data in a kind, and create a web interface for manipulating it. This will have the benefit that parts of the data can be edited, without having to download the whole blob to the browser each time an edit is made.

mtwestra commented 10 years ago

Just realised that Google App Engine doesn't allow LIKE queries, so it might be good to use a simple kind with a parent, and not use materialised views. Sqlite, on the other hand, does support LIKE queries.

mtwestra commented 10 years ago

Materialised path storage of nodes:

Name                    path (int)  ancestor path (string)
Spain                   1           ""
    dist1               1           1.
        subregion1      1           1.1
        subregion2      2           1.1
    dist2               2           1.
        subregion 1     1           1.2
    dist3               3           1.
    dist4               4           1.

To find all subregions under dist1: SELECT name from path where ancPath = 1.1

The full materialised path of a node is ancPath + path

To store cascading question resources, we need two different GAE kinds: 1) Storing the resource identifier and name, with a version: cascadeResource { String name, int version, }

2) Storing the individual nodes: cascadeNode { String name int path, String ancPath }

A cascading resource is a separate entity, which can be used by multiple questions in different forms

A cascading resource should have a 'publish' button. When this is clicked, the sqlite database file is created, with a version number, and stored at S3. If new versions are created, the old ones remain available for earlier forms. A specific form uses a specific version of the cascading resource. If you have updated the resource and want forms to use it, you have to select it in the form question, and republish the form.

We need a service on GAE that reads a csv file and populates the cascading structure in GAE. In that way, we don't need to store all the data in each instance, but can load on demand. The data is copied into the GAE datastore, which means that it can be then adapted in the dashboard, for example to add additional levels, or to change or delete certain items. In addition, this can be used to parse user-contributed hiearchies. It might also be a good idea to also offer export to csv file later.

This could perhaps be implemented by having a service on FLOW services that uses the remote API to put values in the datastore. In addition, a service on flow services could handle creation of the sqlite database based on the data in GAE.

For each node, the following editing should be possible:

delete node (and all subnodes)
use as root (which deletes all parents)
edit name

In addition, we should have:

create new node

To create new nodes on a certain level, we should have an edit button that will both accept single items and lists of items separated by newlines. If a list of items is entered, it splits them up into multiple items on that level.

A question is how we can make sure that the system can use the fact that these hiearchies refer to specific geographical locations. It would be nice if when te data is filtered on a subregion, for example, the system can also display the outline of that subregion. Is there a way to keep that link? Perhaps there is a numbering scheme, some kind of unique id for each admin area?

If we standardise on the GADM database, this contains identifiers. So potentially, we could give a 'label' to each node that we read in, which contains the level (0...5) and the GADM-id. This can then be used later to display geographical boundaries.

In addition, we need to think about how this relates to the possibility to reverse georeference points. Perhaps views (such as reports, filters) could work on both: the automatically georeferenced fields, or the cascading question results.

mtwestra commented 10 years ago

For publishing, an alternative could be to let GAE write a csv file with the materialised path elements as it has in its datastore to s3, and then fire a task to flowservices which has the job of converting it to an sqlite datastore.

ichinaski commented 10 years ago

One additional comment on the S3 usage: The app can use the published database MD5 hash to decide when to re-fetch the resource. Not that I'm saying anything new here, but just to point out an added benefit of the serialization approach.

mtwestra commented 10 years ago

good point

mtwestra commented 10 years ago

First mockup of dashboard part: https://www.dropbox.com/s/2kxzg1j5ingw3dp/cascade.pdf?dl=0

ichinaski commented 10 years ago

Update on the serialization approach status: While attempting to implement the model we defined the previous day (https://github.com/akvo/akvo-product-design/issues/33#issuecomment-57792076), I realized we don't even need to store the concatenated values of the ancestors, as each row will have a unique row id, thus we can simply use such value to determine the parent of a node.

So, a node in the tree is represented by a unique id, and its value. Additionally, the reference to the parent (ancestor node) is just an id as well, which is in essence the parent's node id.

This way we get to store ancestors paths in a more compact way, directly referencing a row id in the table.

This is what it looks like:

Issuing queries is still the same process:

Using this approach, the whole France dataset takes up 2.1MB, and when compressed, 1.1MB

mtwestra commented 10 years ago

@ichinaski, sounds good! And what about the query performance?

cheers, mark On 6 Oct 2014, at 16:15, Iñigo notifications@github.com wrote:

Update on the serialization approach status: While attempting to implement the model we defined the previous day (#33 (comment)), I realized we don't even need to store the concatenated values of the ancestors, as each row will have a unique row id, thus we can simply use such value to determine the parent of a node.

So, a node in the tree is represented by a unique id, and its value. Additionally, the reference to the parent (ancestor node) is just an id as well, which is in essence the parent's node id.

This way we get to store ancestors paths in a more compact way, directly referencing a row id in the table.

This is what it looks like:

Issuing queries is still the same process:

Using this approach, the whole France dataset takes up 2.1MB, and when compressed, 1.1MB

— Reply to this email directly or view it on GitHub.

ichinaski commented 10 years ago

@mtwestra The queries are merely looking up an indexed column value, so they are as fast as they can get.

mtwestra commented 10 years ago

@ichinaski, wow, nice work indeed! mark

On 6 Oct 2014, at 16:23, Iñigo notifications@github.com wrote:

@mtwestra The queries are merely looking up an indexed column value, so they are as fast as they can get.

— Reply to this email directly or view it on GitHub.

mtwestra commented 10 years ago

Additional features needed: 1) a code per item (just like the option questions) 2) a 'name of admin boundary', such as 'Region', 'District', etc

mtwestra commented 10 years ago

Serialization will be format: sqlite database Device downloads database from s3 Example: French boundaries data up to 5 levels takes 2 Mb in sqlite, 1Mb in zipped version Need question type: cascading question, where we have a title, and a url for the data base The cascading question database will be a resource that can be used by multiple questions

Admin boundary names: in an additional table in the sqlite database Needed on backend: 1) new question type 2) upload of excel file by users - could be handled manually first 3) UI to create structures. 4) how to create and serialize the sqlite database on flow services

Serialization proces:

GAE stores the tree nodes.
You click publish on the tree
GAE creates a CSV file and stores it somewhere
Flow services transforms the csv into the database and stores it in S3
Device knows where to pick up the database

Similar to video / image tooltip

For the codes - only present in the backend.

iperdomo commented 10 years ago

Serialization will be format: sqlite database

https://www.sqlite.org/appfileformat.html

ichinaski commented 10 years ago

As discussed with @mtwestra, for the moment being the following naming conventions will be used:

The question type will be cascade.
The XML question tag will include an attribute src, which indicates the location of the question data (sqlite database).

mtwestra commented 10 years ago

A first attempt at the GAE - flow services interactions:

Different workflows: 1a) user creates a cascade structure entirely on the dashboard 1b) user uploads an excel file which is then turned into a cascade structure in the dashboard 1c) user selects an existing (csv) file provided by Akvo, which is then turned into a cascade structure in the dashboard, where it can be edited / appended to

2) When the user is happy with the cascade structure in the dashboard, the user clicks 'publish' 3) flow services constructs a new csv file from the cascade information in GAE, similar to the raw data report 4) flow services constructs a sqlite database from the csv file, and uploads it to S3

You will notice that in the case of 1c, this means the the csv file makes a round trip. However, this is needed as the user can then use the cascade structure provided by Akvo as a template, which can be edited and added to.

The reason to let flow services construct a csv file is that this makes it easier to share these to other instances, as they will have the same format as the csv files that are uploaded.

ichinaski commented 10 years ago

If the CSV is generated in flow-services, what's the serialization format used to transmit the data from GAE to flow-services?

Also, we need to figure out a reliable workflow between GAE <--> flow-services, based on the status of the publishing to S3, DB generation, etc. Will flow-services need to report back to GAE of successful/failed publications?

iperdomo commented 10 years ago

2) When the user is happy with the cascade structure in the dashboard, the user clicks 'publish'

Are we supporting versions of a particular cascading dataset ? or we'll just delegate that to S3 ?

iperdomo commented 10 years ago

If the CSV is generated in flow-services, what's the serialization format used to transmit the data from GAE to flow-services?

Remote API

mtwestra commented 10 years ago

A question will be how to assemble the csv in the flow services, as the data store call will probably simply get all nodes for a given cascade resource. Reestablishing the tree structure will need to be done in flow services

@iperdomo I think we should keep track of published versions, so we can put the version of the resource in the url name for the device. In that way, the device can check if it needs to redownload the resource, even if it is redownloading a specific survey. The resource download will potentially be expensive due to the size.

iperdomo commented 10 years ago

the device can check if it needs to redownload the resource, even if it is redownloading a specific survey. The resource download will potentially be expensive due to the size.

This is simple checksum/etag check.

mtwestra commented 10 years ago

@iperdomo, Ah, that is a good idea. So you can get the checksum before download, and decide based on that?

On 28 Oct 2014, at 17:35, Iván Perdomo notifications@github.com wrote:

the device can check if it needs to redownload the resource, even if it is redownloading a specific survey. The resource download will potentially be expensive due to the size.

This is simple checksum/etag check.

— Reply to this email directly or view it on GitHub.

iperdomo commented 10 years ago

@mtwestra yeap, it will be 2 HTTP requests though ... a HEAD and you can get the etag for the object, as an HTTP Header, then if different, you can make the GET.

iperdomo commented 10 years ago

The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html#sec9.4

ichinaski commented 10 years ago

The HEAD approach @iperdomo describes is exactly what I had I mind too: https://github.com/akvo/akvo-product-design/issues/33#issuecomment-57795197

mtwestra commented 10 years ago

In GAE we store a cascade resource as a tree consisting of nodes. Each node is stored like this:

Node: {id, name, code, parentId, cascadeId}

The id's are generated by Google App Engine, and are therefore random integers of 16 digits, or in the legacy case, shorter integers

Root elements have parentId = 0

In the sqlite file, we store all elements like this:

{id, value, parent}

The ideal case is to have the ids consecutive and unique, as this will require the least storage space. However, as a start, we could simply use the same ids as in GAE, as this will simplify the generation of the sqlite file.

In that case, the process looks like this:

1) GAE notifies flow services that it should create a cascade sqlite file. This will use a call:

http://services.akvoflow.org/publish-cascade

with in the body:

"baseURL":"http://xxxxx.akvoflow.org", "cascadeResourceId": 4837483}

2) FLOW Services opens a temporary sqlite database:

f47ac10b-58cc-4372-a567-0e02b2c3d479.sqlite

with a single table: nodes

3) FLOW services uses the remote api to connect to GAE

4) FLOW services reads trough the nodes it reads from GAE, and puts them in the sqlite database node table.

Node: {id, name, code, parentId, cascadeId} is mapped to sqlite row: {id, name, parentId}

5) FLOW services reads the header titles from the CascadeResource using the Remote API

6) FLOW servivces closes the sqlite database and publishes it to S3, in the /surveys folder, with the name:

cascadeResourceID.sqlite

This should be sufficient to get the system working for manually entered lists. Later, we can add the generation of a csv file based on the sqlite data, with the data in a 'normal form':

France, ÃŽle-de-France,   Seine-et-Marne, Fontainebleau,  La Chapelle-la-Reine,   Ury
France, ÃŽle-de-France,   Seine-et-Marne, Fontainebleau,  La Chapelle-la-Reine,   Le VaudouÃ
France, ÃŽle-de-France,   Seine-et-Marne, Fontainebleau,  La Chapelle-la-Reine,   Villiers-sous-Grez
France, ÃŽle-de-France,   Seine-et-Marne, Fontainebleau,  ChÃ¢teau-Landon,Arville
France, ÃŽle-de-France,   Seine-et-Marne, Fontainebleau,  ChÃ¢teau-Landon, Aufferville

ichinaski commented 10 years ago

I agree with the general workflow, looks like a good start.

That said, I don't particularly like the proposed datastore IDs usage. Why is this reference necessary? Why not just rely on an autoincremental value on the SQLite db?

mtwestra commented 10 years ago

@ichinaski well, in this way it is a one-pass job. As the nodes are not ordered in any way, we can't simply insert them and give them a new id, as other items that might come later would also need to be 'translated' in that way.

Of course it is certainly possible to do a second pass, or keep in memory a 'translation table' that is used to create better ids.

ichinaski commented 10 years ago

I still fail to see that dependency requirement. I see two ways to keep track of other row's paths:

Create a GAE <--> SQLite map while parsing the data, using newly created, GAE independent IDs
Rely on the autoincremental SQLite values, and before parsing/inserting a new row, execute a DB query, looking for the ancestor IDs already present in the DB. This is the approach used in the Python script I shared: https://gist.github.com/ichinaski/5ca9374304bc5df8933d#file-serialize-py-L46-L53

Note: the first approach is more efficient probably, but it was quicker to implement a POC with the second one.

mtwestra commented 10 years ago

Yes, that could certainly work. ad 1) this is not streaming of course - you need to have the whole translation map in memory. But as this is just integers, that should not be a problem ad 2) yes, this will be slower I guess

But the main question is: is this premature optimalisation, or something that we should do from the start?

ichinaski commented 10 years ago

approach 2) is not that slow. It's just more inefficient than the first one. If you run the script with the France data, you'll probably see the DB is generated in ~1-3 seconds. I guess this falls under the accepted estimated time, doesn't it?

Being a small change, I don't consider either approach a premature optimization. It'll take the same time to implement as the original approach.

mtwestra commented 10 years ago

The header names will be part of the survey XML, so we can incorporate translations. They will therefore not be part of the sqlite file

loicsans commented 10 years ago

First set of designs here: https://www.dropbox.com/s/owaz0zoizt3s58m/33-Cascading_questions-WF-V1.pdf?dl=0

Still to do is all the import part.

loicsans commented 10 years ago

Updated the pdf to reflect some feedback from @mtwestra.

ichinaski commented 10 years ago

Survey XML serialization, with cascading question support:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<survey name="Fake survey (Do not edit/publish!)" defaultLanguageCode="en" version='2.0' surveyGroupId="15753002" surveyGroupName="Cascading test">
    <questionGroup>
        <heading>Group #1</heading>
        <question order="1" type="free" mandatory="false" localeNameFlag="true" id="19113002">
            <text>Name</text>
        </question>
        <question order="2" type="geo" mandatory="false" localeNameFlag="false" localeLocationFlag="true" id="18693003">
            <text>Location</text>
        </question>
        <question order="3" type="cascade" mandatory="false" src="cascading-00001-v1.sqlite" id="18693003">
            <text>Location</text>
            <levels>
                <level>
                    <text>Country</text>
                    <altText type="translation" language="es">País</altText>
                </level>
                <level>
                    <text>Region</text>
                    <altText type="translation" language="es">Región</altText>
                </level>
                <level>
                    <text>Department</text>
                    <altText type="translation" language="es">Departamento</altText>
                </level>
                <level>
                    <text>District</text>
                    <altText type="translation" language="es">Distrito</altText>
                </level>
                <level>
                    <text>Canton</text>
                    <altText type="translation" language="es">Cantón</altText>
                </level>
                <level>
                    <text>Commune</text>
                    <altText type="translation" language="es">Comunidad</altText>
                </level>
            </levels>
        </question>
    </questionGroup>
</survey>

mtwestra commented 10 years ago

The Resource ID in the xml will be serialised like this: cascading-548758475-v3.sqlite

akvo / akvo-product-design

Cascading question type #33

Overview

Links to documents