Closed mtwestra closed 9 years ago
Some thoughts about implementation: One option could be to use Xpath to do this. This would mean using xml as storage method for the hierarchy, and Xpath logic in Android to find items.
The reason why this is interesting is that in other issues, such as the formula-based display and validation logic, Xpath could also be a good solution, because Android has build in support for that. This means that a lot of functionality, such as date-time functions, str-number conversions etc, are available out of the box.
Relevant documents (via @iperdomo):
Comments on the technical specifications provided so far:
We can use this format for uploading admin units: https://taroworks.zendesk.com/hc/en-us/articles/200167369-Creating-and-Uploading-an-Options-List-for-Cascading-Selects-Questions
What we (@ichinaski, @iperdomo, @muloem )agreed: 1) @ichinaski will put the French admin boundaries (level 5) in a memory structure in Android to see how it behaves. 2) We then decide on the serialisation formation in s3 3) We will try to put the cascade node data in a kind, and create a web interface for manipulating it. This will have the benefit that parts of the data can be edited, without having to download the whole blob to the browser each time an edit is made.
Just realised that Google App Engine doesn't allow LIKE queries, so it might be good to use a simple kind with a parent, and not use materialised views. Sqlite, on the other hand, does support LIKE queries.
Materialised path storage of nodes:
Name path (int) ancestor path (string)
Spain 1 ""
dist1 1 1.
subregion1 1 1.1
subregion2 2 1.1
dist2 2 1.
subregion 1 1 1.2
dist3 3 1.
dist4 4 1.
To find all subregions under dist1: SELECT name from path where ancPath = 1.1
The full materialised path of a node is ancPath + path
To store cascading question resources, we need two different GAE kinds: 1) Storing the resource identifier and name, with a version: cascadeResource { String name, int version, }
2) Storing the individual nodes: cascadeNode { String name int path, String ancPath }
A cascading resource is a separate entity, which can be used by multiple questions in different forms
A cascading resource should have a 'publish' button. When this is clicked, the sqlite database file is created, with a version number, and stored at S3. If new versions are created, the old ones remain available for earlier forms. A specific form uses a specific version of the cascading resource. If you have updated the resource and want forms to use it, you have to select it in the form question, and republish the form.
We need a service on GAE that reads a csv file and populates the cascading structure in GAE. In that way, we don't need to store all the data in each instance, but can load on demand. The data is copied into the GAE datastore, which means that it can be then adapted in the dashboard, for example to add additional levels, or to change or delete certain items. In addition, this can be used to parse user-contributed hiearchies. It might also be a good idea to also offer export to csv file later.
This could perhaps be implemented by having a service on FLOW services that uses the remote API to put values in the datastore. In addition, a service on flow services could handle creation of the sqlite database based on the data in GAE.
For each node, the following editing should be possible:
In addition, we should have:
To create new nodes on a certain level, we should have an edit button that will both accept single items and lists of items separated by newlines. If a list of items is entered, it splits them up into multiple items on that level.
A question is how we can make sure that the system can use the fact that these hiearchies refer to specific geographical locations. It would be nice if when te data is filtered on a subregion, for example, the system can also display the outline of that subregion. Is there a way to keep that link? Perhaps there is a numbering scheme, some kind of unique id for each admin area?
If we standardise on the GADM database, this contains identifiers. So potentially, we could give a 'label' to each node that we read in, which contains the level (0...5) and the GADM-id. This can then be used later to display geographical boundaries.
In addition, we need to think about how this relates to the possibility to reverse georeference points. Perhaps views (such as reports, filters) could work on both: the automatically georeferenced fields, or the cascading question results.
For publishing, an alternative could be to let GAE write a csv file with the materialised path elements as it has in its datastore to s3, and then fire a task to flowservices which has the job of converting it to an sqlite datastore.
One additional comment on the S3 usage: The app can use the published database MD5 hash to decide when to re-fetch the resource. Not that I'm saying anything new here, but just to point out an added benefit of the serialization approach.
good point
First mockup of dashboard part: https://www.dropbox.com/s/2kxzg1j5ingw3dp/cascade.pdf?dl=0
Update on the serialization approach status: While attempting to implement the model we defined the previous day (https://github.com/akvo/akvo-product-design/issues/33#issuecomment-57792076), I realized we don't even need to store the concatenated values of the ancestors, as each row will have a unique row id, thus we can simply use such value to determine the parent of a node.
So, a node in the tree is represented by a unique id
, and its value
. Additionally, the reference to the parent (ancestor node) is just an id
as well, which is in essence the parent's node id
.
This way we get to store ancestors paths in a more compact way, directly referencing a row id in the table.
This is what it looks like:
Issuing queries is still the same process:
Using this approach, the whole France dataset takes up 2.1MB, and when compressed, 1.1MB
@ichinaski, sounds good! And what about the query performance?
cheers, mark On 6 Oct 2014, at 16:15, Iñigo notifications@github.com wrote:
Update on the serialization approach status: While attempting to implement the model we defined the previous day (#33 (comment)), I realized we don't even need to store the concatenated values of the ancestors, as each row will have a unique row id, thus we can simply use such value to determine the parent of a node.
So, a node in the tree is represented by a unique id, and its value. Additionally, the reference to the parent (ancestor node) is just an id as well, which is in essence the parent's node id.
This way we get to store ancestors paths in a more compact way, directly referencing a row id in the table.
This is what it looks like:
Issuing queries is still the same process:
Using this approach, the whole France dataset takes up 2.1MB, and when compressed, 1.1MB
— Reply to this email directly or view it on GitHub.
@mtwestra The queries are merely looking up an indexed column value, so they are as fast as they can get.
@ichinaski, wow, nice work indeed! mark
On 6 Oct 2014, at 16:23, Iñigo notifications@github.com wrote:
@mtwestra The queries are merely looking up an indexed column value, so they are as fast as they can get.
— Reply to this email directly or view it on GitHub.
Additional features needed: 1) a code per item (just like the option questions) 2) a 'name of admin boundary', such as 'Region', 'District', etc
Serialization will be format: sqlite database Device downloads database from s3 Example: French boundaries data up to 5 levels takes 2 Mb in sqlite, 1Mb in zipped version Need question type: cascading question, where we have a title, and a url for the data base The cascading question database will be a resource that can be used by multiple questions
Admin boundary names: in an additional table in the sqlite database Needed on backend: 1) new question type 2) upload of excel file by users - could be handled manually first 3) UI to create structures. 4) how to create and serialize the sqlite database on flow services
Serialization proces:
Similar to video / image tooltip
For the codes - only present in the backend.
Serialization will be format: sqlite database
As discussed with @mtwestra, for the moment being the following naming conventions will be used:
A first attempt at the GAE - flow services interactions:
Different workflows: 1a) user creates a cascade structure entirely on the dashboard 1b) user uploads an excel file which is then turned into a cascade structure in the dashboard 1c) user selects an existing (csv) file provided by Akvo, which is then turned into a cascade structure in the dashboard, where it can be edited / appended to
2) When the user is happy with the cascade structure in the dashboard, the user clicks 'publish' 3) flow services constructs a new csv file from the cascade information in GAE, similar to the raw data report 4) flow services constructs a sqlite database from the csv file, and uploads it to S3
You will notice that in the case of 1c, this means the the csv file makes a round trip. However, this is needed as the user can then use the cascade structure provided by Akvo as a template, which can be edited and added to.
The reason to let flow services construct a csv file is that this makes it easier to share these to other instances, as they will have the same format as the csv files that are uploaded.
If the CSV is generated in flow-services, what's the serialization format used to transmit the data from GAE to flow-services?
Also, we need to figure out a reliable workflow between GAE <--> flow-services, based on the status of the publishing to S3, DB generation, etc. Will flow-services need to report back to GAE of successful/failed publications?
2) When the user is happy with the cascade structure in the dashboard, the user clicks 'publish'
Are we supporting versions of a particular cascading dataset ? or we'll just delegate that to S3 ?
If the CSV is generated in flow-services, what's the serialization format used to transmit the data from GAE to flow-services?
A question will be how to assemble the csv in the flow services, as the data store call will probably simply get all nodes for a given cascade resource. Reestablishing the tree structure will need to be done in flow services
@iperdomo I think we should keep track of published versions, so we can put the version of the resource in the url name for the device. In that way, the device can check if it needs to redownload the resource, even if it is redownloading a specific survey. The resource download will potentially be expensive due to the size.
the device can check if it needs to redownload the resource, even if it is redownloading a specific survey. The resource download will potentially be expensive due to the size.
This is simple checksum/etag check.
@iperdomo, Ah, that is a good idea. So you can get the checksum before download, and decide based on that?
On 28 Oct 2014, at 17:35, Iván Perdomo notifications@github.com wrote:
the device can check if it needs to redownload the resource, even if it is redownloading a specific survey. The resource download will potentially be expensive due to the size.
This is simple checksum/etag check.
— Reply to this email directly or view it on GitHub.
@mtwestra yeap, it will be 2 HTTP requests though ... a HEAD
and you can get the etag
for the object, as an HTTP Header, then if different, you can make the GET
.
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html#sec9.4
The HEAD
approach @iperdomo describes is exactly what I had I mind too: https://github.com/akvo/akvo-product-design/issues/33#issuecomment-57795197
In GAE we store a cascade resource as a tree consisting of nodes. Each node is stored like this:
Node: {id, name, code, parentId, cascadeId}
The id's are generated by Google App Engine, and are therefore random integers of 16 digits, or in the legacy case, shorter integers
Root elements have parentId = 0
In the sqlite file, we store all elements like this:
{id, value, parent}
The ideal case is to have the ids consecutive and unique, as this will require the least storage space. However, as a start, we could simply use the same ids as in GAE, as this will simplify the generation of the sqlite file.
In that case, the process looks like this:
1) GAE notifies flow services that it should create a cascade sqlite file. This will use a call:
http://services.akvoflow.org/publish-cascade
with in the body:
"baseURL":"http://xxxxx.akvoflow.org", "cascadeResourceId": 4837483}
2) FLOW Services opens a temporary sqlite database:
f47ac10b-58cc-4372-a567-0e02b2c3d479.sqlite
with a single table: nodes
3) FLOW services uses the remote api to connect to GAE
4) FLOW services reads trough the nodes it reads from GAE, and puts them in the sqlite database node table.
Node: {id, name, code, parentId, cascadeId} is mapped to sqlite row: {id, name, parentId}
5) FLOW services reads the header titles from the CascadeResource using the Remote API
6) FLOW servivces closes the sqlite database and publishes it to S3, in the /surveys folder, with the name:
cascadeResourceID.sqlite
This should be sufficient to get the system working for manually entered lists. Later, we can add the generation of a csv file based on the sqlite data, with the data in a 'normal form':
France, ÃŽle-de-France, Seine-et-Marne, Fontainebleau, La Chapelle-la-Reine, Ury
France, ÃŽle-de-France, Seine-et-Marne, Fontainebleau, La Chapelle-la-Reine, Le VaudouÃ
France, ÃŽle-de-France, Seine-et-Marne, Fontainebleau, La Chapelle-la-Reine, Villiers-sous-Grez
France, Île-de-France, Seine-et-Marne, Fontainebleau, Château-Landon,Arville
France, Île-de-France, Seine-et-Marne, Fontainebleau, Château-Landon, Aufferville
I agree with the general workflow, looks like a good start.
That said, I don't particularly like the proposed datastore IDs usage. Why is this reference necessary? Why not just rely on an autoincremental value on the SQLite db?
@ichinaski well, in this way it is a one-pass job. As the nodes are not ordered in any way, we can't simply insert them and give them a new id, as other items that might come later would also need to be 'translated' in that way.
Of course it is certainly possible to do a second pass, or keep in memory a 'translation table' that is used to create better ids.
I still fail to see that dependency requirement. I see two ways to keep track of other row's paths:
Note: the first approach is more efficient probably, but it was quicker to implement a POC with the second one.
Yes, that could certainly work. ad 1) this is not streaming of course - you need to have the whole translation map in memory. But as this is just integers, that should not be a problem ad 2) yes, this will be slower I guess
But the main question is: is this premature optimalisation, or something that we should do from the start?
approach 2) is not that slow. It's just more inefficient than the first one. If you run the script with the France data, you'll probably see the DB is generated in ~1-3 seconds. I guess this falls under the accepted estimated time, doesn't it?
Being a small change, I don't consider either approach a premature optimization. It'll take the same time to implement as the original approach.
The header names will be part of the survey XML, so we can incorporate translations. They will therefore not be part of the sqlite file
First set of designs here: https://www.dropbox.com/s/owaz0zoizt3s58m/33-Cascading_questions-WF-V1.pdf?dl=0
Still to do is all the import part.
Updated the pdf to reflect some feedback from @mtwestra.
Survey XML serialization, with cascading question support:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<survey name="Fake survey (Do not edit/publish!)" defaultLanguageCode="en" version='2.0' surveyGroupId="15753002" surveyGroupName="Cascading test">
<questionGroup>
<heading>Group #1</heading>
<question order="1" type="free" mandatory="false" localeNameFlag="true" id="19113002">
<text>Name</text>
</question>
<question order="2" type="geo" mandatory="false" localeNameFlag="false" localeLocationFlag="true" id="18693003">
<text>Location</text>
</question>
<question order="3" type="cascade" mandatory="false" src="cascading-00001-v1.sqlite" id="18693003">
<text>Location</text>
<levels>
<level>
<text>Country</text>
<altText type="translation" language="es">País</altText>
</level>
<level>
<text>Region</text>
<altText type="translation" language="es">Región</altText>
</level>
<level>
<text>Department</text>
<altText type="translation" language="es">Departamento</altText>
</level>
<level>
<text>District</text>
<altText type="translation" language="es">Distrito</altText>
</level>
<level>
<text>Canton</text>
<altText type="translation" language="es">Cantón</altText>
</level>
<level>
<text>Commune</text>
<altText type="translation" language="es">Comunidad</altText>
</level>
</levels>
</question>
</questionGroup>
</survey>
The Resource ID in the xml will be serialised like this: cascading-548758475-v3.sqlite
Overview
The aim is to introduce a question type that can handle hierarchical options. The main purpose of this question type is to enable the enumerator to specify the administrative boundaries of a specific location, such as the Region, District, sub-district, and community. This is usually called 'Cascading questions'
When accurate shapefiles are available for the administrative boundaries, GPS could in theory be used to determine this. However, shapefile are often inaccurate (mainly because of regular new devisions of administrative regions), and often don't go down to the required level (eg. community.)
The functionality should include:
Linked to https://github.com/akvo/akvo-flow/issues/189
Links to documents
https://github.com/akvo/akvo-product-design/blob/master/FLOW/Features/33-HierarchicalQuestionType/FunctionalDesign.md https://github.com/akvo/akvo-product-design/blob/master/FLOW/Features/33-HierarchicalQuestionType/VisualDesign.md https://github.com/akvo/akvo-product-design/blob/master/FLOW/Features/33-HierarchicalQuestionType/TechnicalDesign.md