OpenFn / wcs-wildmeat

Wildlife Conservation Society ConSocSci Project
1 stars 0 forks source link

Automation for Kobo-Postgresql metadata and data sync #7

Closed aleksa-krolls closed 3 years ago

aleksa-krolls commented 3 years ago

WCS Postgres Table Management Jobs

Background

WCS collects research data across several sites and surveys, which are managed by a number of staff. To ensure that all Kobo survey data can be securely and centrally captured for easier access and analysis, WCS would like to automate the process of integrating Kobo surveys and creating database tables to centrally store the data.

The specific request

Goal is to identify an automated or semi-automated process for going from a new or updated Kobo form to a Postgres set of holding tables. This will require an automated metadata and data sync between Kobo Toolbox and Postgresql.

state.json

Source: Kobo

Destination: Postgresql

adaptor

Assume the destination database will be Postgres - use language-postgresql.

expression.js

See original design notes for initial solution outline and questions. In sum, we'll need the following jobs:

(1) Metadata sync (upserts tables and columns)

  1. Job 1 creates tables and columns
  2. Job 2 (triggered by Job 1) creates/updates N jobs in OpenFn (/api/jobs/new)

(2) Data sync (upserts records)

  1. Job 3…N pulls data for specific forms and upserts those data to Postgres

(1) Metadata Integration

Job to automate the creation of the Postgres destination table based on Kobo xForm.

  1. Every X hrs, fetch form data from Kobo server to check for new or updated Kobo forms.
  2. Check whether a corresponding table already exists in the destination database.
  3. Create/update the database table accordingly.

Metadata mapping rules:

  1. Destination database table should be named using resource.name (form name -i.e., "Rural Consumption").
  2. Every destination table should have a kobo_payload JSONB column that captures the entire raw Message payload.
  3. Use Kobo question names to define the destination table column names.
  4. Any new Kobo questions added should trigger the creation of a new column.
  5. All repeat group data (determined by an array- i.e., "group_scores":[...]") should be captured in another table with a foreign key relationship to the parent table. Therefore any survey with a repeat group will map to 2 tables, one parent and one child (i.e., tbl_survey and tbl_survey_char).

(2) Data Integration

Jobs to automatically create and update the mappings from the Kobo form to the db table created in the above flow.

  1. ~~ Run this job after the Metadata Integration flow completes to ensure all destination db tables have been created/ updated. ~~This Job 2 (triggered by Job 1) creates/updates N jobs in OpenFn (/api/jobs/new). ~~
  2. ~~ The `N' jobs created should upsert the Kobo survey data in the db. ~~

Data mapping rules:

  1. Use Kobo question names to map surveys to the relevant destination column.
  2. If a uuid is needed for upserting Kobo survey data, use _id + _xform_id_string. This will ensure uniqueness across Kobo servers if WCS is hoping to connect multiple accounts._
  3. Map all repeat group questions to the child table created in job 1.
  4. Map the entire raw Message payload to the kobo_payload column created in job 1.

output.json

Fetching this Kobo Rural Consumption survey for the first time should:

  1. Create 2 Postgres destination tables: the parent tbl_wcs_rural_consumption_test and child tbl_wcs_rural_consumption_test_char for capturing repeat group data 2. Create 1 job to map the Kobo survey data to the Postgres tables created

Assume OpenFn will fetch all Kobo survey forms and submissions on a cron-timer (i.e., every 3 hours).

Toggl

Please log your time to the WCS - Automated Kobo-Postgres Sync project.

lakhassane commented 3 years ago

A bunch of questions here:

  1. How to manage columns that do not have a "classical" type value. (e.g. "type":"select_one", "type":"note"). Could we consider the type as varchar ?
  2. https://github.com/OpenFn/wcs/blob/af0e7e2ff81f95e230d26e657677f40c8ab99249/sample_data/sample-form-definition.json#L152-L220 I want to confirm that this snippet is exactly the list of columns. If it the case we can use it to map from it and get the type and constraint needed to build the table.
aleksa-krolls commented 3 years ago

@lakhassane responses...

  1. Yes, assume varchar for any string values.
  2. Yes, this does look like the list of columns... but (i) can you please this WCS Kobo form with the form-id afCPRfan8JdBt9sy9q48Jp in your development (the form you pulled is not a wcs form) , and (ii) we will also need to ensure we can map form metadata like id, _submission_time, etc. Are you able to extract the list of metadata fields from the Kobo API? (Or will we need to pre-identify these?)
lakhassane commented 3 years ago

@aleksa-krolls Those metadata fields are not extracted. So we might need to ID them.

aleksa-krolls commented 3 years ago

@lakhassane Can you please get me a list of all standard metadata fields Kobo adds to the xForm definition (i.e., _submission_time, id, uuid, _xform_id)? (Fetch an actual form submission to see these.) I can then get sign-off from the client regarding which ones are important to build into the jobs. cc @taylordowns2000

lakhassane commented 3 years ago

@aleksa-krolls I have fetched one from Kobo and have seen : version_id, start, end, device_id, _version__001, info_recall_date, _version_, __version__ today, date_constraint_calc

aleksa-krolls commented 3 years ago

@lakhassane What about other fields like _submission_time, id, uuid, _xform_id? I know these exist in every Kobo form submission. See example Kobo response to a GET request for form data here. Ideally we want all of these uploaded to the destination DB.

taylordowns2000 commented 3 years ago

From Slack:

  1. you fetch a form definition and it returns: {q1, q2, q3, etc.}
  2. there are certain metadata fields which will be part of the form submission which are NOT in the definition above.
  3. they will be found in EACH submission from Kobo, regardless of the form being used.
  4. they include things like _id, _uuid, etc. (edited)

Taylor 3 minutes ago (a) one might have {q1, q2, q3, more} (b) and the other (the submission) might have {start, q1, q2, q3, _id, _version, more} your job is to get a list of the things that appear in b but not a

Taylor 2 minutes ago here's the data url: https://kf.kobotoolbox.org/api/v2/assets/afCPRfan8JdBt9sy9q48Jp/data/?format=json to compare against the form url: https://kf.kobotoolbox.org/api/v2/assets/afCPRfan8JdBt9sy9q48Jp/