EcoNet-NZ / inaturalist-to-cams

Synchronises observations from iNaturalist to the CAMS Weed App
Apache License 2.0

iNaturalist to CAMS synchroniser

This repository contains a scheduled workflow, configuration and code to synchronise iNaturalist observations to the CAMS Weed App (running on the ArcGIS Online platform).

Overview

The CAMS Weed App enables ongoing monitoring and control of weeds, using different colours and shapes to show the current status of each weed patch. The status is periodically reset to "Purple - please check", and is then updated as each patch is checked:

[Image: map showing weed status symbols]

The status of each observation can be updated by adding the observation to the Weed Management Aotearoa NZ iNaturalist project and setting the observation fields. See the user guide for instructions.

The code is intended to be scheduled to run regularly, e.g. hourly, picking up new and updated observations from iNaturalist. The updates to CAMS are idempotent, so the synchronisation can be rerun without creating duplicate CAMS records. Only new or updated observations containing changes that we are interested in are picked up.
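The idempotent behaviour described above can be sketched as follows. This is only an illustration under stated assumptions: the in-memory dictionary `cams`, the function `apply_update`, and the field and status names are hypothetical stand-ins, not the project's actual CAMS interface.

```python
def apply_update(cams: dict, observation_id: str, fields: dict) -> str:
    """Write `fields` for an observation only if they differ from what is stored.

    `cams` is a hypothetical in-memory stand-in for the CAMS feature layer.
    """
    current = cams.get(observation_id)
    if current == fields:
        return "unchanged"  # rerun with identical data: nothing is written
    cams[observation_id] = dict(fields)
    return "created" if current is None else "updated"


cams = {}
first = apply_update(cams, "obs-1", {"status": "Red"})    # hypothetical status
second = apply_update(cams, "obs-1", {"status": "Red"})   # same data again
third = apply_update(cams, "obs-1", {"status": "Green"})  # a real change
print(first, second, third)
```

Rerunning with unchanged data leaves the store untouched, which is the property that makes a repeated synchronisation safe.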

We keep this log of all observations and updates that have been synchronised.

The iNaturalist observations are selected based on taxon and place (e.g. old man's beard in Wellington). Each matching iNaturalist observation creates a new Feature in CAMS, with a parent WeedLocation record and a child WeedVisit record. Updates to the iNaturalist observation may create additional WeedVisit records, dependent on what caused the update, for example:

sequenceDiagram
Actor User
participant iNat as iNaturalist Observation
participant CAMS as CAMS Feature
User->>iNat: New observation
iNat->>CAMS: New feature (WeedLocation and WeedVisit)
Note right of iNat: When synchroniser runs
Note right of User: Sometime later
User->>iNat: Observation identification added
Note right of iNat: No update to CAMS needed
Note right of User: Sometime later
User->>iNat: Treated ? = Yes (on same observation)
iNat->>CAMS: New WeedVisit record added to feature
Note right of iNat: When synchroniser runs
Note right of User: Sometime later
User->>iNat: Status update (on same observation)
iNat->>CAMS: New WeedVisit record added to feature
Note right of iNat: When synchroniser runs

The time that the latest observation was updated is stored in a *_time_of_last_update.txt file. When the synchronisation is rerun, it checks for observations which have been updated since this timestamp (and then updates the file with the new last update timestamp).
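A minimal sketch of this bookkeeping is shown below. The function names and the ISO 8601 file format are assumptions made for illustration; the real files may store the timestamp differently.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterable, Optional


def read_time_of_last_update(path: Path) -> Optional[datetime]:
    """Return the stored timestamp, or None if the file does not exist yet."""
    if not path.exists():
        return None
    return datetime.fromisoformat(path.read_text().strip())


def write_time_of_last_update(path: Path, updated_times: Iterable[datetime]) -> None:
    """Store the latest update time seen; leave the file alone if there were none."""
    latest = max(updated_times, default=None)
    if latest is not None:
        path.write_text(latest.isoformat())


with tempfile.TemporaryDirectory() as tmp:
    # "ombfw" is the file prefix used in the configuration example in this README.
    path = Path(tmp) / "ombfw_time_of_last_update.txt"
    print(read_time_of_last_update(path))  # no file yet, so everything is new
    seen = [
        datetime(2023, 9, 25, 1, 0, tzinfo=timezone.utc),
        datetime(2023, 9, 25, 3, 30, tzinfo=timezone.utc),
    ]
    write_time_of_last_update(path, seen)
    print(read_time_of_last_update(path))
```

A missing file is treated as "no previous run", which matches the documented way of forcing a full resynchronisation by deleting the file.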

Scheduled workflow

The synchroniser is run regularly (currently hourly) by the synchronise-inat-to-cams workflow.

It can be triggered manually by clicking the Run workflow button on that page (assuming you are logged in and have permission to do so).

Schedule

The schedule is configured in the workflow definition. Under on: > schedule: the cron: setting defines a cron expression. For example,

- cron: '42 * * * *'

specifies that the workflow will be run at 42 minutes past each hour.

Note that the GitHub cron schedule uses the UTC timezone.
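To make the effect of this expression concrete, the sketch below computes the next run time for a schedule of the form `'M * * * *'`. It is not a general cron parser; `next_hourly_run` is a hypothetical helper that handles only the "minute M of every hour" case used here.

```python
from datetime import datetime, timedelta, timezone


def next_hourly_run(now: datetime, minute: int = 42) -> datetime:
    """Return the first time strictly after `now` whose minute equals `minute`."""
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(hours=1)
    return candidate


# Times are in UTC, as the GitHub schedule is evaluated in UTC.
now = datetime(2024, 1, 1, 10, 50, tzinfo=timezone.utc)
print(next_hourly_run(now))
```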

Secrets

Credentials, such as the username and password for logging on, are encrypted and stored in GitHub Secrets.

These credentials can only be read by GitHub Actions and are masked in the log files.

Environments

ArcGIS

Currently all environments are within the same ArcGIS account. The code requires two ArcGIS feature layers within this account:

Dev/Test

An expendable feature layer for development and testing of new code. Prior to running the Behaviour Driven Development (BDD) tests, a check is made that the feature layer is intended for testing (see environment.py). This ensures that we are not creating and deleting test data in production.
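The kind of guard described above might look like the sketch below. How the real check in environment.py identifies a test layer is not stated here, so the name-based rule, the `ensure_test_layer` function and the layer names are all hypothetical.

```python
def ensure_test_layer(layer_name: str, allowed_markers=("test", "dev")) -> None:
    """Raise unless the layer name suggests an expendable dev/test layer.

    Assumption: test layers can be recognised by a marker in their name.
    """
    name = layer_name.lower()
    if not any(marker in name for marker in allowed_markers):
        raise RuntimeError(
            f"Refusing to run tests against '{layer_name}': "
            "it does not look like a dev/test feature layer"
        )


ensure_test_layer("CAMS_Weeds_Test")  # hypothetical layer name, passes silently
try:
    ensure_test_layer("CAMS_Weeds")   # hypothetical production layer name
except RuntimeError as error:
    print(error)
```

Failing before any scenario runs means no test data can be created or deleted in the wrong layer.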

Production

The main feature layer containing CAMS weed data targeted by the synchroniser.

The environment is configured in the relevant workflow file.

iNaturalist

We do not have a test environment for iNaturalist, so do not perform any automated testing against iNaturalist. The operations we currently perform do not need an iNaturalist account, so are performed anonymously.

Notifications

If the workflow fails, a notification will be sent to the person who last updated the cron schedule or, if manually triggered, the person that triggered the workflow. See notifications for workflow runs for details.

Logs

Detailed logs can be viewed by clicking on the workflow run. See Using workflow run logs if you need help with this.

Timeouts

At one stage, an iNaturalist read hung on the GET request for 6 hours, until the GitHub job timed out. To avoid this happening again, we have implemented:

Workflow timeout

An overall timeout of 120 minutes is configured in the workflow:

timeout-minutes: 120

This should allow for large synchronisation jobs to be performed, while also reducing the overall minutes used when reads fail.

iNaturalist read timeout

An additional timeout of 120 seconds is applied to the iNaturalist read in case this hangs.

Retries

We have sometimes had intermittent issues connecting to iNaturalist or ArcGIS. To increase the chances of success, we have added retry logic to iNaturalist and ArcGIS interface methods. These are currently set to retry 3 times with a 5 second wait between retries.
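A minimal sketch of such retry logic is given below, written as a decorator. The `retry` name, its parameters and the choice of exception type are assumptions for illustration, and "retry 3 times" is interpreted here as one initial attempt plus up to three retries. The `sleep` parameter is injectable only so that the example runs without actually waiting.

```python
import functools
import time


def retry(retries=3, wait_seconds=5, exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped function up to `retries` times, waiting between attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == retries:
                        raise  # out of retries: surface the last error
                    sleep(wait_seconds)
        return wrapper
    return decorator


attempts = []
waits = []


@retry(retries=3, wait_seconds=5, exceptions=(ConnectionError,), sleep=waits.append)
def flaky_read():
    """Hypothetical remote call that fails twice before succeeding."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(flaky_read(), len(attempts), waits)
```

Restricting `exceptions` to connection-type errors keeps genuine bugs from being retried silently.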

Workflow minutes

The workflow is currently running under the GitHub free account, limited to 2,000 minutes/month. Since the workflow runs hourly, this equates to about 2.7 minutes per workflow run.
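The per-run figure follows from simple arithmetic. A quick check, assuming exactly one run per hour and a month of 30 or 31 days:

```python
FREE_MINUTES_PER_MONTH = 2000
RUNS_PER_DAY = 24  # one run per hour


def minutes_per_run(days_in_month: int) -> float:
    """Average minutes available to each run if the monthly allowance is spread evenly."""
    return FREE_MINUTES_PER_MONTH / (RUNS_PER_DAY * days_in_month)


for days in (30, 31):
    print(days, round(minutes_per_run(days), 2))
```

A 31-day month has 744 hourly runs, which gives the lower bound of roughly 2.7 minutes per run.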

Most of the workflow time is spent installing cached dependencies. While our immediate dependencies currently use fixed versions, some of the transitive dependencies use version ranges, which can cause this time to escalate. It's worth periodically checking the time taken by the workflows to ensure they normally complete within 2 minutes.

Time of Last Update files

The synchronisation workflow updates several files which are subsequently committed and pushed back to GitHub. These files are:

Configuration

Configuration files allow the following to be easily modified:

Taxa and places to be synchronised

The sync_configuration file determines which observations are synchronised from iNaturalist to CAMS.

An example definition is:

{
    "Old Man's Beard Free Wellington": {
        "file_prefix": "ombfw",
        "taxon_ids": ["160697"],
        "place_ids": ["6868"]
    }
}

where:

Observations that contain one of the taxon_ids within one of the place_ids will be synchronised. (Note that observations must have a location and date observed set as well as geoprivacy being set to Open for the observation to be synchronised.)
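The selection rule above can be expressed as a small predicate. This is a sketch only: the flat dictionary shape and its keys (`taxon_id`, `place_ids`, `location`, `observed_on`, `geoprivacy`) are assumptions chosen for readability, not the structure returned by iNaturalist or used by the project.

```python
def is_synchronisable(observation: dict, taxon_ids: list, place_ids: list) -> bool:
    """Return True if the observation meets every condition for synchronisation."""
    return (
        observation.get("taxon_id") in taxon_ids
        and bool(set(observation.get("place_ids", [])) & set(place_ids))
        and observation.get("location") is not None     # must have a location
        and observation.get("observed_on") is not None  # must have a date observed
        and observation.get("geoprivacy") == "open"     # geoprivacy must be Open
    )


# Entry taken from the example definition above.
entry = {"file_prefix": "ombfw", "taxon_ids": ["160697"], "place_ids": ["6868"]}

observation = {
    "taxon_id": "160697",
    "place_ids": ["6868", "1"],   # "1" is a hypothetical additional place
    "location": "-41.29,174.78",  # hypothetical coordinates
    "observed_on": "2023-09-25",
    "geoprivacy": "open",
}
print(is_synchronisable(observation, entry["taxon_ids"], entry["place_ids"]))
```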

Updating existing entries

If you add a taxon or place to an existing entry, prior records for the new taxon or place will not automatically be synchronised. To force them to be synchronised, you must first delete the file_prefix_time_of_last_update.txt file (where file_prefix is replaced by the file prefix for the entry). Upon rerunning the synchronisation, all records will be resynchronised. Since the CAMS updates are idempotent, only the new entries for taxon or place will be added and existing entries won't be modified.

NOTE: any modifications to existing entries made through the CAMS app may be overwritten. It may be worth checking and/or backing up the data first in case of any issues.

Taxon mapping

The taxon_mapping file contains a mapping from the iNaturalist taxon to the CAMS taxon. Note that all taxon_ids listed in the sync_configuration file must have a taxon mapping entry.
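The requirement that every configured taxon has a mapping can be checked mechanically. The sketch below assumes both files have been loaded into dictionaries; `missing_taxon_mappings` is a hypothetical helper, not part of the project.

```python
def missing_taxon_mappings(sync_configuration: dict, taxon_mapping: dict) -> list:
    """Return the configured taxon ids that have no entry in the taxon mapping."""
    configured = {
        taxon_id
        for entry in sync_configuration.values()
        for taxon_id in entry["taxon_ids"]
    }
    return sorted(configured - set(taxon_mapping))


sync_configuration = {
    "Old Man's Beard Free Wellington": {
        "file_prefix": "ombfw",
        "taxon_ids": ["160697"],
        "place_ids": ["6868"],
    }
}
taxon_mapping = {"160697": "OldMansBeard"}
print(missing_taxon_mappings(sync_configuration, taxon_mapping))
```

An empty list means the two files are consistent; any ids returned need a mapping entry before the synchronisation can translate them.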

An example definition is:

{
    "160697": "OldMansBeard",
    "285911": "CathedralBells",
    "879226": "BananaPassionfruit"
}

where:

CAMS Schema

The cams_schema file contains the expected schema of the CAMS feature layer. This is used to:

  1. Validate the schema at startup to ensure that the CAMS schema has not deviated from the expected schema. If the schema has deviated, the code will abort with an error message, allowing the code (or schema) to be corrected.
  2. Map names of values to the coded Value.
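The second use, together with the truncation of over-long strings mentioned later in this README, can be sketched as below. The function `to_cams_attribute` is hypothetical, and the `Location Details` field is an invented entry added only to demonstrate truncation; the `DataSource` entry follows the format of this file.

```python
schema = {
    "WeedLocations": {
        "DataSource": {
            "name": "SiteSource",
            "type": "String",
            "length": 39,
            "values": {"iNaturalist": "iNaturalist_v1"},
        },
        # Hypothetical field, present only to demonstrate truncation.
        "Location Details": {"name": "LocationInfo", "type": "String", "length": 10},
    }
}


def to_cams_attribute(schema: dict, table: str, field: str, value: str) -> tuple:
    """Return (CAMS field name, value to store) for a readable field and value."""
    definition = schema[table][field]
    if "values" in definition:
        if value not in definition["values"]:
            raise ValueError(f"'{value}' is not a known value for '{field}'")
        value = definition["values"][value]  # readable name -> coded value
    if definition["type"] == "String" and "length" in definition:
        value = value[: definition["length"]]  # truncate to the field length
    return definition["name"], value


print(to_cams_attribute(schema, "WeedLocations", "DataSource", "iNaturalist"))
print(to_cams_attribute(schema, "WeedLocations", "Location Details", "Beside the old stone bridge"))
```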

An example definition is:

{
    "WeedLocations": {
        "Date First Observed": {
            "name": "DateDiscovered",
            "type": "Date"
        },
        "DataSource": {
            "name": "SiteSource",
            "type": "String",
            "length": 39,
            "values": {
                "iNaturalist": "iNaturalist_v1"
            }
        },
        ...
    }
}

where:

Environment Variables

In addition to these configuration files, the following environment variables are needed to run the code locally (or must be set up in GitHub Secrets to run the GitHub Actions workflow):

Code

The code is written in Python 3.11.

Dependencies

The dependencies are frozen so that new transitive dependencies do not break the GitHub Actions workflows. To update the dependencies for GitHub Actions:

  1. create a fresh virtualenv locally
  2. pip install -r requirements.txt
  3. pip freeze > requirements_lock.txt

As of 2023-09-25, the arcgis package only supports Python up to 3.11 (version 2.2.0 requires Python >=3.9, <3.12).

Dependencies include:

For development, we have used the free PyCharm IDE.

Folder structure

The folder structure is:

Overview

This diagram shows the flow of the synchronisation from iNaturalist to ArcGIS CAMS.

[Image: diagram of the synchronisation flow from iNaturalist to ArcGIS CAMS]

When the synchroniser is invoked, it:

  1. Parses the sync_configuration file to determine the synchronisations to perform. A sync configuration contains the taxa and places to be synchronised. For each sync configuration:
    1. The time of last update is read
    2. A request is made to iNaturalist for any new observations for the relevant taxa and places since the previous last update time
    3. For each new observation:
      1. The observation is read by iNaturalist_reader.
      2. This creates a complex data structure, which is flattened into an iNaturalist_observation.
      3. The translator translates the observation into a cams_feature. There is some complexity to the translation, for example:
        • some weeds are mapped at a higher level of the taxonomy than an individual species. For example, Banana Passionfruit is mapped to Section Elkea which contains a number of species and their hybrids. The translation works up the taxonomic tree until it finds a matching taxon.
        • the visit date and status are calculated dependent on the latest of the date_controlled, date_of_status_update, date_first_observed fields. The status is translated to one of the CAMS colour status fields dependent on various fields.
        • dates and times are converted from UTC to local time
      4. The cams_feature is written to the ArcGIS CAMS feature layer using the cams_writer. This uses a cams_reader to read the current record and check for differences before creating the feature and/or visit record if modified. Sometimes the changes in the iNaturalist observation are to fields that we aren't interested in and no changes need writing to CAMS.
        • String fields are truncated if they are longer than the target CAMS fields.
        • cams_writer and cams_reader delegate to cams_interface to interface with ArcGIS. This interface also checks that the fields in the CAMS feature layer and visits table are as expected (type, length etc)
      5. A summary of any changes is logged using the summary logger. This is configured in setup_logging.
    4. The updated time of last update is written to file.
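Two of the translation rules in the steps above can be sketched in a few lines. This is an illustration under assumptions rather than the translator's actual code: the `lineage` list (taxon ids ordered from the observed taxon up towards the root), the function names and the placeholder species id are hypothetical.

```python
from datetime import date
from typing import Optional


def cams_taxon_for(lineage: list, taxon_mapping: dict) -> Optional[str]:
    """Walk up the taxonomic tree until a mapped taxon is found."""
    for taxon_id in lineage:  # most specific taxon first
        if taxon_id in taxon_mapping:
            return taxon_mapping[taxon_id]
    return None  # no ancestor is mapped


def visit_date(
    date_first_observed: date,
    date_controlled: Optional[date] = None,
    date_of_status_update: Optional[date] = None,
) -> date:
    """Return the latest of the dates that are set."""
    candidates = [date_first_observed, date_controlled, date_of_status_update]
    return max(d for d in candidates if d is not None)


taxon_mapping = {"160697": "OldMansBeard", "879226": "BananaPassionfruit"}

# "999999" is a placeholder id for a species whose ancestor is mapped.
print(cams_taxon_for(["999999", "879226"], taxon_mapping))
print(visit_date(date(2023, 1, 10), date_controlled=date(2023, 3, 2)))
```

Checking the observed taxon first and then each ancestor in turn means a direct species mapping, where one exists, takes precedence over a mapping at a higher rank.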

Behaviour Driven Development

The project's features are described using Feature Files that are automated using Behave. Once the feature is well understood, the code to implement these features is then developed.

Explore our feature files.

The resultant reports are published as artifacts at the end of each Run Tests workflow run:

[Image: report artifacts listed on a Run Tests workflow run]

Unzipping the report file and opening the behave_reports.html file shows the status of each scenario:

[Image: behave_reports.html showing the status of each scenario]