EthanWaters / reusable_digital_workflow


Control Data Reusable Workflow

Lifecycle: experimental

1.0 Overview

This codebase was developed as part of the COTS Control Innovation Program, project R-02. Its overall purpose is to clean, wrangle, and perform geospatial analysis on control program data from GBRMPA to produce a standardised output for use in research or decision support tools. source.R defines a large set of functions for this, each of which generally serves one of the following purposes:

  1. Data transformation
  2. Error checking & processing
  3. Site assignment to control data, where applicable
  4. Aggregation and export

Functions defined in source.R are then used to produce several application-specific reusable workflows:

  1. process_control_data_research_output.R (see Section 3.1)
  2. ingest_control_data_export_to_app.R (see Section 3.2)

1.1 Term Definitions

This section defines several terms used throughout the documentation to ensure clarity.

2.0 Installation & Requirements

This codebase is designed to be automated with Azure and not run locally.

Docker containers were produced during the development of this codebase to ensure that the client environment remains consistent with the development environment; see Section 2.1 for instructions. See Sections 2.2 & 2.3 for details of all packages installed in the development environment. There is no guarantee that the Docker images are up to date.

2.1 Docker

Docker version 24.0.6 is required: https://www.docker.com/products/docker-desktop/

Although not recommended, the scripts can be executed locally after running the setup scripts. On Windows, execute the setup and dependencies batch files; on Linux, run the setup and dependencies shell scripts.

2.2 R Environment Information

2.3 Required Packages

Package Version
tools 4.2.1
installr 0.23.4
readxl 1.4.1
sets 1.0-21
XML 3.99-0.13
methods 4.2.1
xml2 1.3.3
rio 0.5.29
dplyr 1.0.10
stringr 1.4.1
fastmatch 1.1-3
lubridate 1.8.0
rlang 1.1.0
inline 0.3.19
purrr 0.3.4
jsonlite 1.8.7
sf 1.0-14
leaflet 2.1.2
raster 3.6-23
terra 1.7-39
units 0.8-0
tidyverse 1.3.2
tidyr 1.2.0
lwgeom 0.2-13
stars 0.6-4
furrr 0.3.1
foreach 1.5.2
doParallel 1.0.17
DBI 1.1.3

3.0 Reusable Workflow Details

3.1 Reusable Workflow - Process Control Data For Research

This R code defines a data processing pipeline that imports, formats, and verifies control data for research purposes, creating a metadata report to document pipeline outcomes. The main() function is the entry point of the pipeline. It takes as inputs the paths to the legacy data, new data, KML data, and JSON configuration file. The control program data can then be assigned to the nearest cull sites.
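
A minimal usage sketch; the argument names follow the main() signature documented in Section 5.0, while the file paths are hypothetical placeholders:

```r
source("source.R")

# Hypothetical input locations; substitute the real paths for your environment.
main(
  new_path           = "data/new_control_data.xlsx",
  configuration_path = "config/research_output_config.json",
  kml_path           = "data/reef_sites.kml",
  leg_path           = "data/legacy_control_data.xlsx"
)
```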

3.1.1 Data Transformation

While an ideal scenario would involve a fully dynamic system capable of automatically determining mapping transformations from one version of a dataset to the next, this proved unattainable because names in the new GBRMPA database overlap with names used in a different context in the old dataset. To address this challenge, a compromise between modularity and robustness was sought. Instead of hard-coding numerous transformations, JSON configuration files specify the transformations, which are then checked against the input with NLP techniques and dynamically adjusted so that semantic differences can still be mapped effectively. This approach allows flexibility in handling future datasets: any dataset can supply a configuration file and then use the workflow to ensure consistent data output.
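
As an illustration of that matching idea (a sketch only, not the repository's exact NLP logic), incoming column names can be fuzzy-matched against the names a configuration expects with an edit-distance measure such as base R's adist():

```r
# Map each incoming column to its closest expected name, accepting the match
# only when the Levenshtein distance is small; names here are illustrative.
match_columns <- function(incoming_cols, expected_cols, max_dist = 2) {
  d <- adist(incoming_cols, expected_cols, ignore.case = TRUE)
  best <- apply(d, 1, which.min)
  ok <- d[cbind(seq_along(incoming_cols), best)] <= max_dist
  data.frame(incoming  = incoming_cols,
             mapped_to = ifelse(ok, expected_cols[best], NA_character_))
}

match_columns(c("Reef_Name", "cotscount"), c("ReefName", "COTS_Count"))
```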

3.1.2 Error Checking & Discrepancy Detection

Error checking is independent of discrepancy detection. These functions interpret the data and flag entries as errors if they are likely to be inappropriate for use in analysis, based on advice from Dr Cameron Fletcher. No data is ever removed.

Discrepancy detection provides the opportunity to identify changes in a specific row of data. It is not possible to know whether a change is a mistake or a QA correction, so if a change turns an error-free data point into one containing an error, the original row is retained; in all other situations the new row is used.
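
The selection rule can be summarised in a short sketch; has_error() is a placeholder, not a function from source.R:

```r
# Sketch of the row-selection rule described above. has_error() stands in
# for the workflow's error-flagging functions (the verify_* family in
# Section 5.0); it returns TRUE when a row is flagged as an error.
resolve_discrepancy <- function(original_row, new_row, has_error) {
  if (!has_error(original_row) && has_error(new_row)) {
    original_row  # the change introduced an error: retain the original row
  } else {
    new_row       # otherwise the new (possibly QA-corrected) row wins
  }
}
```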

What Denotes An Error?
Discrepancy Detection - Decisions & Their Philosophy

3.1.3 Site Assignment

The method traditionally employed for the assignment of control data observations to specific geographical regions was valuable for understanding ecological patterns across various reef environments. However, its initial implementation relied on a Mathematica script, which introduced accessibility challenges due to the proprietary nature of Mathematica software. This limitation not only hindered wider adoption of the technique but also raised concerns about long-term sustainability and data processing bottlenecks. To overcome these hurdles and enhance the method's usability, we undertook the task of reconstructing the approach in the open-source programming language R. This transformation renders the method more accessible, enabling researchers to employ it without the constraints posed by proprietary software. Our reimplementation closely follows the original approach, allowing us to efficiently process observations and alleviate potential bottlenecks associated with external dependencies, ensuring a more streamlined data analysis workflow. The R implementation of Dr Cameron Fletcher's site assignment was the most accurate method for site assignment of those tested.

Pre-processing

Steps were taken to reduce the computational complexity of the calculations by simplifying the intricate polygonal shapes. The process implements the Ramer-Douglas-Peucker algorithm to obtain an adaptive approximation of complex polygons while maintaining their essential characteristics, based on a predetermined threshold of $10^{-5}$.
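
A minimal recursive sketch of the algorithm, mirroring the rdp(points, epsilon = 0.00001) signature documented in Section 5.0 (illustrative only, not the repository's implementation):

```r
# `points` is a two-column matrix describing an open polyline; closed rings
# should be split before calling (cf. polygon_rdp in Section 5.0).
rdp <- function(points, epsilon = 1e-5) {
  n <- nrow(points)
  if (n < 3) return(points)
  a <- points[1, ]; b <- points[n, ]
  ab <- b - a
  # Perpendicular distance of every point from the segment a-b.
  d <- abs(ab[2] * (points[, 1] - a[1]) - ab[1] * (points[, 2] - a[2])) /
    sqrt(sum(ab^2))
  i <- which.max(d)
  if (d[i] > epsilon) {
    # Keep the farthest point and recurse on both halves.
    left  <- rdp(points[1:i, , drop = FALSE], epsilon)
    right <- rdp(points[i:n, , drop = FALSE], epsilon)
    rbind(left[-nrow(left), , drop = FALSE], right)
  } else {
    rbind(a, b)  # all interior points fall within the threshold
  }
}

pts <- cbind(c(0, 1, 2, 3, 4), c(0, 4e-6, 0, -4e-6, 0))
rdp(pts)  # interior wiggles below the 1e-5 threshold are dropped
```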

Spatial Analysis

The bounding boxes of each reef layer are extended by 0.003 degrees, roughly equivalent to 300 meters. The initial objective is to ensure that the bounding boxes encompass the entirety of the reef polygons, incorporating a buffer zone of suitable dimensions. This buffer serves the purpose of accommodating the meandering trajectory of manta tows, which tend to fluctuate in proximity to the reef margins. Achieving a delicate equilibrium, the buffer must be substantial enough to avoid overlap between reefs and to capture most manta tows, while avoiding computational overload. The approach also seeks to align with the practices of GBRMPA (Great Barrier Reef Marine Park Authority), wherein manta tows are assigned to sites based on proximity conditions. To maintain fidelity with the GBRMPA framework, the buffer is set at 0.003 degrees, a value that ensures consistency in proximity while retaining computational efficiency.

The expansion of the bounding boxes is coupled with an iterative process of rasterization, resulting in a raster for every reef layer. These rasters can be used for subsequent spatial analyses if desired.
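
A sketch of this step with sf and terra (both in the package list above); the toy polygon stands in for a reef layer read from the KML, and the 150x150 grid follows the raster_size default shown in Section 5.0:

```r
library(sf)
library(terra)

# Toy reef polygon in EPSG:4326; a real layer would come from the KML data.
reef <- st_sfc(st_polygon(list(rbind(c(146.20, -18.30), c(146.30, -18.30),
                                     c(146.30, -18.20), c(146.20, -18.30)))),
               crs = 4326)

# Extend the bounding box by 0.003 degrees on every side, then rasterise.
bb <- st_bbox(reef)
template <- rast(xmin = bb["xmin"] - 0.003, xmax = bb["xmax"] + 0.003,
                 ymin = bb["ymin"] - 0.003, ymax = bb["ymax"] + 0.003,
                 nrows = 150, ncols = 150, crs = "EPSG:4326")
reef_raster <- rasterize(vect(reef), template, field = 1)
```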

To calculate the distance between a point and a polygon, the st_distance function from the sf package is used; it can perform the calculation with either Euclidean or great-circle distance. Euclidean distance is used for comparison, but accuracy can be improved in future implementations by using great-circle distance. Nothing in the code or documentation indicated that the assignment of a pixel depends on the assignment of any other pixel. The assigned rasters undergo a transformation, yielding a set of rasters, each corresponding to a distinct reef.
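
A sketch of the two distance modes with sf::st_distance: with geographic coordinates sf returns great-circle (geodesic) distances, while projecting first yields Euclidean distances. The point, polygon, and choice of UTM zone 55S (EPSG:32755) are illustrative:

```r
library(sf)

pt   <- st_sfc(st_point(c(146.10, -18.25)), crs = 4326)
poly <- st_sfc(st_polygon(list(rbind(c(146.20, -18.30), c(146.30, -18.30),
                                     c(146.30, -18.20), c(146.20, -18.30)))),
               crs = 4326)

st_distance(pt, poly)  # great-circle distance in metres (geographic CRS)

utm <- st_crs(32755)   # WGS 84 / UTM zone 55S
st_distance(st_transform(pt, utm), st_transform(poly, utm))  # Euclidean
```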

Manta tow centroids are transformed into point representations. Iterating through the set of rasters, the tow points are filtered based on the reef name of the raster. The value of the raster at each centroid point is extracted, and the results are merged with the manta tow data input.
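
A sketch of the extraction step, reusing reef_raster from the rasterisation sketch above; the tow point and ReefName value are illustrative:

```r
library(sf)
library(terra)

# One illustrative manta-tow centroid as an sf point layer.
tow_points <- st_sf(ReefName = "Hypothetical Reef",
                    geometry = st_sfc(st_point(c(146.25, -18.27)), crs = 4326))

# Filter to the raster's reef, then read the raster value under each point.
tows <- tow_points[tow_points$ReefName == "Hypothetical Reef", ]
vals <- terra::extract(reef_raster, vect(tows))
tows$site <- vals[, 2]  # column 1 of extract() output is the point ID
```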

3.1.4 Export Data

Output locations are defined in the configuration files and will be created if they do not already exist. Any output is saved with the naming convention Keyword%Y%m%d%H%M%S.<file extension>. Do NOT remove data outputs; simply take a copy. Previous outputs are used to reduce processing and reduce errors.
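
For example (the "ControlData" keyword is illustrative):

```r
# Keyword followed by a %Y%m%d%H%M%S timestamp and the file extension.
filename <- paste0("ControlData", format(Sys.time(), "%Y%m%d%H%M%S"), ".csv")
# e.g. "ControlData20240101120000.csv"
```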

3.2 Reusable Workflow - Ingest Control Program Data

This R code defines a data processing pipeline that ingests JSON exports from GBRMPA-owned PWAs, then formats, verifies, and exports the data for use in the COTS Control Centre Decision Support Tool. The main() function is the entry point of the pipeline and requires a list of JSON files to ingest, a path to the config file, and a connection string to connect to the database. This workflow was produced so that previous
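
A hypothetical invocation sketch; the argument names, file locations, and connection string below are placeholders rather than the workflow's exact interface:

```r
source("source.R")

# Gather the PWA JSON exports, then hand them to the pipeline entry point.
json_files <- list.files("exports", pattern = "\\.json$", full.names = TRUE)
main(
  json_files  = json_files,                       # list of JSON exports
  config_path = "config/app_ingest_config.json",  # workflow config file
  conn_string = "Driver={ODBC Driver 17 for SQL Server};Server=...;Database=..."
)
```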

4.0 Configuration files

Configuration files should not be altered; instead, new alternative configuration files should be produced. Config files exist for both workflows, specifying the expected column transformations, required new columns, their default values, and their data types. Other config files map database column names to research output column names so that aspects of the codebase can be reused.
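
The snippet below sketches the shape such a configuration might take, parsed with jsonlite; the keys and values are illustrative, not the repository's actual schema:

```r
library(jsonlite)

config <- fromJSON('{
  "mappings":   { "Reef Name": "ReefName", "COTS count": "COTS_Count" },
  "new_fields": [
    { "name": "Source",  "default": "GBRMPA", "type": "character" },
    { "name": "SiteNum", "default": null,     "type": "integer"   }
  ],
  "output_path": "outputs/research/"
}')
str(config)
```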

5.0 Code Documentation

Function: main(new_path, configuration_path, kml_path, leg_path)

Function: import_data(data, index=1)

Function: get_datetime_parse_order()

Function: contribute_to_metadata_report(data, key="Warning")

Function: get_vessel_short_name(string)

Function: get_file_keyword(string)

Function: append_to_table_unique(con, table_name, data_df)

Function: get_id_by_cell(con, table_name, search_column, search_term)

Function: get_id_by_row(con, table_name, data_df)

Function: get_voyage_dates_strings(strings)

Function: get_app_data_database(con, control_data_type)

Function: separate_date_time(date_time)

Function: get_reef_label(names)

Function: get_start_and_end_coords_research(start_lat, stop_lat, start_long, stop_long)

Function: get_start_and_end_coords_app(start_lat, stop_lat, start_long, stop_long)

Function: get_start_and_end_coords_base(start_lat, stop_lat, start_long, stop_long)

Function: get_feeding_scar_from_description(names)

Function: get_worst_case_feeding_scar(scars)

Function: get_coral_cover(coral)

Function: get_median_coral_cover(coral)

Function: missing_reef_information(data, columns, test_value = NA)

Function: assign_missing_site_and_reef(transformed_data_df, serialised_spatial_path, control_data_type)

Function: site_numbers_to_names(numbers, reef_names)

Function: aggregate_culls_site_resolution_research(data_df)

Function: aggregate_culls_site_resolution_app(data_df)

Function: aggregate_manta_tows_site_resolution_app(data_df)

Function: aggregate_manta_tows_site_resolution_research(data_df)

Function: separate_control_dataframe(new_data_df, legacy_data_df)

Function: separate_new_control_app_data(new_data_df, legacy_data_df)

Function: flag_duplicates(new_data_df)

Function: compare_discrepancies(new_data_df, legacy_data_df, discrepancies)

Function: map_column_names(column_names)

Function: set_data_type(data_df, mapping)

Function: matrix_close_matches_vectorised(x, y, distance)

Operator: %fin%

Function: vectorised_separate_close_matches(close_match_rows)

Function: rec_group(stack, m2m_split, groups, group)

Function: verify_RHISS(data_df)

Function: verify_voyage_dates(data_df)

Function: verify_percentages(data_df)

Function: verify_na_null(data_df)

Function: verify_integers_positive(data_df)

Function: remove_leading_spaces(data_df)

Function: verify_coral_cover(data_df)

Function: verify_cots_scars(data_df)

Function: verify_cohort_count(data_df)

Function: find_one_to_one_matches(close_match_rows)

Function: verify_entries(data_df, configuration)

Function: verify_lat_lng(data_df, max_val, min_val, columns, ID_col)

Function: verify_scar(data_df)

Function: verify_tow_date(data_df)

Function: get_new_field_default_values(data_df, new_fields)

Function: transform_data_structure(data_df, mappings, new_fields)

Function: assign_nearest_site_method_c

Function: get_centroids(data_df, crs, precision=0)

Function: find_recent_file(directory_path, keyword, file_extension)

Function: save_spatial_as_raster(output_path, serialized_spatial_path)

Function: get_spatial_differences(kml_data, previous_kml_data)

Function: compute_checksum(data)

Function: assign_raster_pixel_to_sites_parallel(kml_data, layer_names_vec, crs, raster_size, x_closest=1, is_standardised=0)

Function: assign_raster_pixel_to_sites_single(raster, site_poly, crs, x_closest)

Function: assign_raster_pixel_to_sites_non_parallel(kml_data, layer_names_vec, crs, raster_size, x_closest=1, is_standardised=0)

Function: assign_raster_pixel_to_sites(kml_data, layer_names_vec, crs, raster_size, x_closest=1, is_standardised=0)

Function: site_names_to_numbers(site_names)

Function: simplify_reef_polyogns_rdp(kml_data)

Function: polygon_rdp(polygon_points, epsilon=0.00001)

Function: rdp(points, epsilon=0.00001)

Function: perpendicularDistance(p, A, B)

Function: simplify_kml_polyogns_rdp(kml_data)

Function: simplify_shp_polyogns_rdp(shapefile)

Function: find_largest_extent(kml_data)

Function: standardise_extents(kml_data)

Function: create_raster_templates(extents, layer_names_vec, crs, raster_size=150)

Function: rasterise_sites(kml_data, is_standardised=1, raster_size=150)

Function: rasterise_sites_reef_encoded(kml_data, layer_names_vec, is_standardised=1, raster_size=150)

Function: xth_smallest(x, x_values)

Function: contribute_to_metadata_report(key, data, parent_key=NULL, report_path=NULL)

Function: update_config_file(data_df, config_path)

Function: map_new_fields(data_df, new_fields)

Function: map_all_fields(data_df, transformed_df, mappings)

Function: map_data_structure(data_df, mappings, new_fields)

Function: extract_dates(input)