climsoft / climsoft-web

Climsoft web application
MIT License
0 stars 6 forks source link

Database Technology #35

Open Patowhiz opened 3 months ago

Patowhiz commented 3 months ago

Overview

Climsoft desktop utilises MariaDB for managing climate data. However, to accommodate our evolving requirements, including the need to store data properties in JSON format and the anticipated increase in spatial data related to climate and hydrology, we are considering a migration to PostgreSQL with the PostGIS extension. This shift is aimed at harnessing the superior spatial and JSON data handling capabilities of PostgreSQL and PostGIS.

MariaDB, while effective for time series data needs, falls short in managing complex JSON and spatial data structures that the database will be expected to store. The advanced indexing and querying capabilities needed for climate and hydrology datasets necessitate a more robust solution.

Rationale and Benefits

I propose that we migrate our database management system to PostgreSQL with the PostGIS extension. This change would bring the following benefits:

  1. Advanced JSON Handling: PostgreSQL's JSONB data type offers superior capabilities for storing, indexing, and querying JSON data, making it ideal for handling the structured and semi-structured data prevalent in our records, especially for tracking changes.

  2. Superior Spatial Capabilities: PostGIS extends PostgreSQL with comprehensive support for geographic objects, facilitating sophisticated spatial queries and analyses. This enhancement is crucial for effectively managing the spatial aspects of our climate and hydrology data, streamlining processes, and eliminating the need for additional spatial data handling solutions.

  3. Robust Indexing Options: The extensive indexing options provided by PostgreSQL and PostGIS, such as GIST and GIN indexes, are tailor-made for efficient retrieval of spatial and JSON data. This improvement will directly enhance the performance and responsiveness of our data querying and analytical operations.

  4. Community and Ecosystem: PostgreSQL, being one of the most popular open-source relational database systems, has a large and active community. This ensures continuous improvements and support, along with a plethora of tools and integrations available, which can be beneficial for our system's maintainability.

Request for Comments

I invite feedback on this proposal. Your insights and suggestions will be invaluable.

isedwards commented 3 months ago

@Patowhiz - I completely agree with your analysis. Also have a look at timescaledb (a postgres extension for time-series data).

If you're planning significant changes to the Climsoft data model - and you're also considering changing the database technology to Postgres - I wonder whether you may also be open to migrating users completely to the WMO Climate Data Model Standard (which would move Climsoft much further towards WMO compliance than any other CDMS implementation)?

@david-i-berry

Patowhiz commented 3 months ago

@isedwards, thank you for your input.

We are in the process of internally discussing the modifications to our model. Although there have been significant additions, these changes are an extension of what exists in version 4. We have not made any drastic alterations that would render the prior experience with the current model obsolete.

I have been examining the WMO climate data model version 1 release and identified certain features that I have incorporated, am considering incorporating, and need to evaluate for potential adoption. Given the substantial differences in the schema, I recommend we consider proceeding with it after conducting test cases against the database using the OpenCDMS application, aligning with the OpenCDMS roadmap.

In our search for a suitable database technology, I explored timescaledb. Its approach to timeseries data, which offers enhanced performance, is commendable. However, it lacks the spatial capabilities that are crucial for our immediate needs. Ideally, a PostgreSQL extension combining the features of timescaledb and PostGIS would be perfect. Unfortunately, such a solution does not yet exist, and given our current user base and anticipated product development, PostGIS seems to offer more immediate advantages.

Additionally, I am contemplating the use of timescaledb for ancillary data storage, such as tracking user activity audits, separate from our main database.

isedwards commented 3 months ago

Thank you @Patowhiz - we would be very happy to help with test cases mapping both the current Climsoft data model and also any planned changes to the WMO Climate Data Model standard and I agree that this would be the most sensible next step.

Although Timescale and PostGIS are both available packaged as separate installable solutions, under the hood they are both just PostgreSQL extensions and so they can both be used together in a standard Postgres installation along with any other extensions that are required.

The TimescaleDB High Availability docker image is an example that includes PostgreSQL with the PostGIS (spatial), TimescaleDB (time-series) and Patroni (high availability) extensions already installed an configured: https://docs.timescale.com/self-hosted/latest/install/installation-docker/

Patowhiz commented 3 months ago

@isedwards I will delve into the details of timescaledb-ha, evaluate its implementation and usability in regards to our use case and user base.