MerginMaps / geodiff

Library for handling diffs for geospatial data
https://merginmaps.com
MIT License
146 stars 16 forks source link

Database defined UNIQUE or FOREIGN KEY constraint causes rebase to fail #210

Open leorudczenko opened 4 months ago

leorudczenko commented 4 months ago

This is a follow-up from https://github.com/MerginMaps/mobile/issues/2933. After further investigation into this bug, we have found that the bug exists in the rebase function from geodiff.

Summary

Our database/GeoPackage structure requires the use of the UNIQUE constraint on certain fields, specifically for primary/foreign keys. When trying to rebase a GeoPackage file with a UNIQUE constraint on any column, it fails with this error: pygeodiff.geodifflib.GeoDiffLibError: rebase. This error is then what leads to the strange conflict bug here: https://github.com/MerginMaps/mobile/issues/2933.

To Reproduce

We have created a Python script to re-produce this error repeatedly, using a minimal table definition with 2 fields:

"""
geodiff_unique_rebase_bug.py
"""
import sqlite3
from pathlib import Path

import pygeodiff
import etlhelper as etl

def main() -> None:
    # Define GeoPackage filepaths
    gpkg_dir = Path(__file__).parent
    base = gpkg_dir / "base.gpkg"
    ours = gpkg_dir / "ours.gpkg"
    theirs = gpkg_dir / "theirs.gpkg"
    reset_gpkg_files(base, ours, theirs)
    create_gpkg_files(base, ours, theirs)
    geodiff_rebase(gpkg_dir, base, ours, theirs)

def reset_gpkg_files(base: Path, ours: Path, theirs: Path) -> None:
    """
    Delete the GeoPackage files if they already exist.
    """
    for gpkg in [base, ours, theirs]:
        gpkg.unlink(missing_ok=True)

def create_gpkg_files(base: Path, ours: Path, theirs: Path) -> None:
    """
    Create 3 GeoPackage files at the given filepaths.
    All 3 of the files will have the same table created in them.
    But, for the file 'ours' and 'theirs', they will each have 1 row inserted into the table too.
    These rows will abide by the database table constraints.
    """
    # UNIQUE constraint in this CREATE TABLE statement causes geodiff.rebase to fail
    # If you re-run this script, but remove the UNIQUE constraint, it will succeed
    create_table = """
        CREATE TABLE test (
            "fid" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
            "name" TEXT UNIQUE
        )
    """

    # Add empty table to all 3
    for gpkg in [base, ours, theirs]:
        with sqlite3.connect(gpkg) as conn:
            etl.execute(create_table, conn)

            # Add 1 row to ours and theirs
            if gpkg.stem == "ours":
                etl.load("test", conn, [{"name": "our_change"}])
            elif gpkg.stem == "theirs":
                etl.load("test", conn, [{"name": "their_change"}])

def geodiff_rebase(gpkg_dir: Path, base: Path, ours: Path, theirs: Path) -> None:
    """
    Call geodiff.rebase.
    When it works, you should end up with both of the inserted rows in the file 'ours'.
    When it fails, you should see 'pygeodiff.geodifflib.GeoDiffLibError: rebase' and no conflict file is created.
    """
    geodiff = pygeodiff.GeoDiff()
    geodiff.rebase(str(base), str(theirs), str(ours), str(gpkg_dir / "conflict.gpkg"))
    print(f"Rebased files successfully into: {ours}")

if __name__ == "__main__":
    main()

Running the Script

To run the script, you need to install the following Python libraries:

Run the script with: python geodiff_unique_rebase_bug.py

Results

The script will fail at the rebase step, demonstrating the bug.

If you then remove the UNIQUE constraint from the test table definition, it will allow the script to run to completion. In this circumstance, the GeoPackage ours will end up with 2 rows as expected.

Software Versions

wonder-sk commented 3 months ago

Thanks for the detailed bug report @leorudczenko!

I have looked into this and the reason for things falling apart is this:

Now how to solve this issue: Ideally we would want to tell sqlite to temporarily disable constraints while we're doing updates (like what postgres has with SET CONSTRAINTS ALL DEFERRED), but this does not seem to be supported (only foreign keys can be deferred, and only "check" constraints can be disabled).

I can see two ways out:

The question is what to do when there would be simultaneous edits that actually break the unique constraint - when geodiff is used in context of Mergin Maps, this would mean that a user would end up being unable to sync their changes, due to the unique constraint error, and no way to fix it. This would be quite tricky to handle, to give mobile users enough information what has happened and offer some ways out - for the typical use case with field surveys, it is typically easier to let some data admin fix such issues in the data afterwards... I am keen to hear your thoughts!

leorudczenko commented 3 months ago

Thanks so much for all the information and insightful discussion @wonder-sk! It's great to finally understand what's going on with this bug 😄

I personally think the first of your two potential solutions sounds like the more complex but ideal one, where you try to apply the difference and then create a buffer of changes on a failed constraint. It seems like it would better fit the current process used to apply differences, whilst avoiding potential issues with directly modifying the database schema.

I think when there are simultaneous edits that break a constraint, it would make sense then for it to create a conflict version of the database, like Mergin Maps already does. In an ideal world, the user would also be informed that this has been done, but it isn't required. Then again, Mergin Maps currently does not display any message when a conflict occurs. However, the important difference between these 2 scenarios would be the reason for the conflict, which would be very helpful to know as a developer. Especially if it could include the exact constraint that failed.

volcan01010 commented 3 months ago

Hi @wonder-sk,

Thanks for picking this up. Although it is a bit awkward that we have constraints on the GeoPackage, I think that it is likely that others will take advantage of these features of SQLite, too.

Without being able to switch off constraints, applying the diff is a tricky problem. I have two comments:

DROP TABLE IF EXISTS test;
DROP TABLE IF EXISTS geodiff_working;

CREATE TABLE test (
    "fid" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    "name" TEXT UNIQUE
);

INSERT INTO test (fid, name) VALUES (1, 'our_change');

CREATE TEMPORARY TABLE geodiff_working AS SELECT * FROM test;

INSERT INTO geodiff_working (fid, name) VALUES (2, 'our_change');
UPDATE geodiff_working SET name = 'their_change' WHERE fid = 1;

-- Getting the data back isn't so easy...
-- We can't do this in real life because wiping the table will break foreign key
-- constraints and inserting all the records will fire insert triggers and
-- increment sequences.
DELETE FROM test;
INSERT INTO test SELECT *  FROM geodiff_working;

SELECT * FROM test;

Please let us know if there is anything that we can do to help. We can arrange a call if you would like to discuss options (and it would be good for us to learn how geodiff works internally) and we can do also test different versions of code.

volcan01010 commented 3 months ago

I had another thought for a workaround. In this case you create a stash of failed values inserting a placeholder in the failed column, then update them into the correct place at the end.

volcan01010 commented 1 month ago

We had another discussion about workarounds for this. One question that we had was why the UNIQUE constraint issue only affects us in the mobile app and not in the QGIS plugin.

volcan01010 commented 1 month ago

It seems like GDAL and QGIS are both adding support for auto-detection of UNIQUE constraints in GeoPackages, so there must be other people than us that use them. They already detect foreign key relationships.

https://github.com/qgis/QGIS/issues/57823

leorudczenko commented 1 month ago

We tried a workaround for this issue by dropping all of the UNIQUE constraints from our table definitions, however this causes SQLite to throw an error regarding a foreign key mismatch. This happens because SQLite requires that parent key columns to have a UNIQUE constraint.

The script below will fail when geodiff tries to rebase because of the UNIQUE constraint. But if you remove the UNIQUE constraint from the test_parent.uuid column, then it will result in the following SQLite error when you try to insert data: foreign key mismatch - "test_child" referencing "test_parent"

"""
geodiff_fk_rebase_bug.py
"""
import sqlite3
from pathlib import Path

import pygeodiff
import etlhelper as etl

def main() -> None:
    # Define GeoPackage filepaths
    gpkg_dir = Path(__file__).parent
    base = gpkg_dir / "base.gpkg"
    ours = gpkg_dir / "ours.gpkg"
    theirs = gpkg_dir / "theirs.gpkg"
    reset_gpkg_files(base, ours, theirs)
    create_gpkg_files(base, ours, theirs)
    geodiff_rebase(gpkg_dir, base, ours, theirs)

def reset_gpkg_files(base: Path, ours: Path, theirs: Path) -> None:
    """
    Delete the GeoPackage files if they already exist.
    """
    for gpkg in [base, ours, theirs]:
        gpkg.unlink(missing_ok=True)

def create_gpkg_files(base: Path, ours: Path, theirs: Path) -> None:
    """
    Create 3 GeoPackage files at the given filepaths.
    All 3 of the files will have the same tables created in them.
    But, for the file 'ours' and 'theirs', they will each have 1 row inserted into each table too.
    These rows will abide by the database table constraints.
    """
    # We have to setup the "uuid" column with a UNIQUE constraint to comply with SQLite database constraints for foreign key columns
    # It is not possible to define foreign key relationships in SQLite without UNIQUE constraints on the parent key column
    # Typically, the PRIMARY KEY column is used as the parent key column, where the PRIMARY KEY by default also applies the UNIQUE constraint

    # If you remove the UNIQUE constraint from the "uuid" column in the table "test_parent", it will result in the following SQLite error:
    # foreign key mismatch - "test_child" referencing "test_parent"

    create_test_table = """
        CREATE TABLE test_parent (
            "fid" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,  -- This is the PRIMARY KEY for the GeoPackage
            "uuid" TEXT UNIQUE,  -- This is the FOREIGN KEY for parent/child relationships
            "name" TEXT
        )
    """
    create_test_child_table = """
        CREATE TABLE test_child (
            "fid" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
            "parent_fuid" TEXT,
            "name" TEXT,
            FOREIGN KEY("parent_fuid") REFERENCES "test_parent"("uuid")
        )
    """

    # Add both tables to all 3 databases
    for gpkg in [base, ours, theirs]:
        with sqlite3.connect(gpkg) as conn:
            etl.execute("PRAGMA foreign_keys = ON", conn)

            # Create the 2 tables
            for create_sql in [create_test_table, create_test_child_table]:
                etl.execute(create_sql, conn)

            # Add 1 row to ours and theirs
            if gpkg.stem == "ours":
                etl.load("test_parent", conn, [{"name": "our_change", "uuid": "{ac267d95-db70-4299-b019-c52599ca1e5f}"}])
                etl.load("test_child", conn, [{"parent_fuid": "{ac267d95-db70-4299-b019-c52599ca1e5f}", "name": "our_child"}])
            elif gpkg.stem == "theirs":
                etl.load("test_parent", conn, [{"name": "their_change", "uuid": "{abc43098-fe9b-4da0-b008-7518694466bb}"}])
                etl.load("test_child", conn, [{"parent_fuid": "{abc43098-fe9b-4da0-b008-7518694466bb}", "name": "their_child"}])

def geodiff_rebase(gpkg_dir: Path, base: Path, ours: Path, theirs: Path) -> None:
    """
    Call geodiff.rebase.
    When it works, you should end up with both of the inserted rows in the file 'ours'.
    When it fails, you should see 'pygeodiff.geodifflib.GeoDiffLibError: rebase' and no conflict file is created.
    """
    geodiff = pygeodiff.GeoDiff()
    geodiff.rebase(str(base), str(theirs), str(ours), str(gpkg_dir / "conflict.gpkg"))
    print(f"Rebased files successfully into: {ours}")

if __name__ == "__main__":
    main()

We require the foreign key relationships to be defined in the database because our tool depends on QGIS auto-detecting the relationships.

This issue may relate to these other issues concerning foreign key relationships: