MerginMaps / geodiff

Library for handling diffs for geospatial data
https://merginmaps.com
MIT License
150 stars 17 forks source link

Diff between 2 PostgreSQL databases returns nothing #185

Closed mdouchin closed 2 years ago

mdouchin commented 2 years ago

Hi,

I am using the Dockerfile to build and use geodiff with the PostgreSQL driver enabled, with the following steps:

# Clone repository
git clone git@github.com:MerginMaps/mergin-db-sync.git
cd mergin-db-sync

# Build docker image
docker build -t mergin_image .

# Run the drivers command
docker run -it --rm --network="host" --volume /home/mdouchin/:/home/mdouchin/ --name=mergin_container mergin_image /geodiff/build/geodiff drivers
# returns
sqlite
postgres

NB: I created a bash alias to run the docker command with the simple geodiff

Test data:

# Create the source database
dropdb geodiff_source
createdb geodiff_source
psql -d geodiff_source -c "CREATE EXTENSION postgis"

# Fill it with test data
psql -d geodiff_source -c "CREATE SCHEMA eee"
psql -d geodiff_source -c "CREATE TABLE eee.points (id serial primary key, p_name text, geom geometry(point, 2154))"
psql -d geodiff_source -c "INSERT INTO eee.points (p_name, geom) VALUES ('first', ST_MakePoint(0, 1))"
psql -d geodiff_source -c "INSERT INTO eee.points (p_name, geom) VALUES ('second', ST_MakePoint(5, 10))"

# Create the second database
dropdb geodiff_target
createdb geodiff_target
psql -d geodiff_target -c "CREATE EXTENSION postgis"

I successfully ran PostgreSQL related commands, for example:

# Get the source database schema as JSON
geodiff schema --driver 'postgres' "host=localhost port=5432 user=mdouchin password=**** dbname=geodiff_source" eee schema_pg.json

# create a copy of a PostgreSQL database "geodiff", schema "eee" to another database "geodiff_target", schema "eee"
geodiff copy --driver-1 'postgres' "host=localhost port=5432 user=mdouchin password=**** dbname=geodiff_source" --driver-2 'postgres' "host=localhost port=5432 user=mdouchin password=**** dbname=geodiff_target" eee eee

# This works well: the target database has been populated with the eee schema and the points table
md ~ $ psql -d geodiff_target -c "SELECT * FROM eee.points"
 id | p_name |                        geom                        
----+--------+----------------------------------------------------
  1 | first  | 01010000206A0800000000000000000000000000000000F03F
  2 | second | 01010000206A08000000000000000014400000000000002440
(2 lignes)

Then I inserted one more feature in the unique table eee.points in the geodiff database, and try to create a diff between the two databases.

# Insert one more feature in the source database
psql -d geodiff_source -c "INSERT INTO eee.points (p_name, geom) VALUES ('third', ST_MakePoint(3, 5))"

# Get the diff
geodiff diff --json --driver-1 'postgres' "host=localhost port=5432 user=mdouchin password=**** dbname=geodiff_source" --driver-2 'postgres' "host=localhost port=5432 user=mdouchin password=**** dbname=geodiff_target" eee eee
# Returns
{
  "geodiff": []
}

The returned JSON is "empty".

I also tried with the same source database, but with a SQLite database as the target database, and it worked: the diff corresponded to the expected differences between the 2 databases:

# Copy the eee schema of the source PostgreSQL database to the GeoPackage geodiff_target.gpkg
geodiff copy --driver-1 'postgres' "host=localhost port=5432 user=mdouchin password=**** dbname=geodiff_source" --driver-2 'sqlite' "" eee /home/mdouchin/geodiff_target.gpkg

# The GPKG data feels good
sqlite3 geodiff_target.gpkg
sqlite> SELECT * FROM points;
1|first|GP
2|second|GP
3|third|GP
exit

# Insert a new line in the source points table
psql -d geodiff_source -c "INSERT INTO eee.points (p_name, geom) VALUES ('fourth', ST_MakePoint(10, 2))"

# Check the diff
geodiff diff --json --driver-1 'postgres' "host=localhost port=5432 user=mdouchin password=**** dbname=geodiff_source" --driver-2 'sqlite' "" eee /home/mdouchin/geodiff_target.gpkg 

which returns

{
  "geodiff": [
    {
      "changes": [
        {
          "column": 0,
          "old": 4
        },
        {
          "column": 1,
          "old": "fourth"
        },
        {
          "column": 2,
          "old": "R1AAAWoIAAABAQAAAAAAAAAAACRAAAAAAAAAAEA="
        }
      ],
      "table": "points",
      "type": "delete"
    }
  ]
}

I was wondering if it was possible to use geodiff for a PostgreSQL/PostgreSQL diff tool, or is it more "field oriented" and designed for GeoPackage as the pivot format ?

Thanks a lot for your great work on this tool !

PeterPetrik commented 2 years ago

Hi @mdouchin and thanks for great & reproducible ticket! I think in principle it should work, at the end, db-sync works both ways, so if you change the PG database it propagates too back to mergin.

do you have particular use-case for this workflow, so we know if we need to prioritize this bugfix? It would be nice to have a catch-up with you, can you drop me a mail on peter.petrik@lutraconsulting.co.uk? Or do you attend FOSS4G Firenze any chance?

mdouchin commented 2 years ago

Hi @PeterPetrik I will attend the Foss4G in Firenze and would be glad to share ideas around geodiff use cases.

I have 2 use cases in mind:

I will email you with more context. Let's stay connected and see us in Firenze !

mdouchin commented 2 years ago

I have checked the behaviour if I use only one database with a copy made in another schema, and it works well. So the issue seems to occur when using 2 separated PostgreSQL databases

# Reinit the database geodiff_source (see first message above)
#
# Copy content from the "eee" schema to a new schema "eee_copy"
geodiff copy --driver 'postgres' "host=localhost port=5432 user=mdouchin password=**** dbname=geodiff_source" eee eee_copy

# Insert new line
psql -d geodiff_source -c "INSERT INTO eee.points (p_name, geom) VALUES ('third', ST_MakePoint(3, 5))"

# Get the diff
geodiff diff --json --driver 'postgres' "host=localhost port=5432 user=mdouchin password=**** dbname=geodiff_source" eee_copy eee
# returns correct JSON

returns

{
  "geodiff": [
    {
      "changes": [
        {
          "column": 0,
          "new": 3
        },
        {
          "column": 1,
          "new": "third"
        },
        {
          "column": 2,
          "new": "R1AAAWoIAAABAQAAAAAAAAAAAAhAAAAAAAAAFEA="
        }
      ],
      "table": "points",
      "type": "insert"
    }
  ]
}
wonder-sk commented 2 years ago

hi @mdouchin from a quick look into the code it seems there is a logic mistake in GEODIFF_createChangesetDr() that allows to short circuit possibly costly comparison:

  if ( strcmp( driverSrcName, driverDstName ) == 0 )
  {
    return GEODIFF_createChangesetEx( contextHandle, driverSrcName, driverSrcExtraInfo, src, dst, changeset );
  }

However in this comparison we should be also checking that driverSrcExtraInfo and driverDstExtraInfo are the same, to fix this bug.

Comparison of data between two different postgres databases is not particularly efficient - geodiff will export data of schemas to temporary sqlite files and then compare them.

For your use case to synchronize two databases, even if cross-database diff was efficient, you still need some common ancestor - if both databases change, to know what changes to move across (assuming both databases can change - if not then things are simpler). The best approach using geodiff would be to have two schemas in each of the databases:

Then whenever you sync, you would create diff between "base" and "normal" schemas in each database, apply the diff on the other database, and finally update the "base" schemas, something like this:

  1. create diff: A_normal - A_base = A_diff
  2. create diff: B_normal - B_base = B_diff
  3. if both diffs contain changes: rebase A_diff on top of B_diff (or vice versa)
  4. apply diffs to A_normal and B_normal and A_base and B_base

If only one database can change, things are easier, and only one "base" schema is needed during sync:

  1. create diff: A_normal - A_base = A_diff
  2. apply diff A_diff to B
  3. apply diff A_diff to A_base
mdouchin commented 2 years ago

Hi @PeterPetrik @wonder-sk Thank you for your detailed explanation, completed by our discussion during this Foss4G2022 ! I will do some more tests, and report back if needed. Feel free to close this issue if necessary

PeterPetrik commented 2 years ago

Thanks, it is still a bug so lets keep this one open

mdouchin commented 2 years ago

@wonder-sk I need some more help here:

Then whenever you sync, you would create diff between "base" and "normal" schemas in each database, apply the diff on the other database, and finally update the "base" schemas, something like this:

1. create diff: A_normal - A_base = A_diff

2. create diff: B_normal - B_base = B_diff

3. if both diffs contain changes: rebase A_diff on top of B_diff (or vice versa)

4. apply diffs to A_normal and B_normal and A_base and B_base

In the step 3/ if I use the rebase-diff method, the diff are well rebased, but if there is a conflict, it is not "removed" from the rebased diff. Is it done in purpose ? Which means geodiff client should rebase, then check the conflict file, and remove the conflicts listed in the conflict file from the rebased diff ?

Or should I prefer to just use the geodiff rebase-db command which will only aplly the not conflicting changes ?

mdouchin commented 2 years ago

Do you prefer me to create a new issue for my questions regarding PostgreSQL database synchronisation (comment above ) for more clarity ?

wonder-sk commented 2 years ago

Do you prefer me to create a new issue for my questions regarding PostgreSQL database synchronisation (comment above ) for more clarity ?

It's okay to keep the discussion here, sorry I just forgot to answer the questions above...

n the step 3/ if I use the rebase-diff method, the diff are well rebased, but if there is a conflict, it is not "removed" from the rebased diff. Is it done in purpose ? Which means geodiff client should rebase, then check the conflict file, and remove the conflicts listed in the conflict file from the rebased diff ?

That's intentional... The idea is that whenever there is an edit conflict while rebasing, we may not be able in a good position to resolve the conflict with a proper 3-way merge GUI (because rebase may be done on a phone with a user with limited knowledge, or it is done in an automated system like db-sync which we don't want to pause because of a minor conflict). So we simply accept the latest change as the winner of the edit conflict ("our" version wins), and this automatic conflict resolution info is stored to a new JSON file, which gets uploaded to the project directory, so the project admin can inspect the automatic conflict resolution and possibly override the values... Does that make sense?

mdouchin commented 2 years ago

Thanks a lot for your answer @wonder-sk

This makes sense ! Since this kind of conflict should happen rarely, it is indeed a robust and fast resolution method. I will propose a PR with some explanation whenever I find enough time to correctly write it down.

Do you think geodiff could provide a "conflict" merging system, for example to merge all the conflicts which have happened between 2 dates ? For example, if you have synchronized 10 times between the 2 databases, and once per week/month, you need to have a look at the conflicts, but in a simple way. It would be handy to read one single conflict file instead of possibly 3 or 4. I could propose a PR for this "conflict merging tool", even if my C++ coding skills are a bit rusty... I could also propose a Python QGIS processing alg which will do the merge and create a vectory layer to visualize the conflicts ?

wonder-sk commented 2 years ago

Do you think geodiff could provide a "conflict" merging system, for example to merge all the conflicts which have happened between 2 dates ? For example, if you have synchronized 10 times between the 2 databases, and once per week/month, you need to have a look at the conflicts, but in a simple way. It would be handy to read one single conflict file instead of possibly 3 or 4. I could propose a PR for this "conflict merging tool", even if my C++ coding skills are a bit rusty... I could also propose a Python QGIS processing alg which will do the merge and create a vectory layer to visualize the conflicts ?

With this conflict merging system, do you mean a GUI which would show all the recorded conflicts (on a map or a table or both) and it would allow inspection of each conflict, with some options to change the conflict resolution? There are certainly more ways how to do that - probably the easiest would be a QGIS processing alg as you mentioned, or it could be a complete GUI within QGIS (the Mergin plugin already comes with a GUI to compare changes - see https://github.com/MerginMaps/qgis-mergin-plugin/pull/397 - only missing the bits related to conflict resolution).

mdouchin commented 1 year ago

Thanks !