JabRef / jabref

Graphical Java application for managing BibTeX and biblatex (.bib) databases
https://devdocs.jabref.org
MIT License

Shared data synchronization #7618

Closed tobiasdiez closed 1 year ago

tobiasdiez commented 3 years ago

With the upcoming implementation of an online version for JabRef, we face the issue of synchronizing data across multiple devices (and eventually different users).

Setting

Assumptions

JabRef's requirements are summarized as follows:

Algorithm:

For each entry, we store the following data on the client side:

On server side, the "shared id" and a "server version number" is stored.

Then the sync operations are implemented as follows:

As an example:

| Server version | Client version | Local version | State | Operation |
| --- | --- | --- | --- | --- |
| 2 | 1 | 1 | Server entry has been updated; no update of client entry | Replace client entry with server entry |
| 1 | 1 | 2 | Client entry has been updated; no update of server entry | Replace server entry with client entry |
| 2 | 1 | 2 | Client and server entry have been updated | Let user resolve conflict |

Details and overview of (different) options

Lock systems

One node acquires a lock to write changes. All other write changes that happened during the same time are rejected or marked as conflicts that need to be resolved by the user. Here we should favor optimistic offline locking (always allow users to make changes, but committing to the server requires the lock) over pessimistic locking (only allow writing while holding the lock), since the chance of conflicts is low (A1) and users should always be able to make edits even when offline (A4). Optimistic offline locking is the approach we currently use for sync with shared databases. In particular, we have a "version" counter that is incremented upon every write. If the user's version counter is smaller than the remote version counter, then this is an indication that there was an additional update to the remote database in the meantime, and the local change is rejected.

The current implementation may run into problems when a client is offline (if I'm not mistaken). For example, suppose that two users have the same entry with version counter zero. If user 1 changes the entry while being offline, and the entry is updated on the server by user 2 in the meantime (thus the server version counter is incremented), then user 1's version counter is smaller than the server's, indicating that user 1's entry needs to be replaced by the remote version. But this would overwrite the changes of user 1.

In general, I have the impression that optimistic locks are great for handling concurrent transactions (two users trying to update the same entry at the same time), but have shortcomings when used as a data synchronization protocol.

Optimistic replication

Locking (in the way we use it) ensures that there is essentially only one valid replica (the server version) and all user copies are almost immediately synced, leading to essentially one copy of the data across multiple nodes. In contrast, optimistic replication accepts that for some short time there might be divergent replicas that only eventually converge to an end state. Having divergent versions of data seems to be a realistic scenario, as users may work offline for some time (A6). The goal of optimistic replication is to have eventual consistency, which should be sufficient for our purposes.

There are different options varying along different axes.

Old proposed algorithm: Every node has a vector clock (for each entry) and increments its own time once the entry is modified. Every update message from the server includes the vector clock, which is then used to determine the merge strategy by comparing it with the local vector clock:

When the user reconnects and wants to update (pull sync), she sends all (shared) ids of entries and the local vector clock to the server. The response will be all entries with more recent times (in any of the clocks). This is based on the SData description.

One question I still have is how to store the vector clock (in the entry?) and name the nodes (user + device name?).
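For illustration, the comparison step could be sketched as follows. This is a minimal, self-contained sketch and not JabRef code; the class, enum, and node names are all assumptions:

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative vector-clock comparison; node ids (e.g. user + device name) map to logical times. */
public class VectorClock {
    public enum Order { EQUAL, BEFORE, AFTER, CONCURRENT }

    /** Compares two clocks; a node missing from a clock counts as time 0. */
    public static Order compare(Map<String, Integer> a, Map<String, Integer> b) {
        boolean aAhead = false;
        boolean bAhead = false;
        Map<String, Integer> union = new HashMap<>(a);
        b.forEach((node, time) -> union.merge(node, time, Integer::max));
        for (String node : union.keySet()) {
            int ta = a.getOrDefault(node, 0);
            int tb = b.getOrDefault(node, 0);
            if (ta > tb) aAhead = true;
            if (tb > ta) bAhead = true;
        }
        if (aAhead && bAhead) return Order.CONCURRENT; // divergent edits: needs a merge
        if (aAhead) return Order.AFTER;                // a is strictly newer: push a
        if (bAhead) return Order.BEFORE;               // b is strictly newer: take b
        return Order.EQUAL;                            // identical clocks: nothing to do
    }
}
```

The CONCURRENT case is exactly what a single Lamport counter cannot express, which is the whole point of the vector clock here.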

Other approaches:

References:

koppor commented 3 years ago

> (A4). Optimistic offline lock is the approach we use currently for sync with shared databases. In particular, we have a "version" counter that is incremented upon every write process. If the user's version counter is smaller than the remote version counter, then this is an indication that there was an additional update to the remote database in the meantime and the local change is rejected.

The change is not rejected; there is a merge. In case user1 modifies fieldA and user2 modifies fieldB, both updates go in. If both modify fieldA, the "resolve conflicts" dialog is shown. Similar to git, if one modifies the same line of code.

One could implement something more intelligent here, if domain-specific knowledge is used. For instance, if an author is added by user1 and user2, the update algorithm could just add the two authors.
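The field-wise merge described above can be sketched as a three-way merge over an entry's fields. This is an illustrative sketch, not JabRef's actual implementation; the class name, the `<CONFLICT>` marker, and the treatment of fields as plain strings are all assumptions (deleted fields are ignored here):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative three-way merge of entry fields; base = last synced state. */
public class FieldMerge {
    public static Map<String, String> merge(Map<String, String> base,
                                            Map<String, String> local,
                                            Map<String, String> remote) {
        // Start from the remote state and fold in compatible local changes.
        Map<String, String> result = new LinkedHashMap<>(remote);
        for (Map.Entry<String, String> e : local.entrySet()) {
            String field = e.getKey();
            String baseValue = base.get(field);
            String localValue = e.getValue();
            String remoteValue = remote.get(field);
            if (localValue.equals(remoteValue)) {
                continue;                        // both sides agree: nothing to do
            } else if (remoteValue == null || remoteValue.equals(baseValue)) {
                result.put(field, localValue);   // only we changed it: keep ours
            } else if (localValue.equals(baseValue)) {
                // only the remote side changed it: remote value is already in result
            } else {
                result.put(field, "<CONFLICT>"); // both changed it: user must resolve
            }
        }
        return result;
    }
}
```

With this shape, the "resolve conflicts" dialog would only be needed for fields that land in the last branch; the author example (user1 and user2 each adding an author) would need a smarter, field-type-aware merge in that branch.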

> The current implementation may run into problems when a client is offline (if I'm not mistaken). For example, suppose that two users have the same entry with version counter zero. If user 1 changes the entry while being offline, and the entry is updated on the server by user 2 in the meantime (thus the server version counter is incremented), then user 1's version counter is smaller than the server's, indicating that user 1's entry needs to be replaced by the remote version. But this would overwrite the changes of user 1.

No. See above. JabRef implements the above merge algorithm.

It does not work well if two persons are working on the same field at the same time: The user always gets the merge entries dialog presented.

> Using GraphQL subscriptions, it should be straightforward to implement a broadcast system for updates (while the user is online). Supplemented by a pull after times of being offline (e.g. JabRef start, connection loss).

We implemented that in our Postgres setting.

> * _Maintaining a "happens-before" relationship_

This is, more or less, optimistic offline lock.

@JabRef/developers What do you think?

Stick with optimistic offline lock. It seems it works very well for the cases you described. Furthermore, it is very simple to implement. It works both in a "real time" data synchronization and in an offline setting.

tobiasdiez commented 3 years ago

Thanks @koppor, apparently I misread the code a bit. So what happens if the user is offline, changes something, and later goes online? When is the version counter incremented, and how do you recognize that there were changes to the local database in the last session that need to be pushed to the server?

koppor commented 3 years ago

Side comment: We can try the effects with a shared Postgres instance. See https://docs.jabref.org/collaborative-work/sqldatabase#try-it-out for details.

I assume the tool is not crashing. (If we discuss crashing, we need a local database storing the local id.) I hope I recall everything correctly; otherwise, I would need to buy the POSA book.

The secret sauce is that the client stores the version number of the record when it received the record from the server (call it first-number). The client increases that number by one when writing. The write is done in one transaction:

In case another client pushed in between, the "current number of the server" is not equal to "first-number". Thus, the client needs to retrieve the data (update the first-number), perform a merge, and retry.

koppor commented 3 years ago

All records need to be updated where first-number != local-number (where local-number is first-number incremented each time the record is updated locally).
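The compare-and-set write described in the comments above can be sketched with a toy in-memory server. This is an illustrative sketch only; the class and method names are assumptions, not JabRef's actual classes:

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy server illustrating the "write in one transaction" rule: the write only
 *  succeeds if the client's first-number still matches the server version. */
public class OptimisticServer {
    private String entry = "initial";
    private final AtomicInteger version = new AtomicInteger(1);

    public String entry() {
        return entry;
    }

    /** Returns the new server version on success, or -1 if another client
     *  pushed in between (the caller must then fetch, merge, and retry). */
    public synchronized int write(int firstNumber, String newEntry) {
        if (version.get() != firstNumber) {
            return -1;                        // stale first-number: reject
        }
        entry = newEntry;
        return version.incrementAndGet();     // write + version bump, atomically
    }
}
```

A client that went offline would first scan for dirty records (first-number != local-number) and attempt this write for each of them, merging and retrying on -1.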

koppor commented 3 years ago

Small note: Back in 2016, we did not have the concept of ADRs at hand. Everything discussed was between @obraliar and me. This synchronization concept was developed in the "StuPro 2015/2016". - It just came to my mind that the key/value field storage is one consequence of that: with a version on each field, JabRef knows whether the field was updated remotely. Thus, we know whether we have to update a field. Thus, it is possible to work on the same entry. Since we did not implement "smart merge" on a field basis, one gets the "merge entries" dialog if two users concurrently work on one field.

Note that only Postgres and Oracle support live synchronization (because they offer a pub/sub system). MySQL does not offer pub/sub; thus, in the case of MySQL, the user has to poll for changes regularly.

tobiasdiez commented 3 years ago

Thanks Olly. I think I understood the algorithm. But I have trouble seeing how it works if not all changes are immediately pushed to the server.

Start with some entry, at version = 0. Say you are offline and remove a field, then close JabRef (version is still 0 since it was not yet transmitted to the server, right?). In the meantime someone else modifies the entry on the server, so that the server version = 1. Now you start JabRef in online mode. The server notifies you about the update, and the algorithm recognizes that 0 = local.version < server.version = 1, so the server version of the deleted field is used (right?). But the "correct" behavior would be to delete the deleted field and merge the other changes. So how does one recognize that locally the field was removed and not that the server version had the field added?

In more theoretical terms, the problem is that a Lamport clock only provides a "happened-before" partial order. I.e., if version counter A is smaller than version counter B, then you know that A didn't happen after B. This is perfect for the update scenario you described above, where we can use it to detect updates to the server replica. However, it is not possible to conclude from versionA < versionB that B happened after A. See https://en.wikipedia.org/wiki/Lamport_timestamp#Implications.

The easiest solution (and if I understood it correctly, one can actually prove that it is the minimal data structure doing this job) is to keep track of the local version as well, i.e. have an additional counter for each node that is increased when the local replica is modified. This is exactly the version / vector clock approach suggested above. This would be only a small modification of the offline lock algorithm to gracefully handle offline scenarios as well.

tobiasdiez commented 3 years ago

I thought a bit more about it and came to the realization that the "vector clock" solution is a bit more complicated than what we need. If we have a star architecture, then every sync goes via the central server. Thus, a client doesn't need to know the local version of another client (which is what the vector clock gives). Instead, it suffices for the client to keep track of local updates relative to the server version. Thus my new proposal would be:

Then the sync operations can be implemented as follows:

In addition, the server, local and client version numbers need to be updated accordingly after each of these operations.

| Server version | Client version | Local version | State | Operation |
| --- | --- | --- | --- | --- |
| 2 | 1 | 1 | Server entry has been updated; no update of client entry | Replace client entry with server entry |
| 1 | 1 | 2 | Client entry has been updated; no update of server entry | Replace server entry with client entry |
| 2 | 1 | 2 | Client and server entry have been updated | Let user resolve conflict |
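The decision table above could be sketched as follows; the class, enum, and parameter names are illustrative, not an actual implementation:

```java
/** Illustrative sketch of the three-counter sync decision from the table above. */
public class SyncDecision {
    public enum Op { NOTHING, REPLACE_CLIENT, REPLACE_SERVER, RESOLVE_CONFLICT }

    /**
     * serverVersion: counter on the server, incremented on each server-side write.
     * clientVersion: the server version the client last synced against.
     * localVersion : equals clientVersion after a sync; incremented on each local edit.
     */
    public static Op decide(int serverVersion, int clientVersion, int localVersion) {
        boolean serverChanged = serverVersion > clientVersion;
        boolean localChanged = localVersion > clientVersion;
        if (serverChanged && localChanged) {
            return Op.RESOLVE_CONFLICT;   // both sides diverged: let the user merge
        } else if (serverChanged) {
            return Op.REPLACE_CLIENT;     // only the server entry was updated
        } else if (localChanged) {
            return Op.REPLACE_SERVER;     // only the client entry was updated
        }
        return Op.NOTHING;                // versions agree: nothing to do
    }
}
```

The key difference to the plain optimistic offline lock is the separate local counter, which lets the client tell "server moved ahead" apart from "I made unsynced edits".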

If client and local version numbers are stored in the entry, then A7 is supported. However, manual changes to the BibTeX file are not supported (unless the user also increases the local version number).

What do you think, @koppor?

koppor commented 3 years ago

What you are explaining in the table is the optimistic offline lock.

(Assumption: "local version" = ".bib" file. Then it is easy, too.)

As soon as the "client" is started, it knows the version of the .bib file when it was last synched. Then, it loads the local bib file. In case of any differences, it increases the version number. Then, it starts synchronizing with the server.

Cold start: The client is not aware of anything. In case the bib file already exists on the server, a full comparison has to be made.

Please note that we put the version number on fields; thus, users can edit the same entry. The merge-entries dialog only pops up if they touch the same field.

koppor commented 3 years ago

I found our implementation: org.jabref.logic.shared.DBMSSynchronizer#synchronizeLocalDatabase, called from org.jabref.logic.shared.DBMSSynchronizer#pullChanges.

The tests are in org.jabref.logic.shared.SynchronizationTestSimulator.

Oh, wow, with that in mind, one has "just" to implement org.jabref.logic.shared.DatabaseSynchronizer for synchronization with JabRef online. And everything works.

tobiasdiez commented 3 years ago

You are right, the algorithm is very similar to the optimistic lock. However, there you only have one "version" that is incremented upon change and used for the comparison. This works perfectly when you always have a connection to the server and can directly push changes. However, when you are offline and change an entry, this may lead to problems, as described above:

> Start with some entry, at version = 0. Say you are offline and remove a field, then close JabRef (version is still 0 since it was not yet transmitted to the server, right?). In the meantime someone else modifies the entry on the server, so that the server version = 1. Now you start JabRef in online mode. The server notifies you about the update, and the algorithm recognizes that 0 = local.version < server.version = 1, so the server version of the deleted field is used (right?). But the "correct" behavior would be to delete the deleted field and merge the other changes. So how does one recognize that locally the field was removed and not that the server version had the field added?

That's why I proposed to have a "local version" that keeps track of unsynced local changes (you can see this as another optimistic lock).

> As soon as the "client" is started, it knows the version of the .bib file when it was last synched.

Keeping a copy of the last synced state would indeed be another solution for this issue. Is this already implemented?

Thanks also for the input on field vs entry sync. I need to think about this in more detail. What are the advantages of the field sync in your opinion?

koppor commented 3 years ago

> Start with some entry, at version = 0. Say you are offline and remove a field, then close JabRef (version is still 0 since it was not yet transmitted to the server, right?).

The client has to be smart enough to note that it locally changed something. This is what you described by "keeping a copy of the last synced state".

I agree, we don't keep the copy. When we thought about these issues, we found out that BibTeX data is a key/value structure: key1 could be authors, key2 journal, etc. Thus, we can put versions into fields!

This also answers your question:

> Thanks also for the input on field vs entry sync. I need to think about this in more detail. What are the advantages of the field sync in your opinion?

If the server touches field1 and we modify field2, we can easily (!) be aware of these changes.

If we do not do that and version the complete entry as a whole, then we run into the issues you describe. Then, the local view is absolutely necessary to correctly merge the updates.

> But the "correct" behavior would be to delete the deleted field and merge the other changes. So how does one recognize that locally the field was removed and not that the server version had the field added?

With version on the field and notifications that fields are added and removed.

> That's why I proposed to have a "local version" that keeps track of unsynced local changes (you can see this as another optimistic lock).

We somehow tried to avoid the extra local copy, but I agree it's necessary if one really works offline.


I would store a synchronization state instead of the plain data view received from the server.

```yaml
field1:
  serverVersion: 1
  clientVersion: 1
field2:
  serverVersion: 1
  clientVersion: 2
```

Then, each sync attempt can use this information. If the server version changed, it needs to be fetched and merged accordingly.
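As an illustration of how a sync attempt could consume this per-field state, here is a small sketch; the record and method names are assumptions, not JabRef code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative per-field sync state, mirroring the serverVersion/clientVersion pairs above. */
public class FieldSyncState {
    public record Versions(int serverVersion, int clientVersion) {}

    /** Returns the fields whose server version moved past the last synced one,
     *  i.e. the fields that need to be fetched and merged on the next sync. */
    public static List<String> fieldsToFetch(Map<String, Versions> state,
                                             Map<String, Integer> currentServerVersions) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Versions> e : state.entrySet()) {
            int known = e.getValue().serverVersion();
            int actual = currentServerVersions.getOrDefault(e.getKey(), known);
            if (actual > known) {
                stale.add(e.getKey());  // server changed this field since the last sync
            }
        }
        return stale;
    }
}
```

Conversely, fields with clientVersion ahead of serverVersion (like field2 in the example) are the ones with unsynced local edits that need to be pushed.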

I thought this is easier than a BibTeX diff. - However, a BibTeX diff can IMHO produce similar data. For @obraliar and me, the field versions seemed to be easier to implement and test. - We found a proper BibTeX diff harder than a field diff.

In the case of Overleaf, I would do a BibTeX diff. Based on that, I would create a field diff and set the version numbers of the fields accordingly. Based on that, I would do the sync of the local state with the server state. That would be easier for me, as I find there is more control over each phase (diffing -> handling of field diffs).

tobiasdiez commented 3 years ago

Thanks again. I thought a bit more about the per-field vs per-entry versioning system. Disadvantages I see with per-field are as follows:

Also, if I understand the code correctly, it currently only supports versions per entry anyway. https://github.com/JabRef/jabref/blob/fa8cc829727ffddd97a9056fd466e7098577f4d7/src/main/java/org/jabref/logic/shared/DBMSSynchronizer.java#L195 Do I miss something here?

koppor commented 3 years ago

(Notes from DevCall 2021-06-07)

TLDR: Version bibtex entry, make merge server-side in one transaction --> offer options to user

Where to store version?

Versioning

Optimistic offline lock

"Relative time" (see also vector clocks)

Timestamp-based sync

Personal notes:

> * Higher data volume: At the beginning of the sync session, you need to query the server for the version of each field. That's almost the same payload as just querying the complete library. For a relatively normal database of 10k entries, you will need to query 100k version numbers.

No -> server-side processing.

> * There might be edge cases where users would actually prefer to be warned about updates of other fields. Consider, for example, a book entry whose authors correspond to a different edition than the one specified in the edition field. User A recognizes this and updates the edition field to the correct edition. User B also recognizes the mistake, but chooses to keep the edition and instead updates the authors. This can be merged without any conflicts as the users updated different fields, but the resulting entry is wrong.

Fuck :)

> Also, if I understand the code correctly, it currently only supports versions per entry anyway.
>
> https://github.com/JabRef/jabref/blob/fa8cc829727ffddd97a9056fd466e7098577f4d7/src/main/java/org/jabref/logic/shared/DBMSSynchronizer.java#L195
>
> Do I miss something here?

Can't be true. Code lies. :)

tobiasdiez commented 3 years ago

I finally found a bit more time to think about this.

Olly had a very good point during the DevCall, pointing out that users can also edit entries outside of JabRef, and this needs to be picked up by JabRef before it syncs (so the scenario is: user closes JabRef > edits bib file manually > starts JabRef). Regardless of where we save the local version number (in the entry or in a separate file next to it), these user edits don't trigger an update of the version number, and thus the sync algorithm gets confused. This means we need a reliable way to make sure the local version is increased also when the user changes the bib file.

My proposal for this is as follows: In a separate file (called "sync file" in the following), we save the following:

Once JabRef starts (or, more precisely, loads the bib file) and there is a sync file, JabRef will compute the hashes for all entries and compare them with the hashes in the sync file. If the hashes coincide, we do nothing. But if the hashes differ, the user changed the entry manually, and we increase the local version number. Only after this update process is the sync with the server initiated.
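A minimal sketch of this hash check could look as follows. The choice of SHA-256, the class and method names, and the idea of hashing a serialized entry string are assumptions for illustration, not a finished design:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

/** Illustrative sketch of the proposed entry-hash check against the sync file. */
public class EntryHash {
    /** Hash of the entry's serialized form, as it would be stored in the sync file. */
    public static String hash(String serializedEntry) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(serializedEntry.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(bytes);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available on the JVM
        }
    }

    /** True if the entry changed outside of JabRef since the last sync; in that
     *  case the local version number must be increased before syncing starts. */
    public static boolean changedExternally(String storedHash, String serializedEntry) {
        return !storedHash.equals(hash(serializedEntry));
    }
}
```

On load, JabRef would run changedExternally for every entry against the stored hashes and bump the local version for each mismatch before initiating the sync.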

@JabRef/developers What do you think?

I'll update the issue description with the new proposal.

calixtus commented 3 years ago

Does this mean we need to keep two files for one library in the future - a library file and a hash file? This would probably be the worst solution imho.

tobiasdiez commented 3 years ago

One could also store the hash in the entry itself.

But why do you think a separate file is not a good solution? There are certain advantages to storing this externally, like not polluting the entry with sync-related data (which makes it easier to send the entry to other people), and users cannot accidentally delete/modify the sync data, making everything more robust. On the other hand, it shouldn't be too hard to store the file out of sight of the user (say in %AppData%) so that they don't even notice we keep this data.

calixtus commented 3 years ago

My concern was regarding the portability of the libraries across different PCs. Would you have to share the hash files then too?

tobiasdiez commented 3 years ago

No, the user doesn't need to copy these hash files; they are solely managed by JabRef. If the hash file does not exist, then JabRef downloads the most recent version of the entries from the database and compares them with the local entries. Any changes are treated as conflicts and need to be handled manually by the user (since we don't know which version is older/newer). And then the hash file is created.

However, if the database is synced externally as well (say using git or Dropbox), there is a small performance advantage to syncing the hash file as well (which can be taken as an argument for embedding the hash in the entry itself). For example, consider the following scenario (with no sharing of the hash file):

If the hash files are shared, then the changes via the git sync are not recognized as external changes and no unnecessary server query is invoked.

That's the only small advantage I see for storing the hash in the entry itself.

calixtus commented 3 years ago

Yes, that was more or less what I meant. 😅 But if you say the performance disadvantage is negligible if not sharing the hash file as well, I guess it's ok.