Website tips&trick section

michielbdejong commented 1 year ago

The website has an 'advantages' section that explains the "Why" of Federated Bookkeeping.

Maybe we can also add a 'tips and tricks' section that explains what we have found out so far about the "How". Some ideas:

"rule #1": technological sovereignty of nodes in the federation. Assume the internetwork is infinite and has no government, so we can never impose a single protocol or way to do things and expect all nodes in the federation to obey it. For comparison, even IPv4, though extremely successful, is not the only protocol used on the internet; some traffic goes over IPv6 or other protocols that routers may support.
multi-hop: Assuming the internetwork is infinite, it's not feasible to connect each node with each other node. However, a multi-hop path can exist between any two nodes.
Lookalike Groups: Entries in a database usually need to have distinct values for the primary key. But this primary key will in general not be transferred when translating data to other nodes; so when removing the local identifiers, table rows may become indistinguishable. In CSV files, identical rows are generally allowed since the line number acts as a primary key on the data. When syncing this data, it's a useful trick to count the number of lookalike entries, and instead of aiming for a one-to-one mapping, aim for a mapping where the number of members of each lookalike group is equal on both systems.
next-level formats: when converting data from format A to format B, if format A has some field that format B doesn't have, but format B does have a free text field or a binary data field, then the information from format A's field can be wrapped into this free form field of format B. Example: suppose the data is about bank transactions, and format A has a field called "transaction fee", which format B is missing. If format B does have a field called "comments" then a next-level format would be a convention where the transaction fee from format A appears as a formatted string inside the comments field of format B, e.g. "... [fee=1.25%] ...".
cliques: within a network, there can be fully connected subgraphs. This is a generalisation of "the club", or "tier-2" as researched in our timesheets project.
CRDTs: in regions of the network where nodes cooperate closely, CRDTs (conflict-free replicated data types) can be a useful tool.
Routing as Access Control: Don't forward data to nodes whose administrators should not have access to that data
Routing as Authentication: Only trust data that comes in from the direction it was expected to come from.
Digital signatures: when forwarding data over multiple hops, translating at each hop, it feels like digital signatures can be a useful tool. But a signature can only be created over a specific representation of the data, so when the data is translated to a different format, the signature can no longer be verified. We don't know yet if and how we can use digital signatures to maintain auditability; this will be something to brainstorm about in milestone 1 of the task tracking project.
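
To make the digital signatures point concrete, here is a rough sketch of why a signature does not survive translation. It uses an HMAC from the Python standard library as a stand-in for a real digital signature (an assumption made only to keep the example dependency-free); the key, the entry and both formats are made up for illustration.

```python
import hashlib
import hmac
import json

SECRET = b"demo-key"  # hypothetical key, for illustration only


def sign(representation: bytes) -> str:
    """Return a hex MAC over the exact bytes of one representation."""
    return hmac.new(SECRET, representation, hashlib.sha256).hexdigest()


def verify(representation: bytes, signature: str) -> bool:
    """Check a MAC against the bytes that are actually at hand."""
    return hmac.compare_digest(sign(representation), signature)


# The same logical entry in two hypothetical formats.
entry = {"date": "2021-03-01", "amount": "12.50", "description": "coffee"}
as_json = json.dumps(entry, sort_keys=True).encode()
as_csv = ",".join([entry["date"], entry["amount"], entry["description"]]).encode()

# The source node signs its own (JSON) representation.
signature = sign(as_json)

json_ok = verify(as_json, signature)  # the signed bytes still verify
csv_ok = verify(as_csv, signature)    # the translated bytes do not
```

The information content is identical in both representations, but only the bytes that were signed can be checked, which is the auditability problem described above.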

gsvarovsky commented 1 year ago

Lookalike Groups

This could lead to data integrity anomalies when data is updated (it sometimes matters which duplicate to update, if additional data has been attached to it in the target system). In general, I think it's always best to store the source system's key, in whatever form it was given. In the CSV example the only option is the row number, but this is still better than relying fully on a "fuzzy" match of some of the data content.

next-level formats

This is definitely an anti-pattern that leads to degradation of data fidelity! Of course, it's very common and sometimes it's the only thing you can do. But since this is an aspirational document, I think it would be nice to recognise that a much better way is to have a conversation with the target system's owners, and get a ticket on their backlog to provide the structured field. In the ideal world to which we aspire, systems should be agile!
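
For reference, the convention under discussion could look roughly like the sketch below. The tag syntax follows the "[fee=1.25%]" example from the first comment; the function names and the choice to keep the fee as a string are assumptions for illustration, not an agreed format.

```python
import re
from typing import Optional

# Matches the hypothetical "[fee=1.25%]" tag from the example above.
FEE_TAG = re.compile(r"\[fee=([0-9]+(?:\.[0-9]+)?)%\]")


def wrap_fee(comments: str, fee_percent: str) -> str:
    """Append format A's transaction fee to format B's free-text comments."""
    tag = f"[fee={fee_percent}%]"
    return f"{comments} {tag}".strip()


def unwrap_fee(comments: str) -> Optional[str]:
    """Recover the fee from the comments field, or None if no tag is present."""
    match = FEE_TAG.search(comments)
    return match.group(1) if match else None


wrapped = wrap_fee("monthly rent", "1.25")
recovered = unwrap_fee(wrapped)
```

Nothing in format B validates the tag, so the fee is lost as soon as someone edits or truncates the comments field. That is the fidelity risk, and a structured field in the target system would avoid it.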

Digital signatures

https://github.com/m-ld/timeld/pull/96 has further analysis (milestone #6 of the timesheets project)

michielbdejong commented 1 year ago

store the source system's key

Yes! That is definitely a useful thing to do whenever it's possible, i.e. if the source system exposes it, and if either the destination system or some third system has something like a database table for that. But if one of those two conditions fails, then I think it can still be meaningful to sync the information about lookalike group member counts between two systems, especially in situations where the data is immutable (such as entries in bank statements).
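
A minimal sketch of that count-based comparison, assuming rows have already been stripped of local identifiers and are represented as tuples of strings (the row layout and function names are hypothetical):

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Row = Tuple[str, ...]


def lookalike_counts(rows: Iterable[Row]) -> Counter:
    """Group indistinguishable rows and count the members of each group."""
    return Counter(rows)


def count_differences(local: Iterable[Row], remote: Iterable[Row]) -> Dict[Row, int]:
    """Return, per lookalike group, how many more rows local has than remote.

    Groups with equal counts on both sides are omitted, so an empty
    result means the two systems are in sync at the group level.
    """
    local_counts = lookalike_counts(local)
    remote_counts = lookalike_counts(remote)
    diff: Dict[Row, int] = {}
    for row in local_counts.keys() | remote_counts.keys():
        delta = local_counts[row] - remote_counts[row]  # missing keys count as 0
        if delta != 0:
            diff[row] = delta
    return diff


coffee = ("2021-03-01", "-2.50", "coffee")
groceries = ("2021-03-02", "-40.00", "groceries")

local_rows = [coffee, coffee, groceries]  # two identical coffee entries
remote_rows = [coffee, groceries]         # only one has arrived so far

diff = count_differences(local_rows, remote_rows)
```

Here `diff` reports that the remote side is one coffee entry short, without saying which of the two local duplicates it is. For immutable data that is enough to decide what to send next.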

michielbdejong commented 1 year ago

a much better way is to have a conversation with the target system's owners, and get a ticket on their backlog to provide the structured field.

Yes, true! It would actually be interesting to experiment with that. Some systems (e.g. ones operated by startups) will probably prove to be more agile than others (e.g. SWIFT messages between banks).

Research question: to what extent can we convince operators of database systems to add hooks and/or custom fields into their schema, to accommodate linking with "foreign" data that their systems may not natively support?

michielbdejong commented 1 year ago

In the ideal world to which we aspire, systems should be agile!

Although I agree with that at face value, maybe it's a sovereign right of an information system not to have agility as a priority?

Suppose a new standard is published to link profile pictures to bank accounts. That way, when you receive a bank transfer you see a little picture (company logo or person's mug shot) next to the entry in the bank statement, similar to how a message is presented in a chat app like WhatsApp.

A nice idea, but some banks will probably not update their bank statement layout to display these profile photos because they're just not that hip. But through PSD2 ("OpenBanking"), a startup could create a "bank statement viewer app" that fetches the transaction data from your bank, so the choice of data viewer app is separate from the choice of data storage system.

There would need to be a hook that reliably connects data in the legacy system with data in a third-party database (in this case the IBAN account number could be used as a unique identifier). We can experiment with this.
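
As a starting point for that experiment, the viewer app could join the two sources on the IBAN, roughly as sketched below. The field names, the picture URL and the shape of the third-party database are all assumptions; the IBANs are the usual documentation examples, not real accounts.

```python
from typing import Dict, List, Optional


def normalise_iban(iban: str) -> str:
    """Strip spaces and upper-case, so both sources use the same key form."""
    return iban.replace(" ", "").upper()


def attach_pictures(
    transactions: List[Dict[str, str]],
    pictures_by_iban: Dict[str, str],
) -> List[Dict[str, Optional[str]]]:
    """Add a picture_url to each transaction, or None if no profile is known."""
    lookup = {normalise_iban(k): v for k, v in pictures_by_iban.items()}
    return [
        {**tx, "picture_url": lookup.get(normalise_iban(tx["counterparty_iban"]))}
        for tx in transactions
    ]


# Data as it might come from the legacy system (hypothetical field names).
transactions = [
    {"amount": "-12.50", "counterparty_iban": "NL91 ABNA 0417 1643 00"},
    {"amount": "100.00", "counterparty_iban": "GB82 WEST 1234 5698 7654 32"},
]

# Data held by a third-party profile database (hypothetical).
pictures = {"NL91ABNA0417164300": "https://example.org/logos/acme.png"}

enriched = attach_pictures(transactions, pictures)
```

The legacy system is untouched in this sketch. The only thing it has to provide is an identifier stable enough to serve as the join key.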

Research question: Maybe connectedness can work around a lack of agility?

gsvarovsky commented 1 year ago

maybe it's a sovereign right of an information system not to have agility as a priority?

I agree with the principle of course. This is surely a bit like languages and laws – to gain the advantage of collaboration you choose to adopt some conventions which technically erode your sovereignty. I think having 'agility' on that list will certainly make things very awkward for some, as you rightly argue – but should we try to have the right incentives in place?

a hook that reliably connects data in the legacy system with data in a third-party database

This is now a bit more like a conventional federated data system, in which queries are distributed; as opposed to the timesheets-style gossip-based protocol. I wonder if we should explicitly compare and contrast "federation-on-read" (distributed queries) with "distribution-on-write" (gossip).

michielbdejong commented 1 year ago

contrast "federation-on-read" (distributed queries) with "distribution-on-write" (gossip).

@mlesmenio this morning you asked this exact question, right?

mlesmenio commented 1 year ago

Yeah, I talked about using distributed queries as an idea that would allow bypassing the loss of information in translation, and would also avoid storing a complete copy of the data on all the nodes.

federatedbookkeeping / research

Website tips&trick section #34