digidem / mapeo-core

Library for creating custom geo data and synchronizing via a peer to peer network
Displaying Observation IDs to users #104

Open gmaclennan opened 3 years ago

gmaclennan commented 3 years ago

Internally, Mapeo attaches a unique ID to every observation, node (point) and way (area or line). Devices also have unique IDs. We do not currently expose these IDs to the user, but they could be useful. @aliya-ryan, @rudokemper and @jencastrodoesstuff, I would be interested in your thoughts on whether we need a way of sharing Observation IDs. Here are some uses I have thought of:

  1. A user shares an observation via WhatsApp or other messaging app, and wants a way to reference back to the original record in Mapeo
  2. A user creates a PDF report, and wants a way to reference back to the original record in Mapeo

However, the IDs we use are very large (256-bit numbers), which take up to 78 digits when written out in decimal, e.g.

62803216861226671346839817194457547751680280778088307478677493634833867707041

This is a very long string to type if a user needs to find a record in Mapeo from an ID that has been shared with them.

It is possible to encode the ID using letters as well as digits, which reduces the number of characters. One potential encoding scheme is Base58, which uses the digits 1-9 and the letters a-z and A-Z, but excludes characters that are easily mistaken for one another: 0, O, l and I. The number above encoded as Base58 would be:

AM1Vrec34gWcd9fSBCP3YGd9Zq4r8rfMExemHsUWEj7e

That's 44 characters long, a slight improvement. However it is still a lot to type out, or even include in a text message.
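For anyone who wants to try the conversion, here is a minimal Python sketch. It assumes the Bitcoin-style Base58 alphabet and treats the ID as one big integer; `base58_encode` and `base58_decode` are illustrative names, not Mapeo APIs.

```python
# Bitcoin-style Base58 alphabet (an assumption): no 0, O, I or l.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(n: int) -> str:
    """Encode a non-negative integer as a Base58 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 58)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def base58_decode(text: str) -> int:
    """Decode a Base58 string; raises ValueError on an invalid character."""
    n = 0
    for ch in text:
        n = n * 58 + ALPHABET.index(ch)
    return n


# The example ID from this comment.
observation_id = 62803216861226671346839817194457547751680280778088307478677493634833867707041
encoded = base58_encode(observation_id)
print(len(str(observation_id)), "decimal digits ->", len(encoded), "Base58 characters")
```

Decoding reverses the process exactly, so no information is lost by switching alphabets; the string just gets shorter.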

The reason IDs are so large in Mapeo is that we cannot use an incrementing ID (0, 1, 2, 3, ...) as you would in a traditional centralized database. Each phone or laptop does not know how many IDs other devices have "used" already, so it does not know the next "free" ID. To solve this we use a very large random number, such that the probability of any two observations having the same ID is absolutely minuscule.

One way of making IDs easier to share is to use only a fragment of the ID for sharing. Within a single project there is a limit to how many observations can be collected, and smaller random IDs still have a very small probability of "collisions", i.e. two observations having the same ID. For example, if we use 64-bit IDs and the project has 1 million observations, the chance of a "collision" is approx 1 in 37 million. With a more realistic 100,000 observations, the collision chance is less than 1 in a billion (roughly 1 in 3.7 billion). Essentially minuscule.
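These figures can be reproduced with the standard birthday-problem approximation. The sketch below is only a back-of-envelope check; the function name `collision_probability` is illustrative.

```python
import math


def collision_probability(num_ids: int, bits: int) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-n(n-1) / (2 * 2**bits))."""
    space = 2 ** bits
    exponent = num_ids * (num_ids - 1) / (2 * space)
    # expm1 keeps precision when the exponent is extremely small.
    return -math.expm1(-exponent)


for n in (1_000_000, 100_000):
    p = collision_probability(n, 64)
    print(f"{n:>9,} observations: about 1 in {1 / p:,.0f}")
```

Because the probability grows with the square of the number of observations, ten times fewer observations makes a collision about a hundred times less likely.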

If we use the first 64 bits of the internal IDs for sharing, then the shareable Base58 string would be 11 characters long, e.g.

jKb3C98gtxt
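A rough Python sketch of deriving such a share ID from the internal ID follows. It assumes the internal ID is available as 32 raw bytes and uses the Bitcoin-style Base58 alphabet; padding to a fixed 11 characters is my own choice for illustration, and `share_id` is a hypothetical name.

```python
# Bitcoin-style Base58 alphabet (an assumption).
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
SHARE_BYTES = 8    # first 64 bits of the internal ID
SHARE_CHARS = 11   # 58**11 > 2**64, so 11 characters always suffice


def share_id(internal_id: bytes) -> str:
    """Return a fixed-width Base58 string for the first 8 bytes of a 32-byte ID."""
    if len(internal_id) != 32:
        raise ValueError("expected a 32-byte internal ID")
    n = int.from_bytes(internal_id[:SHARE_BYTES], "big")
    chars = []
    for _ in range(SHARE_CHARS):
        n, rem = divmod(n, 58)
        chars.append(ALPHABET[rem])
    # Small values are left-padded with '1' (the Base58 zero digit).
    return "".join(reversed(chars))


example = bytes(range(32))  # stand-in for a real internal ID
print(share_id(example))
```

Keeping the width fixed means two share IDs can be compared as plain strings, at the cost of occasionally typing a leading `1`.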

There is still the problem of the user making a mistake when typing this out. One way around that is to use an encoding that includes an internal check, so we can tell the user that the ID was entered incorrectly, rather than reporting that the ID does not exist. One way of encoding with a check is Base58Check. This does add another 6 characters to type though, so it may not be worth the trade-off. The same ID above with a check would be:

5mvxAMLYMYkMUWrBW
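To make the check concrete, here is a hedged Python sketch of the idea. It follows the usual Base58Check construction (append the first 4 bytes of a double SHA-256 of the payload) but omits the version byte that Bitcoin addresses carry; that omission, the alphabet, and the function names are assumptions for illustration.

```python
import hashlib

# Bitcoin-style Base58 alphabet (an assumption).
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def _checksum(payload: bytes) -> bytes:
    """First 4 bytes of SHA-256(SHA-256(payload))."""
    return hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]


def b58check_encode(payload: bytes) -> str:
    data = payload + _checksum(payload)
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = ALPHABET[rem] + out
    # Leading zero bytes are written as leading '1' characters.
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out


def b58check_decode(text: str) -> bytes:
    """Return the payload, or raise ValueError if the text was mistyped."""
    n = 0
    for ch in text:
        idx = ALPHABET.find(ch)
        if idx < 0:
            raise ValueError(f"invalid character: {ch!r}")
        n = n * 58 + idx
    pad = len(text) - len(text.lstrip("1"))
    data = b"\x00" * pad + n.to_bytes((n.bit_length() + 7) // 8, "big")
    if len(data) < 4:
        raise ValueError("too short to contain a checksum")
    payload, check = data[:-4], data[-4:]
    if _checksum(payload) != check:
        raise ValueError("checksum mismatch: the ID was probably mistyped")
    return payload
```

With this in place the app can distinguish "this ID was typed wrongly" (a `ValueError`) from "this ID is well-formed but not in the project".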

However we implement this, we should stick with what we choose, so that shared IDs remain consistent and usable into the future. I think 64-bit encoded as Base 58 is a good option.

In terms of implementation, if we use the top 8 bytes of the internal ID, then we might not need another index, since we can use the LevelDB lt and gt range options to match the 8-byte shared ID to the 32-byte internal ID.
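The range lookup can be illustrated without a database. The sketch below simulates an ordered key store with a sorted Python list and `bisect`; it is not LevelDB code, and `prefix_range` and `find_by_prefix` are hypothetical helpers. In a real LevelDB-style store the two bounds would be passed as the range options of a read.

```python
import bisect


def prefix_range(prefix: bytes, key_len: int = 32):
    """Smallest and largest possible keys that start with the given prefix."""
    pad = key_len - len(prefix)
    return prefix + b"\x00" * pad, prefix + b"\xff" * pad


def find_by_prefix(sorted_keys, prefix: bytes):
    """Return every key in a sorted list that starts with the prefix."""
    low, high = prefix_range(prefix)
    start = bisect.bisect_left(sorted_keys, low)
    end = bisect.bisect_right(sorted_keys, high)
    return sorted_keys[start:end]


# Stand-in for the ordered set of 32-byte internal IDs.
keys = sorted(bytes([i]) * 32 for i in (1, 2, 3))
matches = find_by_prefix(keys, b"\x02" * 8)
print(len(matches), "match(es)")
```

If the lookup ever returns more than one key, two observations share the same 64-bit prefix, which is exactly the rare collision case discussed above and would need to be handled in the UI.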

If you're interested in the code and math behind this, see this code example and some background on the collision maths

okdistribute commented 3 years ago

How about something similar to what3words? https://what3words.com/clip.apples.leap

We can use the location as the starting place for the three words, or we can generate three words from the id using a word list.

gmaclennan commented 3 years ago

If we only want to link back to a location in Mapeo, I think we should be able to do it with 3 words. However, for linking back to an observation record specifically I think we need at least 64 bits of entropy, although as you say, with location encoded first the entropy might be enough with 4 words. We could use something like BIP39 and use 6 words to get 64 bits of entropy. I wonder if there are open-source implementations of the what3words algorithm? I know that particular algorithm has faced a tonne of criticism.
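The 6-word figure comes from the word list size: a 2048-word list carries 11 bits per word, so 6 words hold 66 bits. A small Python sketch of that arithmetic follows. It borrows only the list size from BIP39, not its checksum scheme, and the word list here is a generated placeholder (`word0000` ... `word2047`), not a real list.

```python
# Placeholder word list: a real one would hold 2048 distinct, easy-to-type words.
WORDLIST = [f"word{i:04d}" for i in range(2048)]
BITS_PER_WORD = 11  # 2**11 == 2048


def to_words(value: int, bits: int = 64) -> list:
    """Split a `bits`-wide integer into 11-bit chunks, one word per chunk."""
    count = -(-bits // BITS_PER_WORD)  # ceiling division: 6 words for 64 bits
    words = []
    for _ in range(count):
        words.append(WORDLIST[value & 0x7FF])
        value >>= BITS_PER_WORD
    return list(reversed(words))


def from_words(words) -> int:
    """Rebuild the integer; raises ValueError for a word not in the list."""
    value = 0
    for word in words:
        value = (value << BITS_PER_WORD) | WORDLIST.index(word)
    return value


print(" ".join(to_words(2 ** 64 - 1)))
```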

gmaclennan commented 3 years ago

Revisiting this, I think Base32 would be a good option, since it avoids similar-looking characters and is not case-sensitive, reducing the chances of error. A 64-bit buffer encoded to Base32 would be 13 characters long. I think this is a good compromise between keeping the IDs short and reducing the chance of user error if a user does need to enter an ID manually.
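As a quick check of the length, here is a Python sketch using the standard library's RFC 4648 Base32 codec. Which Base32 variant to use is not settled in this thread, so the alphabet is an assumption, and `to_base32` and `from_base32` are illustrative names.

```python
import base64


def to_base32(share_bytes: bytes) -> str:
    """Encode bytes as RFC 4648 Base32 without the trailing '=' padding."""
    return base64.b32encode(share_bytes).decode("ascii").rstrip("=")


def from_base32(text: str) -> bytes:
    """Decode user input, accepting either upper or lower case."""
    padded = text + "=" * (-len(text) % 8)
    return base64.b32decode(padded, casefold=True)


share_bytes = bytes(range(8))  # stand-in for the first 64 bits of an internal ID
encoded = to_base32(share_bytes)
print(encoded, len(encoded))
```

Decoding with `casefold=True` is what makes the ID case-insensitive for the person typing it.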