digidem / mapeo-rfc

Planning larger architectural and cross-project components of Mapeo

Using hypercore/multifeed for storing blob data #5

Open gmaclennan opened 4 years ago

gmaclennan commented 4 years ago

Recording some thoughts and ideas here for further discussion. Non-urgent, but important to come back to at some point.

We store two types of data in Mapeo:

  1. Textual data: Observations, Nodes, Ways, Relations. These are stored as JSON objects in hypercores, which are managed by multifeed.
  2. "Blob data" e.g. binary data such as photos. Currently we only store photos, but in the future want to add voice recordings and video. Currently these are stored as files on the local filesystem, and synced via blob-store-replication-stream which is managed by Mapeo Core.

blob-store-replication-stream has worked well as a simple solution for synchronizing blob data. The replication protocol is simpler than hypercore's, and the files are stored on the filesystem, so they can be accessed outside of Mapeo. It uses abstract-blob-store, so it is possible to store the blobs via a variety of storage mechanisms (LevelDB, filesystem, etc.). It comes with certain limitations though:

  1. Data is not signed like it is in Hypercore, so it would not be possible to detect data being manipulated or changed. (There is an easy workaround: writing a hash of the blob into the hypercore document that references it; see the sketch after this list.)
  2. It means we support two different replication protocols
  3. There is no "unwant" built into the sync protocol, i.e. after two sync clients exchange their wants and haves, they start downloading all wanted blobs.
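A minimal sketch of the workaround mentioned in (1): hash the blob, embed the digest in the signed hypercore record, and re-check it when reading the blob back. Field names here are illustrative.

```js
const crypto = require('crypto')
const fs = require('fs')

// compute a sha256 digest of a blob on disk
function hashBlob (path, cb) {
  const hash = crypto.createHash('sha256')
  fs.createReadStream(path)
    .on('error', cb)
    .on('data', (chunk) => hash.update(chunk))
    .on('end', () => cb(null, hash.digest('hex')))
}

// on write: embed the digest in the (signed) observation record
function attachBlob (feed, path, cb) {
  hashBlob(path, (err, digest) => {
    if (err) return cb(err)
    feed.append({ type: 'observation', attachments: [{ id: path, sha256: digest }] }, cb)
  })
}

// on read: recompute and compare before trusting the blob
function verifyBlob (attachment, path, cb) {
  hashBlob(path, (err, digest) => {
    if (err) return cb(err)
    cb(null, digest === attachment.sha256)
  })
}
```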

(3) becomes an issue if multiple clients connect in a "sync swarm", for example:

  Client A has 20 new photos.
  Client B has the same 20 new photos.
  Client C has not synced.

A connects to C and exchanges haves; C wants the 20 photos. At the same time, B connects to C and exchanges haves; C still wants the same 20 photos.

C will download the 20 photos from both A and B, resulting in duplicates and unnecessary use of bandwidth. This would not happen with Hypercore: once C downloads a photo from A, it sends an "unwant" to B to avoid downloading it twice. This is less of an issue when syncing over a local network, although the speed of the local WiFi network can be a limiting factor, especially with multiple users gathering to sync. Over the internet we need to be as careful about bandwidth as possible, since connections are slow, photos are large, and bandwidth can cost money. If we are going to support peer-to-peer syncing over the internet, it's important that we limit bandwidth usage and avoid multiple downloads of the same data.
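A toy illustration of the bookkeeping an "unwant" provides (message names are invented; hypercore's real wire protocol handles this internally):

```js
// track outstanding wants in one place, shared across all peer connections
const wanted = new Set(missingPhotoIds)

function onPeerConnect (peer) {
  // announce everything we still need
  peer.send({ type: 'want', ids: [...wanted] })
}

function onBlobReceived (id, allPeers) {
  if (!wanted.delete(id)) return // we already had it: a duplicate download
  for (const peer of allPeers) {
    peer.send({ type: 'unwant', id }) // stop everyone else sending it again
  }
}
```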

One solution to the limitations of the current blob store implementation is to store blob data in a hypercore or hyperdrive.

Hyperdrive

Hyperdrive is built for this kind of use case, but it duplicates functionality we don't need: it uses multiple hypercores, one storing the data and another storing an index (file paths) into that data. In Mapeo we do not necessarily need this index: all media is referenced from an observation, which acts as the index.

Raw Hypercore

We could store blobs in raw hypercores, and reference the hypercore ID + seq number from the observation record. This would solve (1), (2) and (3) above, and reduce the need to multiplex streams, since the hypercore protocol can already multiplex multiple hypercores over a single stream. This solution comes with its own limitations though:

  1. The Hypercore sync protocol currently syncs each block whole, which could cause performance and memory issues for large blobs (the whole block is held in memory during decoding).
  2. Reclaiming disk space by removing data from a hypercore is unreliable: because sparse-file support varies across filesystems, "zero-ing" areas of a sparse file does not reliably result in disk space being reclaimed.

(1) is solved in Hyperdrive by splitting blobs into multiple 512 KB blocks, which ensures that each block being synced is at most 512 KB, reducing the amount of data held in memory during decoding. hypercore-byte-stream contains the code to read these byte streams as a file. One way to do this in Mapeo would be to store the attachment ID as [hypercoreId]+[start seq]+[end seq], or to store the start seq and add a contentLength field to observation attachment records; either provides the data needed to read chunked blobs from a blob hypercore (see the sketch below).
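A rough sketch of that chunking scheme against hypercore's v9-era callback API; the attachment field names (coreId, startSeq, contentLength) are illustrative:

```js
const hypercore = require('hypercore')
const BLOCK_SIZE = 512 * 1024

const feed = hypercore('./blobs') // binary valueEncoding by default

// split a blob into <=512 KB blocks and append them as a single batch
function writeBlob (buf, cb) {
  const blocks = []
  for (let i = 0; i < buf.length; i += BLOCK_SIZE) {
    blocks.push(buf.slice(i, i + BLOCK_SIZE))
  }
  feed.append(blocks, (err, startSeq) => {
    if (err) return cb(err)
    // the attachment record to store in the observation
    cb(null, {
      coreId: feed.key.toString('hex'),
      startSeq, // seq of the first appended block
      contentLength: buf.length
    })
  })
}

// read the blob back as a stream of its blocks
function readBlob (attachment) {
  const blocks = Math.ceil(attachment.contentLength / BLOCK_SIZE)
  return feed.createReadStream({
    start: attachment.startSeq,
    end: attachment.startSeq + blocks // end is exclusive
  })
}
```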

(2) is trickier, and requires changing hypercore internals to use a storage mechanism that allows data to be deleted in a way that reliably reclaims disk space.

Another issue to bear in mind is file descriptor limits: each hypercore keeps six random-access instances open (key, secret_key, tree, data, bitfield, signatures), and if using random-access-file each of these holds an open file descriptor. Adding an additional hypercore per device (for blob data) doubles the number of file descriptors during sync. Android limits a process to 1024 open files, which with one hypercore per device caps a project at around 170 devices, but with two caps it at around 85. Since this count also includes all previously used devices, it is possible that users could hit this limit within a project.
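The arithmetic, for reference (assuming the six random-access files per core listed above):

```js
// Back-of-envelope file-descriptor budget
const FD_LIMIT = 1024   // Android's per-process open-file limit
const FDS_PER_CORE = 6  // key, secret_key, tree, data, bitfield, signatures

console.log(Math.floor(FD_LIMIT / FDS_PER_CORE))       // 170 devices, one core each
console.log(Math.floor(FD_LIMIT / (FDS_PER_CORE * 2))) // 85 devices with a second blob core
```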

There might be a way to optimize hypercore to reduce file descriptors: for example, the key and secret_key storage are rarely accessed, so their file descriptors could be closed after the initial read. It might also be possible to use other storage such as LevelDB for some of the random-access instances, particularly those that use fixed-length blocks, e.g. tree (see the sketch below).
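A hedged sketch of what such an adapter could look like, using random-access-storage's hooks over a callback-style levelup instance. It assumes all reads and writes are block-aligned with a fixed length, which is a simplification that only suits fixed-length-block storage like tree; the key scheme is invented.

```js
const RandomAccessStorage = require('random-access-storage')

function randomAccessLevel (db, name, blockSize) {
  return new RandomAccessStorage({
    read (req) {
      // map the byte offset to a block-indexed LevelDB key
      db.get(`${name}/${req.offset / blockSize}`, { valueEncoding: 'binary' },
        (err, buf) => req.callback(err, buf))
    },
    write (req) {
      db.put(`${name}/${req.offset / blockSize}`, req.data, { valueEncoding: 'binary' },
        (err) => req.callback(err))
    }
  })
}
```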

hackergrrl commented 4 years ago

An idea I've been thinking about for blob sync:

Two peers would sync by exchanging RPCs to decide what each side wants, and then start streaming the wanted blobs to each other. Each side could opt in or out of blobs it isn't interested in (based on whatever criteria).
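Something like the following shape, perhaps (all message names and helpers here are invented):

```js
function onPeerConnect (peer, localBlobIds) {
  // 1. announce what we have
  peer.send({ type: 'have', ids: localBlobIds })

  // 2. reply with the subset we want; each side opts in or out here
  peer.on('have', ({ ids }) => {
    const wants = ids.filter((id) => !haveLocally(id) && wantBlob(id))
    peer.send({ type: 'want', ids: wants })
  })

  // 3. stream the wanted blobs, e.g. straight from the filesystem store
  peer.on('want', ({ ids }) => {
    for (const id of ids) streamBlobTo(peer, id)
  })
}
```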

This would also let us continue to use the filesystem, which I think partners have appreciated having easy access to?

This would address (1) and (3) from the original list:

  1. Data is not signed like it is in Hypercore. It would not be possible to detect data being manipulated or changed.
  3. There is no "unwant" built into the sync protocol, i.e. after two sync clients exchange their wants and haves, they start downloading all blobs.

As for (2), "It means we support two different replication protocols": I don't know that this is a downside? I've actually been low-key thinking about ways to replace hypercore for a while, with something that scales better to many cores. We're seeing similar issues on Cabal, where some cabals easily have well over 500 cores!