dat-ecosystem / dat

:floppy_disk: peer-to-peer sharing & live synchronization of files via command line
https://dat.foundation
BSD 3-Clause "New" or "Revised" License

codec API for underlying dat storage #85

Closed · max-mapper closed this issue 10 years ago

max-mapper commented 10 years ago

we should add a generic API for writing/reading data to and from a backend, similar to the difference between levelup/leveldown

the codec API should expose encoding/decoding of keys, values and sequence indexes (e.g. https://github.com/maxogden/dat/issues/47#issuecomment-39481407) based on the table schema (not sure what format the table schema will be in yet, maybe https://github.com/Sannis/node-proto2json or maybe it can be pluggable?)

the current multibuffer encoding/decoding and key encoding/decoding would become a new standalone codec module called e.g. dat-multibuffer-codec

we also need a protocol buffers codec. if it's as fast as the multibuffer codec we should make it the default codec

it should be possible to write e.g. a git codec or a flat files codec. should we just have the codec use a passed-in levelup instance and write gitdown and fsdown modules to make this happen?

maybe dat-transformer can be used for some of this?
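
a rough sketch of what a standalone codec module's shape could be (the method names here are assumptions, not an existing API; JSON is used only to keep the sketch runnable):

// hypothetical shape for a standalone codec module (e.g. dat-multibuffer-codec):
// the constructor takes the table schema and returns encode/decode functions
// for keys and values. a real codec would actually use the schema.
module.exports = function (schema) {
  return {
    encodeKey: function (key) { return new Buffer(String(key)) },
    decodeKey: function (buf) { return buf.toString() },
    encodeValue: function (row) { return new Buffer(JSON.stringify(row)) },
    decodeValue: function (buf) { return JSON.parse(buf.toString()) }
  }
}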

mafintosh commented 10 years ago

@maxogden what does the data emitted from readStream and changesStream look like today? Do we store header/schema stuff in the codec layer as well?

max-mapper commented 10 years ago
max:dat-test maxogden$ dat init
Initialized dat store at /Users/maxogden/Desktop/dat-test/.dat
max:dat-test maxogden$ echo '{"hello": "world", "foo": "bar"}' | dat --json --primary=hello
{"_id":"world","_rev":"1-c80c08c1bcaaa4aad4845bba6a4c3d15"}
max:dat-test maxogden$ echo '{"hello": "world2", "foo": "bar2"}' | dat --json --primary=hello
{"_id":"world2","_rev":"1-f59370f123b466350ae4cb71e2a48258"}
max:dat-test maxogden$ dat dump
{"key":"config","value":"{\"columns\":[\"hello\",\"foo\"]}"}
{"key":"ÿcÿworld","value":"01-c80c08c1bcaaa4aad4845bba6a4c3d15"}
{"key":"ÿcÿworld2","value":"01-f59370f123b466350ae4cb71e2a48258"}
{"key":"ÿdÿworld2ÿ01-f59370f123b466350ae4cb71e2a48258","value":"\u0006world2\u0004bar2"}
{"key":"ÿdÿworldÿ01-c80c08c1bcaaa4aad4845bba6a4c3d15","value":"\u0005world\u0003bar"}
{"key":"ÿmÿ_rowCount","value":"2"}
{"key":"ÿsÿ01","value":"[1,\"world\",\"1-c80c08c1bcaaa4aad4845bba6a4c3d15\"]"}
{"key":"ÿsÿ02","value":"[2,\"world2\",\"1-f59370f123b466350ae4cb71e2a48258\"]"}

values are currently encoded as multibuffers, columns are arrays of strings

I think we need to store the encoding in the config key, and make the columns an array of objects instead (for more flexibility like storing a 'type' field on a column in the future)
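
for example the config row could then look something like this (just a sketch, the field names are a guess):

{"key":"config","value":"{\"encoding\":\"multibuffer\",\"columns\":[{\"name\":\"hello\"},{\"name\":\"foo\"}]}"}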

mafintosh commented 10 years ago

The value could just be multibuffer as well. Each buffer entry in the multibuffer + the column schema would then be passed to the decoder when reading the data. Something like:

{"key":"config", "value":"{columns:[{name:'hello': type:'string'}, [{name:'foo': type:'number'}]}"}
{"key":"...", value: multibuffer_with_two_buffers

The decoding api would then just be columnEntry = decode(columnSchema, columnBuffer)
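
A minimal sketch of that signature, assuming a column schema entry like { name: 'foo', type: 'number' } (the type handling is illustrative only):

// decode one column buffer using its schema entry; sketch only
function decode (columnSchema, columnBuffer) {
  var raw = columnBuffer.toString()
  if (columnSchema.type === 'number') return Number(raw)
  return raw
}

// decode({ name: 'foo', type: 'number' }, new Buffer('42')) => 42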

jbenet commented 10 years ago

The table schema should probably use the transformer types. This will make applying all the transformer conversions to the columns super easy. All that requires is that the schema be expressible as a transformer type, say:

{
  "name": "person-name",
  "address": "us-street-address",
  "dob": { 
    "type": "date",
    "label": "date of birth",
  }
}

transformer would fill in the blanks and be able to infer the complete type definition:

{
  "@context": "http://transformer.io/context.json",
  "id": "maybe-a-dat-table-id",
  "type": "Type",
  "schema": {
    "name": {
      "type": "person-name",
    },
    "address": {
      "type": "us-street-address",
    },
    "dob": { 
      "type": "date",
      "label": "date of birth",
    },
  }
}

allowing you to then run transformer conversions directly on the rows. For example, changing that us-street-address to something more general, like a world-street-address, would just be applying the transformer:

var rowData = .... // stream
var newRowData = transformer('us-street-address', 'world-street-address', rowData);

:)

As far as transformer is concerned, though, the underlying schema can be anything you want (protobuf, multibuf, json, xml (ew)), because the schema itself -- as a json document -- can be encoded/decoded using transformer codecs (woah recursionnnnn!). And I think this will actually depend on the backend: if I want to use flat-file CSVs as a backend, I'd have the schema in the first row of the csv, or maybe as a companion json file, rather than in a protobuf encoded binary file.

@maxogden -- mad science question. What's the interface for a dat backend? can it be reduced to a key-value store?

So this may be totally unrelated, but depending on the API, it may be useful.

I wrote a project (datastore) a long while back (in python) to be an interface for key-value stores. The idea was to use a layer of indirection between App <> storage backend, and write modules that allow all sorts of things to be backends (leveldb, redis, mongo, raw fs, aws, git, elasticsearch, etc). It's been super helpful to me in a number of projects. I recently started a go port.

One mad-science side effect of datastore is combinations (a datastore wrapping other datastores). (e.g. you can do sharding, caching (tiers), namespacing, etc within datastore). This became super useful in decomposing, hot-swapping, or migrating backends. Also great for debugging. ./my-server --datastore localfs vs ./my-server --datastore production. or even ./my-server --datastore [logging, production]. All the functionality to do the swapping became super simple.

This modular approach fits the npm lifestyle, so it could be ported to js. :)
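
A tiny js sketch of the wrapping idea (hypothetical get/put interface, not the python datastore API):

// a datastore is anything with get/put; a wrapper is itself a datastore,
// so logging, caching, sharding etc. compose the same way
function LoggingDatastore (child) {
  this.child = child
}

LoggingDatastore.prototype.get = function (key, cb) {
  console.log('get', key)
  this.child.get(key, cb)
}

LoggingDatastore.prototype.put = function (key, value, cb) {
  console.log('put', key)
  this.child.put(key, value, cb)
}

// swap backends without touching app code:
// var store = new LoggingDatastore(productionStore)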

jbenet commented 10 years ago

> The decoding api would then just be columnEntry = decode(columnSchema, columnBuffer)

Or in transformer:

decode = transformer.convert('my-dat-backend-type', 'dat-column-type');
columnEntry = decode(columnBuffer);

@mafintosh is columnEntry here just the schema, or do you include data too?

max-mapper commented 10 years ago

@jbenet a dat backend (https://github.com/rvagg/node-levelup/wiki/Modules#storage) is anything that implements the abstract-leveldown API. it can be reduced to a key/value store that keeps keys in lexicographically sorted order and provides a way to iterate over a key range forwards + backwards
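
for reference, a toy backend is mostly just a handful of methods (a rough sketch against the abstract-leveldown interface; the in-memory store here skips the sorted iterator that a real backend has to provide):

var util = require('util')
var AbstractLevelDOWN = require('abstract-leveldown').AbstractLevelDOWN

// toy in-memory backend; a real one would also implement _iterator so keys
// can be scanned in lexicographic order, forwards and backwards
function ToyDOWN (location) {
  AbstractLevelDOWN.call(this, location)
  this._store = {}
}
util.inherits(ToyDOWN, AbstractLevelDOWN)

ToyDOWN.prototype._put = function (key, value, options, cb) {
  this._store[key] = value
  process.nextTick(cb)
}

ToyDOWN.prototype._get = function (key, options, cb) {
  var store = this._store
  process.nextTick(function () {
    if (store[key] === undefined) return cb(new Error('NotFound'))
    cb(null, store[key])
  })
}

ToyDOWN.prototype._del = function (key, options, cb) {
  delete this._store[key]
  process.nextTick(cb)
}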

mafintosh commented 10 years ago

@jbenet In my example columnEntry was just the data. We would have some sort of stream that you pass the decode function to, which would make sure all rows are decoded as you start reading data. Using this approach the backends wouldn't need to care about anything besides storing keys and columns as multibuffers.
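
Something like this (a sketch using through2; the backend read stream and the decode function are assumed from the discussion above):

var through = require('through2')

// wraps decode() so every stored row buffer is decoded as it is read
function createDecodeStream (columnSchema, decode) {
  return through.obj(function (rowBuffer, enc, cb) {
    cb(null, decode(columnSchema, rowBuffer))
  })
}

// usage: pipe the backend's read stream of row buffers through it
// backendReadStream.pipe(createDecodeStream(schema, decode))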

mafintosh commented 10 years ago

In the schema.json file there will be a top-level type key that will be a table-wide setting. This will be the global encoding/decoding format for each row of data in dat.

Default is protobuf (using the proto2json format here)

{
  "type": "protobuf"
  "messages": {
    "Row": {
      "fields": {
        "num": {
          "rule": "optional",
          "type": "float",
          "tag": "1"
        },
        "payload": {
          "rule": "optional",
          "type": "bytes",
          "tag": "2"
        }
      }
    }
  }
}

Another encoding would be multibuffer

{
  "type": "multibuffer",
  "columns": [{
    "name": "num",
  }, {
    "name": "payload"
  }]
}

This is passed to an encoder constructor function like so

var encoding = theChosenEncoding(schema);
encoding.encode(data);   // should return a buffer
encoding.decode(buffer); // should return a decoded object

This api is compatible with levelup, which means we can just pass the encoding directly to the levelup instance and have it handle encoding/decoding for us.
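
A sketch of what that could look like end to end, assuming the multibuffer module's pack/unpack API and levelup's support for custom encoding objects (treat both as assumptions here, not a spec):

var levelup = require('levelup')
var multibuffer = require('multibuffer')

// encoder constructor for the multibuffer schema above; everything is
// stringified for simplicity, a real codec would respect column types
function multibufferEncoding (schema) {
  var columns = schema.columns
  return {
    type: 'dat-multibuffer',
    buffer: true,
    encode: function (row) {
      return multibuffer.pack(columns.map(function (col) {
        return new Buffer(String(row[col.name]))
      }))
    },
    decode: function (buf) {
      var bufs = multibuffer.unpack(buf)
      var row = {}
      columns.forEach(function (col, i) { row[col.name] = bufs[i].toString() })
      return row
    }
  }
}

var encoding = multibufferEncoding({ columns: [{ name: 'num' }, { name: 'payload' }] })
var db = levelup('./dat.db', { valueEncoding: encoding })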

mafintosh commented 10 years ago

Constraints

max-mapper commented 10 years ago

closed via https://github.com/maxogden/dat/commit/550594a0e0cbd407583a1e95ffb241b1c759fa57 thanks @mafintosh