yuce commented 7 years ago

Pilosa Dataset Format

Introduction

In this document, we propose the Pilosa Dataset Format which solves the problem of re-creating the schema and importing/exporting of all bits data in a Pilosa server. We also propose a data file format called pair binary encoding which can store set bit operations in a structured way with using less space than the currently used CSV format. The proposed Pilosa DataSet Format has the following benefits:

Faster to export and import.
Requires less storage.
Contains the schema.
Contains all bits in all indexes.
The manifest may be used to quickly check the contents of the data set.
Imports and exports can be parallelized easily.

File Format

A Pilosa dataset is a collection of a manifest file and one or more data and extra files.

Manifest File

The manifest contains metadata such as copyright and the schema and the locations of the files in the dataset. The manifest file should be named manifest.json. It consists of key values in JSON format. indexes and version keys are mandatory, anything else is optional. If the consumer of the dataset doesn't know how to process a key, it silently ignores it, but if a value for a known key can't be processed, the consumer returns an error.

Values in the manifest may contain the location of a resource relative to the manifest file.

Required Fields

All datasets should include the following fields:

indexes: Contains the schema and locations of data files for an index. This key is explained further below.
version: Version of the data. Should be 1.

Optional Fields

A dataset may include the following fields. The consumer may choose to ignore these fields, otherwise, it should interpret these fields with their meaning below:

readme: Contains the location of the README file. README file should to be in UTF-8 text and use of markdown is suggested. README files contain information about the dataset and its usage, which can be very useful for public datasets.

Indexes Field

The most important field in the manifest is indexes. It contains all information for importing an index into Pilosa and may contain extra fields which would be useful doing import, such as checking the consistency of data files.

The indexes field is in the following format:

    "indexes": {
        "INDEX1-NAME": {
            "options": {
                "columnLabel": "COLUMN-LABEL",
                "timeQuantum": "TIME-QUANTUM",
            },
            "frames": {
                "FRAME1-NAME": {
                    "options": {
                        "rowLabel": "ROW-LABEL",
                        "inverseEnabled": true,
                        "timeQuantum": "TIME-QUANTUM",
                        "cacheType": "CACHE-TYPE",
                        "cacheSize": 1000
                    },
                    "views": [{
                        "standard": {
                            "data": [{
                                "file": "DATA-FILENAME-1",
                                "format": "pair",
                                "slotFormat": [1, 2],
                                "sha256": "....",
                                "bitCount": 12345
                            }, {
                                "file": "DATA-FILENAME-2",
                                "format": "pair",
                                "slotFormat": [2, 4],
                                "sha256": "....",
                                "bitCount": 100
                            }]                            
                        }
                    }]
                }
            }
        }
    }

Each first level key is an index name. First level values contains the options and frames of the index. Second level contains the frame names as keys and options and information about the views for that frame as values.

Data Field

The data field contains the location of the bit data of a view and its format. It can also contain fields to check the integrity of the data:

file *(required): Location of the data file, relative to the manifest.
format *(required): Format of the data file. Use csv for CSV data or pair for pair encoded binary files. See the Data File section below.
bitCount (optional)`: Number of bits set in the file. This field may be used to calculate the number of bits in the dataset.

Data File

The data file contains the bit data for a view. It can be in CSV or pair encoded binary. More data formats may be added to this proposal.

CSV

The CSV data files should be ASCII or UTF-8 encoded and should have the following structure:

Set bit without timestamp

row_id,column_id\n
...

Pair Encoded Binary

Variable number of bytes are specified for row IDs AND columnIDs to ensure minimum space is used to specify a set bit. Each row or column ID uses the minimum number of bytes in the [1..8] range which fits the ID. A set bit operation which contains the row ID AND column ID is called a slot. Each slot in a data file has the same size. The slot format of a data file is a pair that has the number of bytes required for each component of a slot: (rowID_bytes, columnID_bytes). For instance, if a data file contains row IDs which fit in a single byte, column IDs which fit in 2 bytes would have following slot format: (1, 2).

Note that, it is up to the exporter to decide which slot formats to use but each data file must use the same format. The exporter may aim for minimum space usage and use all possible values for the components in a slot format, which would result in 8 (row byte count) * 8 (column byte count) = 64 data files. Or, it may use only (8, 8) slot format which would result in a single data file but potentially waste a lot of space. Or, use 2, 4, 8 for row IDs, 4 or 8 for column IDs which would result in 3 * 2 = 6 files.

Implementation

The Go client supports import from a Pilosa server but not export data to the server. We can implement export support into the Go client and write a tool which uses the Go client to do the exports and imports using HTTP API of the Pilosa server. No change to the server is required.

alanbernstein commented 7 years ago

It seems there is some overlap with the input definition work here, but I'm not caught up on the current status. Will this serve a significantly different purpose?

Have you compared storage size of this binary format vs Roaring storage? Your proposed TEB is much simpler to implement, but it is not too hard to use Roaring to read/write a bitmap file.

raskle commented 7 years ago

I think we should coordinate your effort here with the import-definition. The current import-definition branch already has much of the manifest functionality implemented. This work is considered phase 1 with low volume imports JSON data implemented. We'd like to build on this in phase 2 with binary and csv file imports. Perhaps we can team up to evaluate various approaches for importing binary data.

yuce commented 7 years ago

Both input definition and imports can be used to put data into Pilosa but the similarities end there:

1) Most importantly they are used for very different purposes. Input definition is a mapping which defines what data from an external service should land in which frame and how should it be transformed. It is used for interoperation between services. Imports/exports is just the raw data from a Pilosa server which can be used to seed a Pilosa instance or share the data. etc; there's no external service involvement.

2) Input definition would be used for continuously getting data from an external system like hadoop; imports/exports would be used for getting the data initially into the server or backing up the data, so it would be used much less frequently.

3) They are used very differently. For using input definitions, the user would decide on a mapping, POST it to the server to create the mapping and link the external service to send the data to Pilosa. For using imports/exports, the user invokes pilosa (or another tool) command line and point to where the data is.

4) Input definition and imports/exports work on different levels: the former works on the server level and the latter works on the client and server level. This proposal only requires client-side implementation.

5) The imports/exports support is already in the server and all official clients have imports support. This proposal only adds a manifest file which contains the information about how to create a schema and enables us to partition the data for frames. The optional binary format doesn't require a change in the server, it only requires client support.

@alanbernstein re: saving the bits data as roaring bitmaps: Anything is possible, but the server requires the data as individual bits (rowid, columnId, timestamp) so using any other format would require re-encoding the data; like read a roaring bitmap and convert them to individual bits and send to the server.

@raskle re: using the same manifest for input definition and imports/exports. As I've explained above those are for very different purposes, so as far as I see the requirements for the manifest files are very different as well. I can't see much opportunity to reuse stuff here, except the frame options.

yuce commented 7 years ago

I've updated the proposal to include views. The data files are attached to views, instead of frames now, so the triplet binary data format is pair binary data format (since saving timestamps for views is not necessary/allowed).

jaffee commented 7 years ago

Sorry to be so late in reviewing this, but I think it's a great proposal, and I agree that we need it separately from the input definition work.

Now a bunch of opinions on the details 😉

Views should not be dumped as separate data files IMO. A view should really just be a different arrangement of the same data, so storing that data once along with stating in the manifest which views are enabled should be enough. I guess the exception in Pilosa's case is timestamps, and you've probably thought this through more than I have, but storing each view seems like it would take a lot more space than storing each bit with its timestamp.

We shouldn't support mixing CSV and pair encoded data within the same manifest. Choose one or the other for the entire dump. Is there a strong use case for intermingling the formats?

The location of each file should not be specified in the manifest - I think we open ourselves up to weird bugs with this. Have files in fixed locations based on frame names in the same directory the manifest resides (or subdirectories).

@yuce I'm wondering if you considered not having a manifest (or having a much simpler one) and giving each data file a header which described its data format, index, frame, etc. I think that would loosen the coupling between data and manifest, making it easier to share specific frames or indexes, and sidestep the issue of supporting the intermingling of multiple data formats within the same manifest. Would be interested to hear if you can think of cons to that approach.

yuce commented 7 years ago

@jaffee Thanks a lot!

IMO the manifest is quite valuable. There would be lots of files that makes the data set, and those files would be hosted on something like s3. The user would feed the URL of the manifest to the importer, and that's all the importer needs. The importer would then fetch the data files using the paths in the manifest. The importer can do a few nice things with the info in the manifest. It can display a progress bar (since it knows the number of the files [maybe even sizes]); it can check the consistency of individual files using the checksums, and re-fetch the broken ones. If the exporter saved the number of bits for data files in the manifest, the importer can present that to the user without fetching all data files. I also think it is valuable to have the index metadata, such as index name, options, frame names, frame options, etc. in the manifest. For one, it makes the data files simpler; and it shows a nice overview of the data set without the user downloading the actual data.

The manifest can also contain links to auxiliary files, such as the README. We could have a web app which stores the manifest URLs and it could automatically create web pages using the manifest, README, etc. for the public (and commercial?) data sets, such as the taxi data set. [That would allow the user immediately see what kind data the data set contains, how big is it, instructions, copyrights, etc.]

I've thought about having a header for the data files (such as the slot format) but moving those to the manifest makes the data files trivial. The disadvantage of this approach is, if the user doesn't have the manifest, the data files are useless. But don't how that can happen.

Pilosa can currently export only views, so if we don't import/export by views, the only option I can think of is having Pilosa export frames with calculating the timestamps from related views.

I don't think there's a use case for intermingling data formats, but can't think the problems this could create either. I would imagine the exporter would export files in one format (but would it cause trouble if we allowed the importer only care about the format of the file it processes?)

In the proposal, the path of the files is relative to the manifest (I think this is what you propose in your comment too). Maybe in the future we can relax this constraint so any URL can be specified in the manifest??? (far fetched idea: data sets deriving from another one).

jaffee commented 7 years ago

OK, I see your point regarding having the data files be separate from the manifest (in S3 or whatever) that could be really cool.

I'm still concerned about handling everything at the view level, although I understand that's what PIlosa can actually do right now. Views are just denormalizations of the data - exporting and importing them adds more files and redundant data. If tools outside of Pilosa want to deal with exported data, they have to decide which parts of the data are relevant to them - they also might need to reconstruct timestamps by looking at multiple files which sounds hairy. I really think we should consider implementing import/export logic in Pilosa that works at the frame level so that we aren't denormalizing exported data, and so that we keep as simple a data format as possible for exported data.

yuce commented 7 years ago

I agree with you that we should be able to export frames. That's what the initial proposal had until I realized we didn't support it yet. I can restore that which adds the timestamp to slotFormat (0 or 8) and removes views.

addos commented 7 years ago

Hi, I realize I am kinda new here. But one thing I am finding would be really nice, would be a way to just run the pilosa import command -i index -f frame filename.csv and what I'd love for it to be able to do is automatically create the index for me with the default column name, and create the frame for me, with the default row name. So I don't have to explicitly create the index or frame. Since, for a lot of the data I am importing, the defaults would be good enough for me.

jaffee commented 7 years ago

Hi @addos, thanks for jumping in!

I think those are totally reasonable additions to the import command, and this is not the first time that kind of functionality has been requested. Could you create a new issue (with basically just what you said in your comment), and we'll continue the discussion there?

It will probably be some time before the changes proposed in this ticket are implemented, but I think the tweaks you propose could be done in the near term.

addos commented 7 years ago

@jaffee I filed this as a new issue/proposal #765

travisturner commented 7 years ago

We should be able to export a time-based frame by using the data from the most granular view. For example, if a frame is configured to store time-quantum YMD, then we can ignore the Y and YM views. The data would effectively be the rowIDs and columnIDs from each YMD view along with the timestamp representation (Y-M-D 00:00:00) of that view. Note that in storage we have lost information from the original timestamp and can only represent what we know.

The other frame types should be exported from their standard view.

While I agree that Input Definition and Pilosa Dataset Format are different, I think @raskle was pointing out that both contain implementations of a "schema" and we should therefore make sure we're not duplicating effort/code if possible.

I also second @alanbernstein's question regarding roaring as a binary format. I guess it's because you're suggesting that this all happen in the client and we are therefore bound to the existing api of the server. Should we consider a new method to import/export roaring data from the server so the client isn't having to do so much encoding/decoding?

Finally, if we were to move forward with a custom binary encoding like "pair encoding", why not encode the data by row so that we're not repeating the rowID with every pair? [(row)(col, col, col, col),(row)(col),(row)(col, col)]. I'm not really proposing this, because to me, going down that path just reinforces the argument to support roaring.

yuce commented 7 years ago

We can definitely have different formats for data files, depending on the need of the target application. If we aim for a high level of interoperability and don't care for the file size, we can just encode the data files in CSV. If we aim for minimizing the space usage but don't need interoperability, we could save raw roaring bitmaps in the data files. As long as the format field for a data file is correct and there is an encoder/decoder for that format, everything should be fine.

The proposed binary format in this document is an improvement over CSV format regarding storage size and encoding/decoding speed. The format is extremely simple, so I believe a few lines of code is all it needs to implement an encoder/decoder in any programming language. I also like that size of each slot (row_id,col_id,[timestamp] tuple) is fixed in a file, so the location of each slot is known beforehand. One of the benefits is, appends and deletions are very cheap. More importantly, it enables us to have a high level of concurrency. A decoder can spawn 10 threads/goroutines and read and import 10 slots concurrently. The same holds for the encoder as well. Probably there are even more benefits of having this very simple scheme of saving data.

@travisturner I've considered saving rows as (row_id, [col_ids]) but when I thought about the implementation, I wasn't sure what we gain outweighted what we lose:

The encoder, especialy the decoder becomes much more complex.
We can't assume all row ids and column ids in a row (let alone the file) are of the same byte size. That requires an extra byte for each row/column ID which defeats the storage size advantage. Also the decoding step is a lot involved which affects the import speed.
Or, we can select row ids and column ids with the same byte count during encoding and put them together in the same file. We sacrifice the encoding speed.
The slots (row_id,[col_ids]) are not of fixed byte size, so we can't jump to an arbitrary position in the file and read the slot there, which limits concurrency opportunities.
To solve that problem, we can keep a table in the file with entries which point to slots. A decoder would first read the table and then it gains the ability to jump to arbitrary positions in the file. But then, keeping the table becomes the problem. Is it at the start of the file, or at the end; can we easily append slots to the file and update the table, etc. The storage advantage is reduced.

The case against saving raw bitmaps was lack of support for it in Pilosa like you mentioned and as far as I know it doesn't support 64 bit IDs for all major PLs. Not sure about how compatible each implementation is too. Would we use the serialization format at: https://github.com/RoaringBitmap/RoaringFormatSpec/ or develop a new one? If we had endpoints that support raw roaring bitmaps I guess at the least we could use raw roaring bitmaps as one of the officially supported data formats to have an efficient backup system.

alanbernstein commented 7 years ago

You have some good points @yuce.

Regarding Roaring serialization, Pilosa's binary file format is inspired by the spec but not compatible. It is described briefly at https://www.pilosa.com/docs/architecture/. The pilosa inspect command shows some simple usage.

alanbernstein commented 6 years ago

In light of https://github.com/pilosa/pilosa/issues/1319, in the next implementation of our dump/restore, we might consider creating dump files that are a sequence of PQL commands, similar to how pg_dump can generate a file of SQL commands, for example. In that case, this proposal wouldn't exactly work as-is, but ideas from this could inform new PQL calls to be used in a dump file.

FeatureBaseDB / featurebase

Proposal: Pilosa Dataset Format #725

Pilosa Dataset Format

Introduction

File Format

Manifest File

Required Fields

Optional Fields

Indexes Field

Data Field

Data File

CSV

Pair Encoded Binary

Implementation