MolSSI / QCSchema

A Schema for Quantum Chemistry
http://molssi-qc-schema.readthedocs.io/en/latest/index.html#
BSD 3-Clause "New" or "Revised" License

How to represent large arrays #15

Open tovrstra opened 7 years ago

tovrstra commented 7 years ago

Large arrays (e.g. density matrices when using a lot of basis functions) may cause some efficiency issues. At the moment all arrays are represented as a list of (lists of ...) numbers. This has some limitations:

1) Writing a double precision number as text in JSON gives about +150% overhead. This may be solved to some extent with BSON or HDF5, provided arrays can be recognized easily before they are written to the file.

2) Reading a long list of data, then detecting that it is an array and then converting it to an array is slow and very annoying in some programming languages. It would be very convenient to know in advance that the list is an array with a given shape and data type.

Both aspects of this issue can be fixed by representing arrays as follows and by handling such data appropriately when reading and writing a BSON or HDF5 file.

{"density matrix": {
  "shape": [10, 10],
  "type": "float64",
  "array": [1.234123, ...]
}}

It would even be possible to extend the JSON format to encode the array with base64 encoding (only +33% overhead). I know that such practical details are not really a part of the JSON schema. However, it is relevant to define the schema such that arrays can be read and written efficiently.
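To illustrate, a reader that recognizes this layout could rebuild a numpy array directly. A minimal sketch, assuming the keys proposed above ("shape", "type", "array") with a flattened value list; the helper names are only for illustration and not part of the spec:

import numpy as np

def decode_array(obj):
    # Rebuild an ndarray from a {"shape", "type", "array"} object.
    return np.array(obj["array"], dtype=obj["type"]).reshape(obj["shape"])

def encode_array(arr):
    # Flatten an ndarray into the proposed JSON-friendly object.
    return {"shape": list(arr.shape), "type": arr.dtype.name, "array": arr.ravel().tolist()}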

cryos commented 7 years ago

I totally agree, and think we should favor representations in this form. There is some pushback, as it is then necessary to do some work indexing into the arrays, but I think it is worth it for the points mentioned above.

tovrstra commented 7 years ago

I should ping @sunqm on this issue. I forgot earlier. Sorry...

dgasmith commented 7 years ago

Agreed, we might want to roll this into our units tech as well, since that follows a similar format.

dgasmith commented 7 years ago

Another way to store large arrays is to hold the binary data directly:

{
 'data': b'\xf3<\xa1...',
 'shape': (4, 4),
 'dtype': '<f8'
}

Most languages can understand this fairly easily, and we do not have to go through the standard serialization/deserialization process where some data can be lost. It kills readability, but for larger arrays it's likely the way to go.
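In numpy terms, the round trip for such a blob might look like the sketch below; the dict layout mirrors the example above, and the raw bytes would still need base64 or a binary container (BSON, HDF5, ...) before they could live inside JSON:

import numpy as np

mat = np.arange(16, dtype="<f8").reshape(4, 4)

# Serialize: raw little-endian bytes plus the metadata needed to rebuild the array.
packed = {"data": mat.tobytes(), "shape": mat.shape, "dtype": "<f8"}

# Deserialize: frombuffer reinterprets the bytes using the stated dtype and shape.
restored = np.frombuffer(packed["data"], dtype=packed["dtype"]).reshape(packed["shape"])
assert np.array_equal(mat, restored)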

sunqm commented 6 years ago

For a binary format, does big-endian / little-endian ordering need to be considered?

tovrstra commented 6 years ago

@sunqm One must know the byte ordering, which is indicated by the < in '<f8'. (It means little-endian.)

I'm not sure if supporting a binary representation is that important. One is free to use the spec with a JSON alternative that natively supports (efficient) storage of binary data and takes care of byte order. One could see it as a problem to be solved outside the spec.

sunqm commented 6 years ago

I agree with @tovrstra. The binary format seems hard to test and debug. I think a readable text form is definitely needed, and the binary format can be an optional field if performance or file size matters in certain scenarios.

BobHanson commented 6 years ago

Since we are talking about a data transfer spec, I do think this is within scope.

The following is in relation to the sparse-array issue, but I suggest it could be a binary solution as well...

For my first try in Jmol, just for atom names and symbols in topology.atoms, I am implementing a simple run-length encoding of arrays with a first-element descriptor:

["_RLE_",.....]

My decoder simply first checks the first array value, and if it reads "_RLE_", then we know it is RLE-encoded. But the developer certainly has the option to not do that if desired.

My suggestion is to allow different array "headers" like this -- so that it is still just an array being read:

["_BASE64_","<base64-encoded string>"]

The context should handle the data type, but if that is also desired, then perhaps a simple qualifier such as:

["_BASE64_/float32", ....]

for example. Very simple and flexible.

Endianness is important -- and should be specified. I prefer little-endian, which is very common.
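A decoder for such header-tagged arrays could dispatch on the first element. A rough sketch, assuming the "_BASE64_/<dtype>" qualifier suggested above and little-endian byte order; none of these names are part of the spec:

import base64
import numpy as np

def decode_tagged_array(items):
    # Plain arrays pass through untouched.
    if not items or not isinstance(items[0], str) or not items[0].startswith("_BASE64_"):
        return items
    # "_BASE64_/float32" -> dtype "float32"; default to float64 if no qualifier is given.
    _, _, qualifier = items[0].partition("/")
    dtype = np.dtype(qualifier or "float64").newbyteorder("<")
    return np.frombuffer(base64.b64decode(items[1]), dtype=dtype)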

"shape" seems like an odd choice for a dimensionality descriptor. Why not just "dim": [4,4]

I realize that some may think it better to have some sort of other meta tag that would indicate array type. Here are the arguments against that:

In fact, another option for long arrays would be to use a true data URI:

["data:application/octet-stream;type=float64;base64,......"]

Alternatively, you then don't really need the array at all:

"data:application/octet-stream;type=float64;base64,......"

but then that is a totally different solution than a run-length encoded array, where you need both a data-type header and numeric array data. But maybe we would want to have:

["data:application/qcschema;type=rle,......"]

I might experiment with that.
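A data URI like this could be unpacked with ordinary string handling plus base64. A rough sketch, assuming the "type=" qualifier written above and native byte order:

import base64
import numpy as np

def decode_data_uri_array(uri):
    # e.g. "data:application/octet-stream;type=float64;base64,<payload>"
    header, _, payload = uri.partition(",")
    dtype = "float64"
    for param in header[len("data:"):].split(";"):
        if param.startswith("type="):
            dtype = param[len("type="):]
    return np.frombuffer(base64.b64decode(payload), dtype=dtype)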

Bob

dgasmith commented 6 years ago

As a note, you can use backticks to render `["_BASE64_/float32", ....]` verbatim. Or you can use triple backticks to make code blocks with syntax highlighting:

import numpy as np
print(np.arange(5))

tovrstra commented 6 years ago

@BobHanson The name "shape" comes from the numpy API (https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.ndarray.shape.html) but it seems to be a better choice than "dim", even without considering Numpy conventions.

In the Numpy API, "dimension" or "dim" is used to refer to the number of elements in shape, which is consistent with common terminology: a 1D array corresponds to a vector, a 2D array corresponds to a matrix, but there can also be higher-dimensional arrays. Likewise, "shape" is also a name consistent with common terminology, e.g. it determines whether a 2D array is square or rectangular.
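For reference, the numpy distinction between the two terms:

import numpy as np

mat = np.zeros((4, 4))
print(mat.shape)  # (4, 4) -- the shape: rows and columns
print(mat.ndim)   # 2      -- the dimensionality: a 2D array, i.e. a matrix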

I like your ideas regarding the encoding, but what do you do in case a list just happens to start with "_RLE_" accidentally?

BobHanson commented 6 years ago

OK, "shape" it is.

RLE

I think something like this would work.

Thus my suggestion is that the first element of a sequential numerical array may optionally be a string that describes the compression method.
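The exact run-length layout is not spelled out here; assuming value/count pairs after the header (an illustrative guess, not Bob's actual scheme), a decoder is indeed only a few lines:

def decode_rle(items):
    # Decode ["_RLE_", value1, count1, value2, count2, ...] into a flat list.
    if not items or items[0] != "_RLE_":
        return items  # not run-length encoded
    out = []
    for value, count in zip(items[1::2], items[2::2]):
        out.extend([value] * count)
    return out

# decode_rle(["_RLE_", "C", 3, "H", 8]) -> ["C", "C", "C", "H", "H", "H", "H", "H", "H", "H", "H"]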

dgasmith commented 6 years ago

Putting compression inside the spec itself could cause quite a few issues of interoperability as these libraries can be relatively heavy to import into languages. Why not just compress the whole JSON file while sending it between points?

> at the same time, a parser should not have to read two key/value pairs just to read an array.

I don't quite understand why this is important; reading multiple pairs compared to checking the first index of the array doesn't seem terribly different.

BobHanson commented 6 years ago

On Tue, Dec 19, 2017 at 9:06 AM, Daniel Smith notifications@github.com wrote:

> Putting compression inside the spec itself could cause quite a few issues of interoperability as these libraries can be relatively heavy to import into languages.

Why would that be? The interface libraries will have to be written no matter how it is done. What makes this "heavy"? I'm missing your point there.

> Why not just compress the whole JSON file while sending it between points?

> at the same time, a parser should not have to read two key/value pairs just to read an array.

JSON files are always gzip-compressed automatically in transmission. Compression is not really the issue. A 64-bit float such as "1.2345678E-5," is 12 bytes; its binary form is 8 bytes, and its base64 form is probably about 11 bytes as well. No huge gain there.

The issues are:

(1) Parsing numbers rather than reading them from binary, which is slower.

(2) Data integrity -- binary-to-ASCII-to-binary is lossy; binary-to-base64-to-binary is lossless.
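To illustrate point (2): whether a text round trip is exact depends on how many digits get written, while the base64 round trip reproduces the bytes exactly. A small sketch (the truncated format is only an illustrative assumption; printing 17 significant digits would also round-trip exactly):

import base64
import struct

x = 1.0 / 3.0

# A truncated text representation loses the low bits of the double...
assert float("%.7E" % x) != x

# ...while binary -> base64 -> binary reproduces the value bit-for-bit.
payload = base64.b64encode(struct.pack("<d", x))
assert struct.unpack("<d", base64.b64decode(payload))[0] == x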

I am just stating my preference for in-line array header compression indicators rather than some sort of specialized supplemental key, as is done in many binary formats. It may seem a bit odd as part of a JSON format, but it is quite customary in data transfer formats that involve binary data.

> I don't quite understand why this is important; reading multiple pairs compared to checking the first index of the array doesn't seem terribly different.

It's actually considerably different. If for every array one has to read two keys, one of which may not exist, just to find out if it is compressed, two hashtable lookups have to be done, and there have to be two separate keys in the hashtable (and it has to be a hashtable surrounding the array, not the array by itself or another array wrapping it). Checking a first element for type, on the other hand, is trivial, compact, easy to pass along without loss, and totally doable.

cryos commented 6 years ago

I share some of @dgasmith's concerns. What I think we are looking for is a spec that can be implemented using an off-the-shelf JSON parser. Optimizing for hand-written JSON parsers, and extending for highly efficient binary transport, seems to be pushing things in a very different direction. For C++ we currently use jsoncpp, and I am looking at a more modern JSON header library. In Fortran Bert used this, and in Python we usually just import the json module.

This was part of the reason there was pushback on using JSON pointers: many parsers do not support them at this point. The more we put in the spec, the more difficult it will be to develop support in each language, and I think that will make it less likely to be adopted by quantum codes etc. Personally I would rather develop an HDF5 or binary sister format than attempt to create a hybrid mixing JSON with binary encoders, compression, etc. The HDF5 dependency raises the bar quite a bit too, though, and to the best of my knowledge effectively requires linking to the HDF5 C library at this point.

If you look at the work done in XDMF it is interesting to note how they use very standard XML, and point to external binary resources which might be binary raw blobs, HDF5, or similar. We have seen some great uptake, but they took great care not to mix XML and binary.

I entirely agree that work must be done whatever the spec, but if we step too far outside of JSON, to the point of custom parsers, etc., I think this becomes quite a different project. There is the proposal of vendor extensions that would enable experimentation with whatever we like, such as these encoded arrays. Just my opinions. There is also a BSON spec I am a huge fan of that would enable direct encoding of arrays; we use it in databases already, and making use of it for binary could be a great option for end-to-end binary support where many of the details are already hashed out (I haven't tried to use it for transport yet).

BobHanson commented 6 years ago

For Jmol -- and visualization in general, I think -- it doesn't really matter if the format is lossless or not, binary or not, compressed or not, because the visualization is an end product. I thought, though, that people had expressed an interest in lossless transfer between calculational modules as part of an intermediate workflow. That's the reason to add a binary array option for JSON.

I might have had that wrong. But if people do want to be able to transfer actual calculational results (density matrices, for example, perhaps even just coordinates) losslessly between modules, then you have to allow some sort of binary representation in the schema.

Nothing mentioned so far suggests you couldn't use an off-the-shelf JSON parser. What is being discussed is the next step -- making meaning of that JSON-parsed data structure. Run-length decoders are about 10 lines long. They are standard and trivial. Base64 is also very standard and off-the-shelf.

I'm not seeing the point Marcus is making. JSON pointers (layered JSON) are a very different animal. Those really do take specialized parsers.

BSON is a possibility, sure.

langner commented 6 years ago

My vote would also be to get a text-based array format finalized first. As mentioned, different binary representations can be tried out as "extensions", since it's unclear what the right thing to do is.

It might be the case that it would be best to support several different (binary) representations of arrays, and signal the choice at the top of the file.

tovrstra commented 6 years ago

I'm also in favor of keeping the spec as simple as possible and introducing fully orthogonal extensions later to obtain compact representations.