Hi. I have thought about this problem. It should be possible to use Ruby's Marshal module. What do you think?
Can you please provide some more detail?
The idea is to convert an NMatrix object to a string using the Marshal module. It supports load and dump operations. dump returns a string, which can be written to a file using the following interface:
File.open('matrixfile', 'wb') do |f|
  f.write(Marshal.dump(m))
end
http://ruby-doc.org/core-2.3.0/Marshal.html
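Reading it back would then be the reverse of this (a minimal sketch, assuming NMatrix supports the Marshal hooks being discussed here, and `m` as above):

```ruby
m = File.open('matrixfile', 'rb') do |f|
  Marshal.load(f.read)
end
```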
It is also possible to add compression by defining a marshal_dump method for NMatrix.
Is it written in C, or in pure Ruby? How does it work with C data structures?
I am thinking about the following implementation: Marshal calls NMatrix's marshal_dump method. Up to this point it is pure Ruby. As far as I understand, inside marshal_dump it is a good idea to call C code that compresses the data and returns a string to the Ruby side. That string becomes the output of marshal_dump, and it is then written to the file using the standard Ruby interface. This way the advantages of C code and the Ruby interface are combined.
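A rough sketch of that layering might look like the following. Note that to_compressed_binary and initialize_from_compressed_binary are hypothetical C-backed methods, not existing NMatrix API:

```ruby
class NMatrix
  # Called by Marshal.dump; delegates the heavy lifting to a
  # hypothetical C-backed routine that packs (and could compress) the data.
  def marshal_dump
    [shape, dtype, to_compressed_binary]
  end

  # Called by Marshal.load on a freshly allocated object,
  # with whatever marshal_dump returned.
  def marshal_load(dumped)
    shape, dtype, blob = dumped
    initialize_from_compressed_binary(shape, dtype, blob)
  end
end
```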
What do you mean by "compress," exactly?
It could be anything, starting with passing the matrix data through Zlib. It is also possible to implement one of the large-matrix compression methods mentioned here: https://peerj.com/preprints/849.pdf. In other words, any data processing could be implemented at this step.
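For the Zlib case specifically, here is a minimal sketch of the round trip, done outside marshal_dump just to show the pieces, and assuming NMatrix can be dumped with Marshal as discussed above:

```ruby
require 'zlib'

raw        = Marshal.dump(m)               # serialized matrix
compressed = Zlib::Deflate.deflate(raw)    # pass it through Zlib

File.open('matrixfile.z', 'wb') { |f| f.write(compressed) }

restored = Marshal.load(Zlib::Inflate.inflate(File.binread('matrixfile.z')))
```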
Have you explored the numpy binary file format?
Okay. I like this strategy. Go for it.
@v0dro, not yet. Do you think it is a good idea to follow the same format? I understand that it could be useful for loading numpy matrices, but generally I think it is better to implement a separate method for reading numpy matrices.
@mohawkjohn, ok.
One more question: can this task be part of a GSoC 2016 proposal?
Also, I have just checked that defining a marshal_dump method directly in the NMatrix class makes Marshal.dump work correctly. Now I am thinking about where to define this method, because I will need access to the matrix data inside it.
I have a doubt. If you're going to implement a compression algorithm for storing matrices, how can one seek to a particular element directly and then read a given number of elements from that point onward?
That functionality would be important since very large matrices cannot be stored in memory and often users will want to read off a part of it from persistent storage.
I agree, that is a problem. For the moment I want to implement the interface first (without compression).
Does NMatrix support partial reads and writes from disk (the problem you have described)? Or are you talking about the problem of working with a compressed matrix written to disk without NMatrix?
I'm concerned whether your compressed matrix will work for file seeking. Hence I don't think storing in a compressed file format is a very good idea in the first place.
I don't think partial read/write is supported. You can add that as part of this issue.
I have found an article that describes seekable compression using zlib: http://lh3.github.io/2014/07/05/random-access-to-zlib-compressed-files/. I think it could be interesting to try this approach.
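As a simplified variant of that idea (independently deflated fixed-size blocks plus an offset index, rather than the exact flush-point indexing the article describes), something like this would already allow seeking to an arbitrary raw position:

```ruby
require 'zlib'

BLOCK_SIZE = 1 << 20  # 1 MiB of raw data per block (arbitrary choice)

# Compress raw bytes as independent zlib blocks; return [blob, offsets],
# where offsets[i] is the byte position of block i inside blob.
def compress_blocked(raw)
  offsets = []
  blob = String.new
  (0...raw.bytesize).step(BLOCK_SIZE) do |start|
    offsets << blob.bytesize
    blob << Zlib::Deflate.deflate(raw.byteslice(start, BLOCK_SIZE))
  end
  [blob, offsets]
end

# Decompress only the block containing raw byte position pos
# (the caller would map an element index to pos via the element size).
def read_block(blob, offsets, pos)
  i      = pos / BLOCK_SIZE
  finish = offsets[i + 1] || blob.bytesize
  Zlib::Inflate.inflate(blob.byteslice(offsets[i], finish - offsets[i]))
end
```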
I think it is a good idea to implement partial read and write, and I will think about how to implement it.
Do you think this is enough for a GSoC 2016 proposal?
No, partial read and write are not supported. But I'm not sure there's a use case that requires it.
When you say "is it enough," do you mean for the code contribution requirement, or for a full summer of work?
@mohawkjohn, I am talking about the code contribution.
Yes. I think any non-trivial contribution — to show you understand the codebase and how open source development works — is sufficient.
With that said, we look very kindly on people who contribute more code. =)
Right now, NMatrix uses C++ STL's iostream to handle binary reading and writing of matrices. That's great and stuff, but I'm not sure how to make it compatible (without several days of research) with Ruby's IO module. Ideally, we should be able to hand an open Ruby IO object directly to NMatrix's binary read and write methods (or do something similar, if potentially more awkward).
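A rough sketch of what that kind of interface might look like (these IO-accepting signatures are hypothetical, not current NMatrix API):

```ruby
# Hypothetical: write a matrix to an already-open Ruby IO object...
File.open('matrix.bin', 'wb') do |f|
  m.write(f)
end

# ...and read it back from one.
m2 = File.open('matrix.bin', 'rb') do |f|
  NMatrix.read(f)
end
```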
This would also enable us to run it through a Zlib filter and compress or decompress large matrices on the fly.
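With the hypothetical IO-accepting methods above, the Zlib filter idea could look roughly like this:

```ruby
require 'zlib'

# Compress on the fly while writing.
Zlib::GzipWriter.open('matrix.bin.gz') do |gz|
  m.write(gz)   # gz quacks like an IO object
end

# Decompress on the fly while reading.
m2 = nil
Zlib::GzipReader.open('matrix.bin.gz') do |gz|
  m2 = NMatrix.read(gz)
end
```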
Other thoughts:

* We could use zlib.h to simply compress within the C++ code. That's the easiest solution, but it's the least Ruby-like. It makes some sense because we're writing binary files here, and shouldn't need to stack multiple matrices in one file. On the other hand, it raises the question of what we do to pickle additional options, such as on types that inherit from NMatrix -- where does YAML information get stored?

More later, perhaps.