Hi. I have thought about this problem. It should be possible to use Ruby's Marshal module. What do you think?
Can you please provide some more detail?
The idea is to convert an NMatrix object to a string using the Marshal module. It supports load and dump operations. dump returns a string, which can be written to a file using the following interface:
File.open('matrixfile', 'wb') do |f|
  f.write(Marshal.dump(m))
end
http://ruby-doc.org/core-2.3.0/Marshal.html
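Reading it back would then be the reverse of this (a minimal sketch, assuming NMatrix supports the Marshal hooks being discussed here, and `m` as above):

```ruby
m = File.open('matrixfile', 'rb') do |f|
  Marshal.load(f.read)
end
```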
It is also possible to add compression by defining a marshal_dump method for NMatrix.
Is it written in C, or in pure Ruby? How does it work with C data structures?
I am thinking about the following implementation: Marshal calls NMatrix's marshal_dump method. Up to this point it is pure Ruby. As far as I understand, inside marshal_dump it is a good idea to call C code that compresses the data and returns a string to the Ruby side. That string becomes the output of marshal_dump, and it is then written to the file using the standard Ruby interface. This way the advantages of C code and the Ruby interface are combined.
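A rough sketch of that layering might look like the following. Note that to_compressed_binary and initialize_from_compressed_binary are hypothetical C-backed methods, not existing NMatrix API:

```ruby
class NMatrix
  # Called by Marshal.dump; delegates the heavy lifting to a
  # hypothetical C-backed routine that packs (and could compress) the data.
  def marshal_dump
    [shape, dtype, to_compressed_binary]
  end

  # Called by Marshal.load on a freshly allocated object,
  # with whatever marshal_dump returned.
  def marshal_load(dumped)
    shape, dtype, blob = dumped
    initialize_from_compressed_binary(shape, dtype, blob)
  end
end
```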
What do you mean by "compress," exactly?
It could be anything, starting with passing the matrix data through Zlib. It is also possible to implement one of the large-matrix compression methods mentioned here: https://peerj.com/preprints/849.pdf. In other words, any data processing could be implemented at this step.
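For the Zlib case specifically, here is a minimal sketch of the round trip, done outside marshal_dump just to show the pieces, and assuming NMatrix can be dumped with Marshal as discussed above:

```ruby
require 'zlib'

raw        = Marshal.dump(m)               # serialized matrix
compressed = Zlib::Deflate.deflate(raw)    # pass it through Zlib

File.open('matrixfile.z', 'wb') { |f| f.write(compressed) }

restored = Marshal.load(Zlib::Inflate.inflate(File.binread('matrixfile.z')))
```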
Have you explored the numpy binary file format?
Okay. I like this strategy. Go for it.
@v0dro, not yet. Do you think it is a good idea to follow the same format? I understand that it could be useful for loading numpy matrices, but generally I think it is better to implement a separate method for reading numpy matrices.
@mohawkjohn, ok.
One more question: can this task be part of a GSoC 2016 proposal?
Also, I have just checked that defining a marshal_dump method directly in the NMatrix class makes Marshal.dump work correctly. Now I am thinking about where to define this method, because I will need access to the matrix data inside it.
I have a doubt. If you're going to implement a compression algorithm for storing matrices, how can one seek to a particular element directly and then read a given number of elements from that point onward?
That functionality would be important since very large matrices cannot be stored in memory and often users will want to read off a part of it from persistent storage.
I agree, that is a problem. For the moment I want to implement the interface first (without compression).
Does NMatrix support partial reads and writes from disk (the problem you have described)? Or are you talking about the problem of working with a compressed matrix written to disk without NMatrix?
I'm concerned whether your compressed matrix will work for file seeking. Hence I don't think storing in a compressed file format is a very good idea in the first place.
I don't think partial read/write is supported. You can add that as part of this issue.
I have found an article that describes seekable compression using zlib: http://lh3.github.io/2014/07/05/random-access-to-zlib-compressed-files/. I think it could be interesting to try this approach.
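As a simplified variant of that idea (independently deflated fixed-size blocks plus an offset index, rather than the exact flush-point indexing the article describes), something like this would already allow seeking to an arbitrary raw position:

```ruby
require 'zlib'

BLOCK_SIZE = 1 << 20  # 1 MiB of raw data per block (arbitrary choice)

# Compress raw bytes as independent zlib blocks; return [blob, offsets],
# where offsets[i] is the byte position of block i inside blob.
def compress_blocked(raw)
  offsets = []
  blob = String.new
  (0...raw.bytesize).step(BLOCK_SIZE) do |start|
    offsets << blob.bytesize
    blob << Zlib::Deflate.deflate(raw.byteslice(start, BLOCK_SIZE))
  end
  [blob, offsets]
end

# Decompress only the block containing raw byte position pos
# (the caller would map an element index to pos via the element size).
def read_block(blob, offsets, pos)
  i      = pos / BLOCK_SIZE
  finish = offsets[i + 1] || blob.bytesize
  Zlib::Inflate.inflate(blob.byteslice(offsets[i], finish - offsets[i]))
end
```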
I think it is a good idea to implement partial read and write, and I will think about how to implement it.
Do you think this is enough for a GSoC 2016 proposal?
No, partial read and write are not supported. But I'm not sure there's a use case that requires it.
When you say "is it enough," do you mean for the code contribution requirement, or for a full summer of work?
@mohawkjohn, I am talking about the code contribution.
Yes. I think any non-trivial contribution — to show you understand the codebase and how open source development works — is sufficient.
With that said, we look very kindly on people who contribute more code. =)
Right now, NMatrix uses C++ STL's iostream to handle binary reading and writing of matrices. That's great and stuff, but I'm not sure how to make it compatible (without several days of research) with Ruby's IO module. Ideally, we should be able to hand an open Ruby IO object directly to NMatrix's binary read and write methods (or do something similar, if potentially more awkward).
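A rough sketch of what that kind of interface might look like (these IO-accepting signatures are hypothetical, not current NMatrix API):

```ruby
# Hypothetical: write a matrix to an already-open Ruby IO object...
File.open('matrix.bin', 'wb') do |f|
  m.write(f)
end

# ...and read it back from one.
m2 = File.open('matrix.bin', 'rb') do |f|
  NMatrix.read(f)
end
```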
This would also enable us to run it through a Zlib filter and compress or decompress large matrices on the fly.
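With the hypothetical IO-accepting methods above, the Zlib filter idea could look roughly like this:

```ruby
require 'zlib'

# Compress on the fly while writing.
Zlib::GzipWriter.open('matrix.bin.gz') do |gz|
  m.write(gz)   # gz quacks like an IO object
end

# Decompress on the fly while reading.
m2 = nil
Zlib::GzipReader.open('matrix.bin.gz') do |gz|
  m2 = NMatrix.read(gz)
end
```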
Other thoughts:

* We could use zlib.h to simply compress within the C++ code. That's the easiest solution, but it's the least Ruby-like. It makes some sense because we're writing binary files here, and shouldn't need to stack multiple matrices in one file. On the other hand, it raises the question of what we do to pickle additional options, such as on types that inherit from NMatrix -- where does YAML information get stored?

More later, perhaps.