Create an Array interop crate

anowell / are-we-learning-yet

How ready is Rust for Machine Learning?

http://arewelearningyet.com

Creative Commons Attribution 4.0 International

441 stars 61 forks source link

Create an Array interop crate #14

Open vitiral opened 7 years ago

vitiral commented 7 years ago

This is based on the conversations referenced in #13

I want to try and have a conversation with the relevant crate authors/maintainers that create their own Array data types in rust. I would like to see if we can get an agreed upon basic data structure and API (Trait) for converting to/from the different types with zero* cost. This issue is intended to be the forum for those discussions. I'm hoping such an effort will be relatively easy and not limit the API's of each individual crate.

Once we get such a crate, we should document it on this site. I think telling the story of why there are different crates is really important because a lot of people expect rust numerical types have a standard in the same way as python does. If you can explain that the conversions are simple and basically free, then explain that they allow for useful features (like array multiplication with *) I think there will be a lot more support and the ecosystem can thrive with less friction.

This thread is for discussion on this topic and is open to anyone. Particularily welcome are the numerical library crate maintainers.

The proposed name of the crate from me is array-interop

* zero meaning no copying or moving (in memory) of the arrays, dimensions etc might be lost/gained.

vitiral commented 7 years ago

These requirements are subject to change per the discussion. I've chosen to create what I think are the most reasonable to get us started.

We will call the conversion trait TryFromArray, the base type Array and the conversion crate array-interop

The basic requirements are:

converting shall not copy or move the array of data
- having to create/delete dimension or lengths is acceptible -- although a standard shall be created so that crates with the same information can do a direct memory map (avoid creating a new dimension array)
converting shall utilize the TryFrom trait in std
each library shall only need to implement TryFromArray to convert to/from any other data types which implements that trait.
the library shall use a struct which is interoperable with C for conversion with libraries outside of rust
array_interop::Array shall be interoperable with numpy (not identical to their data type but zero-cost convertable into it with a bridge library), to allow interop with the the current state of the art in research languages.

Definition of Done:

we shall set up a repository which implements the array-interop crate and agree to use it

bluss commented 7 years ago

Need to think about version coupling and which way to organize this so that it works smoothly.
- Does a numeric library for example depend on array-interop, or the other way around?
A public dependency requires for a crate that it needs to bump the major version when the public dependency is version bumped
- Public dependency: Types or traits of the dependency are used in the API of the crate
TryFrom is neat but not stable which is a major drawback to it
We want to have conversion of views or view-taking of a foreign data structure as well. For example rulinalg's MatrixSlice and ndarray's ArrayView are view types. (This could be covered by the same trait.)

Andlon commented 7 years ago

Thanks for bringing up the topic.

Let's make one thing clear first. Are we talking about interop for arrays (potentially multi-dimensional of many dimensions) or matrices (and possibly vectors)? Or stated differently, do we intend this to be an interop solution specifically for linear algebra or for storing data in multi-dimensional arrays?

vitiral commented 7 years ago

@bluss

I would say the dependency would be similar to how serde works (albeit MUCH simpler). Numeric crates would depend on array-interop and implement the TryFromArray trait. I would argue it NEEDS to be this way, since many of the crates are going to want to hide the details of how their data is stored (they will be private) making it impossible for an external crate to access that information (part of the point of array-interop is to be able to expose this info in a standard defined API).
agreed: any changes to array-interop will require a bump of the major version. Therefore it should be 1.0 before any of the array libraries use it in their master branches.
I didn't realize TryFrom is not stable... we'll have to figure out something to do then.
I didn't think about Views, that is extremely important. I've given a possible solution below.
The other thing that needs to be covered (at least eventually, maybe not in 1.0) is sparse matrixes.

@Andlon: my understanding is that both 2D-Arrays and Matrixes typically store their data in the same way: a single long row-major contiguous vector. I think we should probably try to support BOTH row-major and column-major types as separate structs, possibly RowMajorArray and ColumnMajorArray. The crate author would then only implement FromRowMajorArray or FromColumnMajorArray (as the other would have a high cost of conversion).

In that way it doesn't matter whether the library treats the data as a Matrix or an Array -- the data is the same, only the operations are different. Looking at it another way, the functions/methods of each library might be different, but they are operating on data that is guaranteed to be in the same structure (and is therefore interoperable with very little conversion cost).

I think it is worthwhile to give a very, very basic overview of what the structs might look like (as I imagine them -- you are all more expert here than me):

struct RowMajorArray<T>{
    /// The data, stored in row-major order
    data: *mut T,
    /// The dimensions of the data
    dim: Vec<usize>,
}

ColumnMajorArray would have an idetical layout with a different name

the *View types would be the same but have a PhantomData<&'a T> attached to them to track their lifetime.

Notice that the data itself is just a mutable pointer whos length can be determined by self.dim.fold(|f, v| f * v) (multiplying the dimensions together), which saves us a usize. Probably we wouldn't want to use Vec for the dim field (and add a dim_len: usize) so that we can interface with nostd crates.

Also notice that this is the "simplest possible implementation." There are no type defined dimensions. The idea here is that creating/reading type dimensions introduces only a small cost and is worth it in order to have a common interop interface.

Again, this is a proof of concept to get the ball rolling. Please provide feedback as I am NOT the expert here.

bluss commented 7 years ago

I'm thinking of this as a conversion interface, so in terms of function parameters rather than a struct. Of course it could be seen as equivalent.

We will want to be able to convert data structures without allocating a vec for the dimensions. Fixed dimensionality types want to be able to implement the trait in a way that identifies a source of data of a particular dimensionality. The common case is one and two dimensional data.

my understanding is that both 2D-Arrays and Matrixes typically store their data in the same way: a single long row-major contiguous vector.

This is indeed the default memory layout of ndarray, but it is flexible. An owned array can be row- or column- major and (only applies if number of axes > 2) have other orderings of the axes as well. An array view can use custom strides for any axis.

Rulinalg has MatrixSlice which supports a custom row stride. As of this writing to my knowledge, ndarray can always represent rulinalg's data layouts but not the other way around. rulinalg can perfectly represent the default layout of ndarray's 1D and 2D arrays.

An illustration (picture) of how the strides work for the 2D case are in slide 7 and 8 of this old talk https://bluss.github.io/rust-ndarray/talk1/ I know there is no good accompanying description. This representation is used in linear algebra and other places to efficiently represent a cut-out from a larger matrix. (in ndarray, this is generalized to more dimensions than just two).

vitiral commented 7 years ago

@bluss I have made some edits to my post to include both RowMajorArray and ColumnMajorArray types. Do you think that is sufficient?

Also, let me give a more full description by including a potential trait implementation

Edit: I realized this could all be a single trait


pub trait ConvertRowMajorArray<T> {
/// conversion from a RowMajorArray can fail 
fn from_raw_array(array: RowMajorArray) -> Result<T>;

/// conversion to a RowMajorArray cannot fail
fn to_raw_array(array: T) -> RowMajorArray;

/// exposed API
pub fn from_array<F>(from: F) -> Result<T> 
where
    F: ConvertRowMajorArray
{
    T::from_raw_array(from.to_raw_array())
}

}


I *believe* if both array libraries simply implemented the conversion trait then users could always do something like the following to convert the type:

let array_b = lib_b::Array::from_array(array_a).unwrap();


As far as I know (would love feedback!), the conversion can *only* fail if the types are different dimensions. Other errors (i.e. type errors, etc) would be caught by the compiler. Therefore the `Result` would be defined as:

pub enum Error { /// The Arrays have different number of dimensions /// (i.e. 2x2 and 2x2x2) DimensionNotEqual, /// The Arrays have the same number of dimensions /// but are different sizes (i.e. 2x2 and 2x3) SizeNotEqual }

pub type Result = std::result::Result<T, Error>;

bluss commented 7 years ago

I think it makes sense to have lossless conversion between different 2D data types. This is Rust so we should use types that properly convey the ownership (Either a custom type with ownership semantic in a pointer, or Box<[T]> or Vec<T> which have ownership semantics.) I think that it does not have a noticeable overhead to convert any owning raw pointer to a Vec and back.

ndarray does not support type level properties that guarantee a particular memory layout: the conversions from an ndarray need to be fallible if they are not going to reallocate and copy data. The fallibilities in the proposal sounds inversed to me. Conversion from Row major array cannot fail, but into it can.

Supporting just row major and col major in the conversion traits sounds good to me.

Can we have a solution where we have gradually more type level information in the trait, depending on what the data structured involved support? Here's a rough grading, with excuses if I don't know all the libraries that well. (*)

Dynamic dimensions and axes → fixed number of axes of dynamic length → some axes of fixed length →fixed number of axes of fixed lengths

With example: ndarray::Array::<T, IxDyn> → ndarray::Array<T, Ix2> or rulinalg::Matrix or MatrixSlice or nalgebra Matrix → nalgebra something → nalgebra something

Also, another scale:

Dynamic memory layout → inner axis contiguous → entirely contiguous

With example ndarray::Array<_, _> → rulinalg::MatrixSlice → rulinalg::Matrix or nalgebra matrix.

I think it is good enough to keep conversion on the level of fixed number of axes of dynamic length for example, or provide both that and the dynamic axes flexibility as an add-on.

(*) Yet I don't want to involve typenum in this. Rust doesn't have a good solution to integer generic parameters, yet.

vitiral commented 7 years ago

I think that it does not have a noticeable overhead to convert any owning raw pointer to a Vec and back.

I requires a usize for each pointer to record the length (which isn't strictly necessary). That's the only thing I was trying to avoid. I am open to the opinion of library authors, but there might be cases where you have hundreds or thousands of small matrixes that you want to convert where that usize is a precious comodity (on the other hand, you are probably converting them one at a time so probably not that big of a deal).

The fallibilities in the proposal sounds inversed to me. Conversion from Row major array cannot fail, but into it can.

Ah, I hadn't thought of that use case... you're right, we should probably add another error condition where the data is not laid out correctly. Then that error would be part of the Error type as DataLayoutInvalid.

I'm a bit confused why you think it is "reversed" -- are the error conditions I listed not valid when converting from a RowMajorArray?

Can we have a solution where we have gradually more type level information in the trait, depending on what the data structured involved support?

This is certainly possible, but I want to make sure I understand the use case. Having only one type would be:

PRO: much simpler: you would really only need to implement one trait for your library's types
CON: would have added cost of creating/checking the dimension attribute
CON: would not have as many type-checked benefits.

I think having the following be compile time defined make sense:

data layout (RowMajor or ColumnMajor)
axis up to 3 (maybe 4?) with dynamic length

These should be relatively simple and will have 8 structs and traits. I feel more than this might be biting off more than we can chew too quickly. My gut is that having up to 4 axes is the best way to go, at least until rust has compile-time integers.

vitiral commented 7 years ago

Also: I feel pretty adamant about supporting nostd users. As a microcontroller developer there were many cases where I wished there was a simple array library. Rust has the capacity to support that use case. In addition, I really don't think supporting that use case will be difficut.

Personally I would use something like arrayvec to do the dirty work here.

bluss commented 7 years ago

I requires a usize for each pointer to record the length (which isn't strictly necessary).

Saving an usize here while making dimensions a Vec<usize> doesn't add up.

I'm a bit confused why you think it is "reversed" -- are the error conditions I listed not valid when converting from a RowMajorArray?

I was thinking of that if a data type implements the trait FromRowMajorArray, then it can always represent row major arrays. (As long as the implementation can restrict itself to for example 2D data on the type level.) You're right though, it is more general to allow an error there.

I do not think that supporting more type information means that we need 8 structs.

nostd sounds good, but at the same time it's a significant innovation above the status quo (because none of the invited crates support nostd)

vitiral commented 7 years ago

Saving an usize here while making dimensions a Vec doesn't add up.

What should dimensions be?

I do not think that supporting more type information means that we need 8 structs.

I'm definitely not an expert in this -- would love to know how to generalize this!

Andlon commented 7 years ago

Sorry for not keeping up with the conversation, I'm a little busy these days.

I think perhaps this is evolving into something more complicated than it needs to be, and it's also why I posed my previous question of whether we're talking about interop for linear algebra matrices and vectors or multi-dimensional arrays.

Let's consider the need for such a crate in the current ecosystem. There are multiple crates that have custom storage for dynamically sized matrix-like structures: perhaps most prominently ndarray, rulinalg and nalgebra (since linxal uses ndarray). My understanding of what was discussed on reddit was primarily that we want a way to seamlessly interop between these (and future) crates. As far as I can see, there is absolutely no point in building interop between higher dimensional array-like storage at the moment - and it's possible that there will even never be a need to do so since ndarray more or less completely fills the niche of general multi-dimensional arrays. If this need arises, such interop functionality can always be added, but implementing such functionality at the moment is trying to solve a problem that does not exist, in my opinion.

With this in mind, I suggest we keep it simple. What we need is to interop vectors and matrices, or equivalently 1D and 2D arrays. In the 1D array case, it can of course just be represented by e.g. a Vec for owned storage and a slice for "views". In the 2D array case, we need the following:

Owned: Vec (or equivalent) and rows: usize, cols: usize, as well as a notion of row major/column major storage (I agree with bluss' assessment that supporting general stride is too niche for interopability).
Views: probably a pointer, rows, cols and stride information. Probably need separate immutable/mutable types?

I like the idea of having separate traits for column major and row major storage. Particularly because afaik, at the moment nalgebra only supports column-major matrices and rulinalg only supports row-major matrices, and it's nice to communicate through the trait system that conversions between these types cannot possibly be zero-copy.

That, however, brings up an additional point. The point of this interop library is the ability to seamlessly work with matrix-like types from different libraries. However, the user experience would be horrible if we would straight up refuse to interop between two libraries because they don't support the same storage order. I believe that if zero-copy conversions cannot be done, we should still facilitate functionality so that data can be very easily copied (if needed) into the right format. However, we should make this functionality separate (i.e. separate traits) so that the user must opt-in to conversion with possible copy overhead, thus making it explicit. Of course, this only works for owned storage, and is not at all applicable to "views".

Moreover, if we restrict ourselves to 1D and 2D arrays, we can also simplify the API by avoiding return of Result types, and just return the appropriate arrays directly. In this case, we can completely communicate whether or not the array can be converted through the type system.

I know I've presented some very strong opinions in the above, and it's also possible I've misunderstood some of the above discussion. If I have made any mistakes or if you disagree with any of the opinions, please let me know!

vitiral commented 7 years ago

@Andlon thanks for the very thorough reply and I appologize for my absense. Had a more difficult time at work the last month and my wife and I are getting very close to delivering our second child... so expect there to be a gap in my participation in the near future as well.

The original idea for this library was to provide generic ndimensional interop between libraries. The reasoning was simple: adding dimensional information is MASSIVELY more efficient than the data itself, so that is an acceptable overhead.

However, having the type system check at least your dimensions is valuable and so I think the general consensus is to piviot to at least providing 1D, 2D and 3D arrays (so one more dimension than you suggest, but still the same idea).

I also like the suggestion of focusing on Views -- I see views as essential for interop, as continually loosing ownership in order to enable interop would be extremely annoying. I also fully support being able to migrate from row-major -> column-major (and vice versa) within the library itself, but to have to do so explicitly. My personal preference would be to do this with a method (example: row_major_data.to_column_major()) instead of a trait, but it could be accomplished either way.

Unfortunately, we cannot always avoid Result if we want to be able to support Sized types (i.e. Matrix2x3 for a 2x3 matrix in nalgebra)), since the conversion could still fail. Maybe we want to only support matrixes whose size is only known at run time for the first pass and use a separate trait for converting to/from type-sized matrixes? I think having separate traits would be extremely beneficial for the API, personally I always prefer the type system to tell me when errors can or can't happen.

With these design decisions I wouldn't be surprised if we were very close to attempting a library for interop support. Does anyone else have a major objection?

vitiral commented 6 years ago

I'm thinking this issue should probably be put on hold until const generics are a thing, as that will be a more "final" API for data analysis libraries. The other option would be to only support a small subset using nalgebra's strategy of having U1, U2, etc types and only support 1D, 2D and 3D arrays until const generics arrive.