fre-hu / mdarray

Multidimensional array for Rust
Apache License 2.0
11 stars 0 forks source link

Naming suggestions #2

Closed grothesque closed 2 weeks ago

grothesque commented 2 months ago

Coming from C++, Python/Numpy and Rust, my impression is that some of the most central concepts of mdarray are named in a confusing way. Perhaps this is due to my rudimentary understanding, but perhaps the author will find an outsider's perspective interesting. Here are some of my thoughts.

Span

In C++ there is std::span, but in Rust the equivalent is slice. The same name is used in Python/Numpy, so many potential users have been exposed to it. Wouldn't it be better to use the name Slice instead of Span in mdarray? (E.g. there would be Slice and RawSlice, etc.)

But if I understand correctly, mdarray's Span is not really the equivalent of C++'s mdspan. (That would be rather Expr.) Mdarray's Span seems rather a simple reference to an existing array. So how about renaming it to ArrReference or ArrRef?

Grid

I'm not aware of the name "grid" being used in numerical computing, except in the sense of a physical grid (e.g. a hexagonal grid), or in "grid computing". So while short, Grid does not seem intuitive for "owned array". How about renaming Grid to simply Array? Alternatively, if ownership is to be emphasized, how about OwnedArr?

Array

Static arrays in mdarray are called Array which nicely mirrors the usage in Rust, but arguably conflicts with the expectation that the name "Array" in an array library denotes a heap-allocated array. How about renaming it to StaticArr?

Expression

In computing, the term "expression" means "something that can be evaluated to determine its value". If my understanding is correct, mdarray's trait Expression is an array iterator. Why not call it ArrIterator then? Then there could be IntoArrIterator etc. This naming should be much clearer to people having some experience with Rust.

Expr

If I understand correctly, Expr is an "owned view". While Span is just a "reference to an array", and Expr has its own shape and stried. Only the actual array values are that of another array. So how about calling it ArrView?

shorter names

Perhaps the different names proposed here are deemed a bit long. One possibility would be to get rid of the "Arr" and simply have Iterator, Ref, View, Static, Owned. If mdarray is used as md these would become md::View or md::Iterator which I'd say are fine.

My gut feeling is that the longer names are preferable, though. It feels a bit presumptuous to introduce another Iterator trait.

fre-hu commented 2 months ago

Thanks. It is good to get comments, and naming is not easy and a compromise.

For Span, I agree Slice could work and is commonly used. The reason for using Span is mainly to align with C++, but it is not that important. One slight drawback with Slice is risk of confusion with normal slices, but it is the case also for Array.

Grid is maybe the most controversial, but it is being used for multidimensional arrays for example in some crates. One alternative could be Tensor, which is well known and used e.g. in C++ Eigen/xtensor and in Python. Would that work?

For Expression/Expr/... the idea is as you have seen to mirror the iterator types. But they can be seen as more than iterators, for example there are operators defined and they can be mixed with arrays. So they are not that different from expression templates in C++ libraries. I also wanted to have short names for expr/expr_mut/into_expr since they could be quite common.

grothesque commented 2 months ago

preliminary remark: Interestingly, Rust’s Vec is actually an example for an (initially) bad name: Rust took over the term “vector” for “dynamic array” from the C++, and even its creators considers that this was a bad choice.

However, due to the prominence of C++, one can argue that by now the term “vector” has acquired the meaning “dynamic array” as well, so I think it’s fine for Rust to use it.

Still, I think it would have been better to use Array instead of Vec in Rust's stdlib. Static arrays have special syntax to denote their type, so that there would be no StaticArray type, and using the term “static array” in the documentation seems fine. This would have made it very clear to everybody that both are very similar except for how memory for their payload is allocated.


Returning to mdarray, in my opinion the current names do work as pointers into documentation, but not as more. They have the advantage of being compact, but otherwise they convey no meaning even to someone who has experience with numerical computation. They do not even work as aid to memory.

This could be fine. After all it’s OK to introduce some jargon, and if the four terms array, grid, expr, and span are introduced clearly and prominently in the documentation, it may be not too much to ask from users to simply learn their meanings as used in this library.

However, I think that the names could be improved. This is my personal impression, and I write this down in the hope of being of some help, so feel free to completely ignore...


An Expr may be like an iterator, and like an expression template, but other array types convert easily into Expr, so for practical purposes do not differ in these regards. On the other hand what distinguishes Expr from other array types is that it owns its metadata but references its data. I would argue that the term "expr" does not convey this at all, while the term “view” does, and also has the advantage of already being used by other similar libraries.

For Span, in my understanding the crucial thing is that it is a simple pointer, a reference. It is not like a C++ mdspan. So the name "span" does not mean anything for those without C++ exposure, but for those with C++ exposure it’s actually confusing.

Same with Grid. Isn’t it like an Expr, only that it owns its data? I think that "tensor" would be equally confusing, because the name “tensor” could imply several different things (for example in physics a tensor is a table of numbers that transforms in a particular way), but not what is meant here. There do exist libraries that call all their numerical arrays "tensors", but when "expr" and "tensor" appear in the same library, one would never guess that they differ in how they hold data.

To me, the biggest problem is how to cleanly name the two concepts that are currently called Array and Grid. If there was only one of them, one could simply use the name “array” for it.

Perhaps this confusion is just the symptom of a deeper problem? Couldn’t one get rid of Array<f64, (Const<2>, Const<2>)> and replace it by Expr<f64, (Const<2>, Const<2>)>, etc.? Aren’t both effectively a pointer to data (that happens to be 'static' in the case ofArray`)? If this was feasible, three concepts would be sufficient:

But I suspect that I am missing something here...

fre-hu commented 2 months ago

One thing that causes confusion is that Expr/ExprMut are used for both array views and expressions. It is probably better to call them View/ViewMut instead. It makes it clear they are array views, and that they happen to implement Expression like some other types. To make it consistent, IntoExpr should no longer be an array type that that can be dereferenced to Span.

The existing Expr/ExprMut types could of course be kept for expressions, and that View/ViewMut implement IntoExpression instead of Expression. However I think it seems better to remove them to simplify and reduce the number of types.

Renaming Span to Slice is fine I think, as the latter is more well known. It also makes sense since it is unsized and must be used behind a reference like normal slices.

Regarding owned arrays, the Array type is just a wrapper around a normal array. It contains no pointer, so it can't be replaced with an array view. Given that there are two owned array types, it seems better to use Array for the fixed array since it is a wrapper and to make it consistent with normal arrays. I don't think it is that bad to use Tensor for a dynamic array, since it is commonly used for a multidimensional array in other libraries.

So with this my current thinking is to do these renamings:

grothesque commented 2 months ago

Both in Rust and Numpy/Python a slice is a view of a part of a container, just like a slice of bread is a part of the whole loaf. Crucially, except in trivial cases, the same container can be sliced in many different ways. But the type that is currently called Span doesn't share this property if I understand correctly. Isn't Span similar to what is proposed in https://github.com/rust-ndarray/ndarray/issues/879 or the "reference type" of https://github.com/akern40/ndarray-design/tree/design-doc?

In addition, in Rust references to slices are wide pointers, which would not be the case here and could lead to further confusion, since "slice" is such a well-known concept in Rust.


Thanks for explaining the difference between Array and Expr. Since Array is just a wrapper around built-in arrays I think that its name is justified.

"Tensor" is definitely much clearer then "grid".

What I don’t like in the resulting naming scheme is that the other types (i.e. View or Array) are just as much tensors as Tensor itself. The differences between the names do not match the differences between the concepts.

I think that the following naming scheme (for example) would be more symmetric: View (& ViewMut), Array, Owned, and Ref. All of these would be tensors that can be indexed etc. The names would highlight the most relevant differences between them.


Or perhaps you don’t like Ref because it’s actually not a reference in the sense of C++/Rust. After all, how should a &Ref be called, a reference to a reference? (Although Rust documentation is guilty of the same naming imprecision when it uses the term "slice" for both [T] and &[T]...)

The type that is currently called Span is like an OOP parent class for tensors. It’s derived classes are Expr, Array, etc. So why not call it Tensor? I think that could be very cool, i.e. the matmul example would become:

fn matmul(a: &DTensor<f64, 2>, b: &DTensor<f64, 2>, c: &mut DTensor<f64, 2>) {

This is much clearer, isn’t it? I think that I would like the following a lot:

Now we have one basis concept ("tensor"), and three realizations!

fre-hu commented 1 month ago

As I'm thinking, Span (or Slice) is very similar to the Rust slice type, and it can represent a part of the container but also the full container. For example with vec.iter() you don't get an iterator over a vector but an iterator over a slice, since vec will first dereference to a slice.

It does not matter how Span is represented internally, and what is important is the interface to the user. Now &Span is a thin pointer, but if/when custom DST is possible I would like to represent it with a wide pointer to the elements and to the mapping type. This is better for the compiler since direct references can have noalias annotations which should help optimizations.

Thanks for the feedback on Tensor vs Grid. I see what you mean about using Tensor for one type when other types are also tensors, but I think it is difficult to avoid. There is the same issue with normal arrays and Vec, that you should use slices in interfaces even though they are arrays/vectors. To me it is all just multidimensional arrays, and I want to find names that are reasonably correct and fit together.

Regarding array references, yes I think it would be better to use ArraySlice instead of ArrayRef. The reason is that &ArrayRef looks like a double reference, and for mutable references it is unclear if they should be called &mut ArrayRef or &mut ArrayMut. But otherwise the proposal is good.

grothesque commented 1 month ago

As I'm thinking, Span (or Slice) is very similar to the Rust slice type, and it can represent a part of the container but also the full container. For example with vec.iter() you don't get an iterator over a vector but an iterator over a slice, since vec will first dereference to a slice.

Sure, but currently in mdarray all spans that reference some array (say a Grid instance) are necessarily identical, because under the hood they are all pointers to the same RawSpan embedded within that grid instance. Is my understanding correct?

It does not matter how Span is represented internally, and what is important is the interface to the user. Now &Span is a thin pointer, but if/when custom DST is possible I would like to represent it with a wide pointer to the elements and to the mapping type. This is better for the compiler since direct references can have noalias annotations which should help optimizations.

Say we have an n-dimensional array and we want to iterate over some m-dimensional slices, where m <= n. In the general case, the mapping object must store the shape and the strides of the slices, so this can be quite a large object which won't fit itself into the wide pointer. However, all slices that one iterates over share the same shape and strides, so they can use the same mapping object. So, if I understand correctly, you would store the mapping as part of the iterator object, and then the wide pointer to a slice would point to the data (this pointer will wander among the data), and to the mapping inside the iterator. The lifetime of each slice will be bound to the lifetime of the iterator. Is this what you'd like to do one Rust supports custom DSTs?

If yes, then I fully agree that this future thing would be great, very useful, and should be called something like "slice".

If we can be sure that the API for these great future slices will match the API of the current restricted pseudo-slices, then I think there's a point in renaming current Span into Slice for the sake of future compatibility. (But document clearly that currently Slice is unfortunately not yet a real slice.)

But if we can't be sure, I think it's better to use a different name, and reserve Slice for future usage, when hopefully custom DSTs are available.

fre-hu commented 1 month ago

Yes and no. If you create a Span reference from a Grid, it will point to the RawSpan that is embedded in the grid. So if you copy the reference around or dereference the grid several times they will be identical. But you can also create a view to the same array, so that you get a View or ViewMut instance that contains a new RawSpan internally and can then be dereferenced.

So the difference compared to normal slices is that Span needs an array instance to point to, and not just the array elements. This can only be avoided with custom DSTs that are larger than two words. But otherwise the important function of Span is that it is the type that all arrays can dereference to, and can be used for example in function arguments.

For iteration, the shape and strides are stored in the mapping type in each array instance. If the iteration is over several arrays, they will each have a mapping type (since strides might be different per array) and the arrays are stored recursively in Zip structures.

When iterating over Span references, they must first be converted to array views that contain a RawSpan and lifetime. This is similar to when iterating over normal slices, since they must first be converted to Iter or IterMut.

grothesque commented 1 month ago

OK, so if Span is the closes thing available to a "real" slice, then I think that calling it "slice" is justified. So your current intention is to rename in the following way?

I think that's really quite good, certainly much clearer than the current names! In a way Tensor relates to Array just like Vec relates to "array" in the standard library, so that's OK as well, although I consider Vec is a misnomer (see links above)...