SciSharp / NumSharp

High Performance Computation for N-D Tensors in .NET, similar API to NumPy.
https://github.com/SciSharp
Apache License 2.0

Support numpy.dtype #123

Closed Oceania2018 closed 5 years ago

Oceania2018 commented 6 years ago

We have to refactor the core NDArray class in order to support numpy.dtype. The key change will be the internal array storage. I'm going to retire the generic type design after v0.4 is released. Separating the internal storage will improve performance by about 25% and make NDArray a true NumPy in .NET. I've run some experiments, and it seems feasible. The benchmark also looks improved.

image

@dotChris90 @fdncred @AoaAquarius What do you think of it? Comment please.

dotChris90 commented 6 years ago

If you upload the benchmark tests and we take a look, we can decide.

I am neither for nor against it. Since it is a critical decision, we need to check performance and usability very strictly.

But yeah, why not. If you think it's feasible and good for users, I am in.

The last big change to NDArray (storing all data in one array) was brilliant, so we can change the generic design too.

Oceania2018 commented 6 years ago

Uploaded the code; you can run dotnet NumSharp.Benchmark.dll nparange. I'll do more tests later.

fdncred commented 6 years ago

@Oceania2018 , my fear on using object is that it introduces boxing which is notoriously slow. however, I think that if we benchmark everything we can make a decision based on benchmarks.

dotChris90 commented 6 years ago

True. That's why we need to benchmark.

If we can't find a cast that performs well, we must keep things as they are now.

I'm really thinking about performance-critical operations like matrix multiplication, inverse, SVD.

We can't do this with boxing. :) Just stuff we need to investigate.

Otherwise dtype needs to return the generic type, like double.

dotChris90 commented 6 years ago

I mean something like:

Type dtype { get { return typeof(T); } }

fdncred commented 6 years ago

ya, I like passing in a data type and then switch/case on that type.

dotChris90 commented 6 years ago

Yes, but operations like matrix dot, inv and so on usually have 2 or 3 data arrays which need to be cast.

I am 100% sure switch/case works, but I am not sure whether the as keyword can cast an object array to a double array.

At the moment inv, dot, operators etc. work with 1 switch and 1 or 2 as keywords.

The result array is the switch candidate, and the 1 or 2 operands arr1 and arr2 are cast with as. But I'm not sure whether the keyword works with an object array or just with generic types.
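
For readers following along, the "1 switch plus 1 or 2 as casts" pattern described above might look roughly like the sketch below. GenericNDArraySketch and its Add method are hypothetical names invented for illustration, not the actual NumSharp classes, and only double and int are handled.

```csharp
using System;

// Hypothetical, simplified stand-in for the generic NDArray<T> discussed here.
public class GenericNDArraySketch<T>
{
    public T[] Data;

    public GenericNDArraySketch(T[] data) { Data = data; }

    // With the generic design, dtype is simply the type argument.
    public Type dtype { get { return typeof(T); } }

    // Element-wise add: one switch on the result array, "as" casts on the operands.
    public GenericNDArraySketch<T> Add(GenericNDArraySketch<T> other)
    {
        if (other.Data.Length != Data.Length)
            throw new ArgumentException("Operands must have the same length.");

        var result = new GenericNDArraySketch<T>(new T[Data.Length]);

        // Going through object keeps these as reference conversions; nothing is boxed.
        switch ((object)result.Data)
        {
            case double[] res:
            {
                var a = (object)Data as double[];
                var b = (object)other.Data as double[];
                for (int i = 0; i < res.Length; i++) res[i] = a[i] + b[i];
                break;
            }
            case int[] res:
            {
                var a = (object)Data as int[];
                var b = (object)other.Data as int[];
                for (int i = 0; i < res.Length; i++) res[i] = a[i] + b[i];
                break;
            }
            default:
                throw new NotSupportedException("dtype not handled in this sketch: " + typeof(T));
        }
        return result;
    }
}
```

The casts happen once per call, outside the loops, so their cost does not grow with the array size.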

Oceania2018 commented 6 years ago

Sorry, that might have been confusing. I'm not talking about using object[]; I'll use int[] to store int values and double[] to store double values. That is what I mean: a dedicated array to store each data type. As you can see, the performance is improved and no boxing is needed.

fdncred commented 6 years ago

I was confused because your class doesn't use object[] LOL. I like what you've done with dtype.

dotChris90 commented 6 years ago

Ah sorry, got it.

But unfortunately I don't see any benefit in this then.

The reason for generics is really that we don't have to implement similar classes x times. Like List of int.

If you're worried about dtype, it's simply our generic type.

So return typeof(T).

At the moment I don't understand how you want to support different numerical types. Does every NDArray have int[], double[] etc. arrays? Maybe if you show an example of an NDArray of double and one of int, things will become clear :)

dotChris90 commented 6 years ago

Or let me put it like this lol: if you give me an example of how to create an NDArray of double and of int, I will try to implement the operations and the dot method.

If it's easy to implement, I say let's rewrite. But these linear algebra operations are quite important in NumPy, so that's why I always want to try them first when rewriting. :)

dotChris90 commented 6 years ago

@Oceania2018 @fdncred just some considerations.... Again.... Sorry for being so annoying lol.

If you really want to give an NDArray one array for every possible numerical type... then most of the properties are always null, since just one storage is in use.

Plus we must now specify which data types are possible. Double, float, complex, quaternion,... And in theory even chars and datetime (for a pandas index, datetime is a must).

I just worry about the following.

With the suggested strategy we need a lot of switches in every method, because every type has its own storage. This will end in larger code. For operations, dot, inv,... that's fine, since there we will always have a switch anyway.

The generic T[] strategy has the benefit that in some situations we don't need a switch, since we know it's an array. So indexing, transpose, permutation,... all operations which are independent of the dtype do not require a switch and we just code them once.

That was all :)
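
To make the trade-off above concrete, here is a rough sketch under stated assumptions. PerTypeStorageSketch and LayoutOpsSketch are hypothetical names, the data is assumed to be row-major, and only two dtypes are shown. A dtype-independent operation such as a 2-D transpose can be written once as a generic helper, while a storage with one field per type still needs a dispatch just to pick the field in use.

```csharp
using System;

public static class LayoutOpsSketch
{
    // dtype-independent: written once for every element type.
    public static T[] Transpose2D<T>(T[] data, int rows, int cols)
    {
        var result = new T[data.Length];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                result[c * rows + r] = data[r * cols + c];
        return result;
    }
}

// Hypothetical "one field per type" storage: only one field is non-null at a time.
public class PerTypeStorageSketch
{
    public Type dtype;
    public double[] Doubles;
    public int[] Ints;

    // Even a dtype-independent operation has to branch to find the active field.
    public PerTypeStorageSketch Transpose2D(int rows, int cols)
    {
        var result = new PerTypeStorageSketch { dtype = dtype };
        if (dtype == typeof(double))
            result.Doubles = LayoutOpsSketch.Transpose2D(Doubles, rows, cols);
        else if (dtype == typeof(int))
            result.Ints = LayoutOpsSketch.Transpose2D(Ints, rows, cols);
        else
            throw new NotSupportedException("dtype not handled in this sketch: " + dtype);
        return result;
    }
}
```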

fdncred commented 6 years ago

I traditionally vote for whatever provides the best performance. Regarding multiple case/switch statements in every class, I think that is what T4 templates are for. Perhaps we should investigate their usage.

Docs

dotChris90 commented 6 years ago

Unfortunately T4 templates do not work with .NET Core....

LOL, I was also thinking about this before :)

But anyway, if any of you know a code generator for this use case, post it here ;)

fdncred commented 6 years ago

This link is about core + t4

dotChris90 commented 6 years ago

OK, seems things are slowly changing lol

Thanks for sharing the link. Will read it tomorrow.

Oceania2018 commented 6 years ago

// current generic design
var nd2 = new NDArray<double>();
nd2 = np.arange(1000 * 8 * 8 * 8).reshape(1000, 8, 8, 8);
var nd3 = nd2.AMin(0);

// proposed dtype design
var nd2 = new NDArrayWithDType(NDArrayWithDType.double8);
nd2 = nd2.arange(1000 * 8 * 8 * 8, 0, 1).reshape(1000, 8, 8, 8);
var nd3 = nd2.AMin(0);

image

The dtype design will make NumSharp a better fit for Pandas.NET. That's what I'm trying to do. I abstracted the NDStorage class to hold all kinds of data types, with no performance compromise.

fdncred commented 6 years ago

While doing some research for Span<T> I stumbled upon this set of classes that seem somewhat similar to what we're trying to do here. It's worth a look.

dotChris90 commented 6 years ago

OK, now I see more clearly.

The storage is a brilliant idea. I think with this strategy pandas should be easier.

The storage could also contain methods for casting or methods that manipulate the arrays, so it's easier to handle array stuff in one class.

OK, nice idea.

Oceania2018 commented 6 years ago

@dotChris90 I'm glad you've got the point. So, are we going to move forward and replace the NDArray?

dotChris90 commented 6 years ago

I am in. Yes, we will do it. When I find time I will try it out with the operation methods and dot.

Oceania2018 commented 6 years ago

Branched v0.4 and released to NuGet, will refactor on master.

dotChris90 commented 6 years ago

@Oceania2018 also check #127. We must think about abstract interfaces. Otherwise rewriting will be very hard in future....

Oceania2018 commented 6 years ago

@dotChris90 I'll think about that. The difficult thing is that we can't provide an interface like T[] Get<T>() or object Get() due to performance issues.

dotChris90 commented 6 years ago

Performance is quite important, but following software patterns, good style and a clean architecture is a must. And just a warning: NumSharp is not too big at the moment, but in 2 years? That's the reason numpy and pandas also have abstract classes / interfaces. :p

If the storage has a cast method like Get<T> which returns a cast of the whole array, like double[], it's OK. A method call is only performance-critical if you call the method in a loop. If it returns the whole array, it's OK.

I will also experiment when I find time. Just want to say: slowly we need to start thinking about software design and not just performance. The fans of NumSharp need it. :)

dotChris90 commented 6 years ago

@Oceania2018 @fdncred hm guys, not sure if you knew this before... just one thing I found out.

These days we talked about dtype in a way that every NDArray has multiple properties of the form "int32[]", "int64[]", "double[]", "Complex[]" and so on. Yes, that is an option. I just don't find it the best one, since in the OOP world we should avoid unused properties. And face it: a double NDArray will not use int[], so we have a lot of unused properties.

So what about a storage class with just one non-generic property for all? Impossible? No. System.Array would be a nice option. Look:

Array a = new double[]{1,2,3,4};
var b = a as double[]; // I was shocked that casting is so easy without boxing

Array would be one property for all possible data types: doubles, strings, complex, int, int64, ... Any array can be stored in a System.Array, independent of type. The storage class should have a method for casting that array into the correct type and returning e.g. a double[].

Moreover the storage's properties (so double[], int[] or, as suggested now, Array) should be protected and not public. We are not Python. C# uses protected etc. to show developers and users "don't touch! this can change any time!" ;)
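
A minimal sketch of such a storage class, assuming a single protected System.Array field and a cast method that returns the whole buffer. ArrayStorageSketch and GetData<T> are hypothetical names for illustration, not NumSharp's actual NDStorage API.

```csharp
using System;

// Hypothetical storage class: one non-generic field serves every dtype.
public class ArrayStorageSketch
{
    // protected: callers go through GetData<T>() instead of touching the buffer.
    protected Array values;

    public ArrayStorageSketch(Array values)
    {
        if (values == null) throw new ArgumentNullException("values");
        this.values = values;
    }

    // The element type of the stored array plays the role of dtype.
    public Type dtype { get { return values.GetType().GetElementType(); } }

    // Returns the whole buffer as T[]. This is a reference cast: no copy, no boxing.
    public T[] GetData<T>()
    {
        var typed = values as T[];
        if (typed == null)
            throw new InvalidCastException("Storage holds " + dtype + ", not " + typeof(T));
        return typed;
    }
}
```

Because GetData<T>() hands back the whole array, a caller pays for the cast once and can then loop over a plain double[] or int[].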

Oceania2018 commented 6 years ago

@dotChris90 Totally agree with you. We will find the balance between design elegance and performance. Pull the code and check whether you are OK with the new NumPy class. I've finished np.arange and np.reshape.

Oceania2018 commented 6 years ago

Array looks like a good option.

dotChris90 commented 6 years ago

Yeah. Your code looks promising. Only the value property of the storage returns an object[], whose elements must be boxed and which can't be cast via the as keyword. This is the only thing on my mind now.

Side information: the reason is that object[] and double[] (or int[] or any others) are like sister and brother. They are both children of System.Array. But the cast via the as keyword (which had the best performance) is only possible between classes in the same hierarchy, so parent and child, not brother and sister. So System.Array can be cast, while object[] must be boxed.
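
The behaviour described above can be checked with a small sketch. ArrayCastSketch is a hypothetical helper written for this illustration. An Array reference that really holds a double[] casts back to double[] with as, but not to object[]; arrays of reference types such as string[] do cast to object[] because of array covariance, which does not apply to value-type elements.

```csharp
using System;

public static class ArrayCastSketch
{
    // True when the runtime type of 'a' can be viewed as TArray without copying.
    public static bool CanCast<TArray>(Array a) where TArray : class
    {
        return (a as TArray) != null;
    }

    // Getting an object[] out of a value-type array needs a copy that boxes each element.
    public static object[] ToObjectArray(Array a)
    {
        var boxed = new object[a.Length];
        for (int i = 0; i < boxed.Length; i++)
            boxed[i] = a.GetValue(i);   // GetValue returns object, so each double is boxed here
        return boxed;
    }
}
```

For example, CanCast<double[]>(new double[] { 1, 2 }) is true, CanCast<object[]>(new double[] { 1, 2 }) is false, and CanCast<object[]>(new string[] { "a" }) is true.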

dotChris90 commented 6 years ago

BTW, if you don't mind, I can do the abstract stuff next week when I'm back in Germany.

I know at the moment it's not too important for functionality, just for architecture and design.

Oceania2018 commented 6 years ago

Array a = new double[]{1,2,3,4};
// can't access an element this way: a[0]
var b = (a as double[])[0];

If Array works, we might not need NDStorage. The Array is the storage.

dotChris90 commented 6 years ago

Ah, be careful. We still need the storage for casting. The Array cannot be indexed (for an alternative see SetValue and GetValue at the end).

It sounds strange and somehow actually is strange. Indexing is only possible if the type implements the IList interface.

Array just means: a homogeneous area in memory.

double[] implements IList and is a child of Array.

But in my opinion that's fine. Usually I always have to cast the array to its true data type, even with generics. Operation methods, dot, inv, nearly all algebraic operations require a cast, since not all children of Array support + - / *....

I am not sure how many NumSharp methods work with indexing alone, without + - / *.

Anyway, an alternative is SetValue and GetValue, if I remember correctly.

Oceania2018 commented 6 years ago

I'm testing Array. If there is no performance compromise, I'll use it instead of those ugly int[] and double[] things.

dotChris90 commented 6 years ago

And one more point.

Since the Array class is quite abstract, the return value of GetValue is object.
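
The three ways of reading elements from a System.Array mentioned in this exchange can be compared in a short sketch. ArrayAccessSketch is a hypothetical helper and assumes a one-dimensional array of double. One detail worth noting: System.Array itself implements the non-generic IList explicitly, so a cast to IList gives an indexer, but it returns object just like GetValue.

```csharp
using System;
using System.Collections;

public static class ArrayAccessSketch
{
    // GetValue returns object, so every element read is boxed and then unboxed.
    public static double SumViaGetValue(Array a)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
            sum += (double)a.GetValue(i);
        return sum;
    }

    // The IList indexer also returns object (valid for one-dimensional arrays).
    public static double SumViaIList(Array a)
    {
        IList list = a;
        double sum = 0;
        for (int i = 0; i < list.Count; i++)
            sum += (double)list[i];
        return sum;
    }

    // Cast once to the real element type, then index without boxing.
    public static double SumViaCast(Array a)
    {
        var typed = (double[])a;
        double sum = 0;
        for (int i = 0; i < typed.Length; i++)
            sum += typed[i];
        return sum;
    }
}
```

All three return the same sum; only the last avoids a per-element box and unbox, which is why casting before the loop is the option that matters for performance.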

dotChris90 commented 6 years ago

At least for matrix multiplication etc. there should not be too many performance problems if we do the as cast before all the for loops lol

Will also take a look, maybe in the next days.
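
As an illustration of "cast before the loops", a matrix product over Array-typed storage could look like the sketch below. LinAlgSketch.Dot is a hypothetical helper, it assumes row-major data, and it handles only double; it is not NumSharp's implementation.

```csharp
using System;

public static class LinAlgSketch
{
    // a is (n x m), b is (m x p); both are double[] behind an Array reference.
    public static Array Dot(Array a, Array b, int n, int m, int p)
    {
        // Two "as" casts, done once, before any loop starts.
        var x = a as double[];
        var y = b as double[];
        if (x == null || y == null)
            throw new NotSupportedException("Only double is handled in this sketch.");
        if (x.Length != n * m || y.Length != m * p)
            throw new ArgumentException("Shapes do not match the buffer lengths.");

        var result = new double[n * p];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < m; k++)
            {
                double xik = x[i * m + k];
                for (int j = 0; j < p; j++)
                    result[i * p + j] += xik * y[k * p + j];
            }
        return result;   // returned as Array again; the caller casts if needed
    }
}
```

Inside the loops the code only touches plain double[] buffers, so the non-generic storage adds no per-element cost.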

Oceania2018 commented 6 years ago

Array looks good in performance. You can do the abstraction for IStorage when you are available. I just switched to Array LOL.

Oceania2018 commented 6 years ago

I've tried the new NDArray; it's awesome. In my use case everything goes well. I can write functions with a unified interface that return the same NDArray, without T1, T2, Tn. Pandas.NET will be easier to build than before, because we have a true dynamic array.

dotChris90 commented 6 years ago

That rocks. :)

I saw the tests and changes. Now I'm really satisfied. You did a great job.

I just wish C# had something like a numeric class. In Matlab (as an example) double etc. are children of numeric, so you can add an int to a double, since all implement the operations. In C# they are children of value types and do not always implement the operations. That sucks but can't be changed.

Oh, so the indexing output should be a value type and not object. Slightly different. :)

Anyway, you did an awesome job. Looks very nice.

dotChris90 commented 5 years ago

hm, I still think the generic NDArray could be good in some situations like direct indexing (e.g. double value = np1[1,2,3]). For this reason I made pull request #128. We could deliver a generic and a non-generic version, where the generic version is a child of the non-generic one. In this way we can deliver a sweet indexing version but also go on with Pandas using the non-generic version. Our users can decide for themselves what fits their projects better, generic or non-generic.

P.S. the pull request just shows a draft - not the 100% implemented class.
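
A rough sketch of the "generic child of non-generic" idea, under stated assumptions: NDArrayBaseSketch and NDArrayOfSketch<T> are hypothetical names, the layout is assumed to be row-major, and this is not the code from pull request #128.

```csharp
using System;

// Non-generic base: System.Array storage, usable where the dtype is only known at run time.
public class NDArrayBaseSketch
{
    protected Array storage;
    protected int[] shape;

    public NDArrayBaseSketch(Array storage, params int[] shape)
    {
        this.storage = storage;
        this.shape = shape;
    }

    public Type dtype { get { return storage.GetType().GetElementType(); } }

    // Row-major offset of a multi-dimensional index.
    protected int Offset(int[] indices)
    {
        if (indices.Length != shape.Length)
            throw new ArgumentException("Wrong number of indices.");
        int offset = 0;
        for (int d = 0; d < shape.Length; d++)
            offset = offset * shape[d] + indices[d];
        return offset;
    }

    // Untyped access: the element comes back boxed as object.
    public object GetValue(params int[] indices)
    {
        return storage.GetValue(Offset(indices));
    }
}

// Generic child: same data, plus a typed indexer that avoids boxing.
public class NDArrayOfSketch<T> : NDArrayBaseSketch
{
    private readonly T[] typed;

    public NDArrayOfSketch(T[] data, params int[] shape) : base(data, shape)
    {
        typed = data;
    }

    public T this[params int[] indices]
    {
        get { return typed[Offset(indices)]; }
        set { typed[Offset(indices)] = value; }
    }
}
```

With this shape, code that needs direct indexing can write double value = np1[1, 2, 3] on the generic type, while dtype-agnostic code can accept the base type.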

dotChris90 commented 5 years ago

@Oceania2018 we can close this here - right?

Oceania2018 commented 5 years ago

close now