Closed Oceania2018 closed 5 years ago
If you upload the benchmark tests and we take a look at them, we can choose.
I am neither against it nor for it. Since it is a critical decision, we need to check performance and usability very strictly.
But yeah, why not. If you think it's feasible and good for users, I am in.
The last big change of NDArray (store all data in one array) was brilliant, so we can change the generic design too.
Uploaded the code, you can run dotnet NumSharp.Benchmark.dll nparange.
I'll do more tests later.
@Oceania2018, my fear with using object is that it introduces boxing, which is notoriously slow. However, I think that if we benchmark everything, we can make a decision based on the benchmarks.
True. That's why we need to benchmark.
If we can't find a cast with good performance, we must keep things as they are now.
I'm really thinking about performance-critical operations like matrix multiplication, inverse, SVD.
We can't do these with boxing. :) Just stuff we need to investigate.
Otherwise dtype needs to return the generic type, like double.
I mean something like:
Type dtype { get { return typeof(T) ; } }
ya, I like passing in a data type and then switch/case on that type.
Yes, but operations like matrix dot and inv usually have 2 or 3 data arrays which need to be cast.
I am 100% sure switch/case works, but I am not sure whether the as keyword can cast an object array to a double array.
At the moment inv, dot, operators etc. work with 1 switch and 1 or 2 as keywords.
The result array is the switch candidate, and the 1 or 2 operands arr1 and arr2 are cast with as. But I'm not sure if the as keyword works with an object array or just with generic types.
Sorry, that might have been confusing. I'm not saying we should use object[]. I'll use int[] to store int values and double[] to store double values. That is what I mean: a dedicated array for each data type. As you can see, performance is improved and no boxing is needed.
I was confused because your class doesn't use object[] LOL. I like what you've done with dtype.
Ah sorry got it.
But unfortunately I don't see any benefit in this then.
The reason for generics is really that we don't have to implement similar classes x times. Like List of int.
If you worry about the dtype, it's simply our generic type.
So return typeof(T).
At the moment I don't understand how you want to support different numerical types. Do all NDArrays have int[], double[] etc. arrays? Maybe if you show an example of an ndarray of double and one of int, things will become clear :)
Or let me put it like this lol: if you give me an example of how to create an ndarray of double and of int, I will try to implement the operations and the dot method.
If it's easy to implement, I say let's rewrite. But these linear algebra operations are quite important in NumPy, so that's why I always want to try them first when rewriting. :)
@Oceania2018 @fdncred just some considerations.... Again.... Sorry for being so annoying lol.
If you really want to give an NDArray one array for every possible numerical type, then most of the properties are always null, since just one storage is in use.
Plus we must now specify which data types are possible. Double, float, complex, quaternion,... And in theory even chars and datetime (for a pandas index datetime is a must).
I just worry about the following.
With the suggested strategy we need a lot of switches in every method because every type has its own storage. This will end in larger code. For operations, dot, inv,... that's fine since we will always have a switch there anyway.
The generic T[] strategy has the benefit that in some situations we don't need a switch since we know it's an array. So indexing, transpose, permutation,... all operations which are independent of the dtype do not require a switch and we just code them once.
That was all :)
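To illustrate the point about dtype-independent operations, here is a minimal sketch of a transpose written once over T[]. The class name GenericOps and the row-major flat layout are assumptions made for the example; this is not NumSharp's actual code.

```csharp
using System;

// Hypothetical helper, only to illustrate the "code once" argument.
public static class GenericOps
{
    // Transposes a rows x cols matrix stored row-major in a flat array.
    // Nothing here depends on the element type, so no switch on dtype is needed.
    public static T[] Transpose<T>(T[] data, int rows, int cols)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));
        if (data.Length != rows * cols)
            throw new ArgumentException("data length must equal rows * cols");

        var result = new T[data.Length];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                result[c * rows + r] = data[r * cols + c];
        return result;
    }
}
```

The same method serves int[], double[] or any other element type. Arithmetic such as dot or inv would still need a per-type branch, because an unconstrained T has no + or * operators.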
I traditionally vote for whatever provides the best performance. Regarding multiple case/switch statements in every class, I think that is what T4 templates are for. Perhaps we should investigate their usage.
Unfortunately templates do not work with .NET Core....
LOL was also thinking about this before :)
But anyway, if one of you knows a code generator for this use case, write it here ;)
Ok seems things slowly change lol OK
Thanks for sharing the link. Will read it tomorrow
var nd2 = new NDArray<double>();
nd2 = np.arange(1000 * 8 * 8 * 8).reshape(1000, 8, 8, 8);
var nd3 = nd2.AMin(0);
var nd2 = new NDArrayWithDType(NDArrayWithDType.double8);
nd2 = nd2.arange(1000 * 8 * 8 * 8, 0, 1).reshape(1000, 8, 8, 8);
var nd3 = nd2.AMin(0);
The dtype design will make NumSharp a better fit for Pandas.NET. That's what I'm trying to do. I abstracted the NDStorage class to persist all kinds of data types, with no performance compromise.
While doing some research on Span<T> I stumbled upon this set of classes that seem somewhat similar to what we're trying to do here. It's worth a look.
OK, now I see more clearly.
The storage is a brilliant idea. I think with this strategy pandas should be easier.
The storage could also contain methods for casting or methods that manipulate the arrays, so it's easier to handle array stuff in one class.
OK nice idea.
@dotChris90 I'm glad you've got the point. So, are we going to move forward and replace the NDArray?
I am in. Yes, we will do it. When I find time I will try it out with the operation methods and dot.
Branched v0.4 and released to NuGet, will refactor on master.
@Oceania2018 also check #127. We must think about abstract interfaces. Otherwise rewriting will be very hard in the future....
@dotChris90 I'll think about that. The difficult thing is that we can't provide an interface like T[] Get<T>() or object Get() due to performance issues.
Performance is quite important, but following software patterns, good style and a clean architecture is a must. And just to warn you: NumSharp is not too big at the moment, but in 2 years? That's the reason numpy and pandas also have abstract classes / interfaces. :p
If the storage has a cast method like Get<T> which outputs a cast of the whole array, like double[], it's OK. A method call is only performance critical if you call the method in a loop. If it returns a whole array, it's OK.
I will also experiment when I find time. Just want to say: slowly we need to start thinking about software design and not just performance. The fans of NumSharp need it. :)
@Oceania2018 @fdncred hm guys, not sure if you knew this already .... just one thing I found out.
These days we talked about dtype in a way that every NDArray has multiple properties in the form of "int32[]", "int64[]", "double[]", "Complex[]" and so on. Yes, that is an option. I just don't find it the best one, since in the OOP world we should avoid unused properties. And face it: a double ndarray will not use int[], so we would have a lot of unused properties.
So what about a storage class with just one non-generic property for all? Impossible? No. System.Array would be a nice option. Look:
Array a = new double[]{1,2,3,4};
var b = a as double[]; // I was shocked that casting is so easy without boxing
Array would be one property for all possible data types: doubles, strings, complex, int, int64, ... Any array can be stored in a System.Array, independent of type. The storage class should have a method for casting that array into the correct type and returning e.g. a double[].
Moreover the storage's properties (so double[], int[] or, as suggested now, Array) should be protected and not public. We are not Python: C# uses protected etc. to show developers and users "don't touch! this can change any time!" ;)
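A minimal sketch of the storage idea described above, assuming a hypothetical class called ArrayStorage with a GetData<T> accessor. These names are made up for illustration and are not the NDStorage class in the repository.

```csharp
using System;

// Hypothetical storage class; names are illustrative only.
public class ArrayStorage
{
    // One non-generic field holds an array of any element type.
    // It is protected so callers go through the typed accessor.
    protected Array values;

    public ArrayStorage(Array values)
    {
        this.values = values ?? throw new ArgumentNullException(nameof(values));
    }

    // Element type of the underlying array, e.g. typeof(double).
    public Type DType => values.GetType().GetElementType();

    // Returns the underlying array viewed as T[] (no copy, no per-element boxing).
    // Throws if the stored array cannot be viewed as T[].
    public T[] GetData<T>()
    {
        return values as T[]
            ?? throw new InvalidCastException(
                "Storage holds " + DType + "[], not " + typeof(T) + "[]");
    }
}
```

Usage would look like new ArrayStorage(new double[] { 1, 2, 3, 4 }).GetData<double>(), which hands back the same double[] instance, so the cast happens once per call rather than once per element.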
@dotChris90 Totally agree with you. We will find the balance between elegant design and performance. Pull the code and check if you are OK with the new NumPy class. I've finished np.arange and np.reshape.
Array looks like a good option.
Yeah. Your code looks promising. Just the value property of the storage gives an object[], which means the values must be boxed and it can't be cast via the as keyword. This is the only thing on my mind now.
Side information: the reason for this is that object[] and double[] (or int[] or any others) are like sister and brother. They are both children of System.Array. But the cast via the "as" keyword (which has the best performance) is only possible between classes in one hierarchy line, so parent and child, not brothers and sisters. So System.Array can be cast, while object[] requires boxing.
BTW if you don't mind, I can do the abstract stuff next week when I'm back in Germany.
I know at the moment it's not too important for functionality, just for architecture and design.
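The sister/brother point about object[] and double[] can be checked with a small sketch. The helper names are hypothetical; the behaviour relies on the fact that C# array covariance applies only to arrays of reference types.

```csharp
using System;

// Hypothetical helper to show which "as" casts succeed on a System.Array reference.
public static class CastDemo
{
    // True when the array really is a double[] (parent to child cast).
    public static bool CanViewAsDoubleArray(Array a) => (a as double[]) != null;

    // True only for arrays of reference types, because array covariance
    // does not apply to value-type arrays such as double[] or int[].
    public static bool CanViewAsObjectArray(Array a) => (a as object[]) != null;
}
```

For Array a = new double[] { 1, 2, 3, 4 } the first check succeeds and the second fails, while a string[] passes the second check because string is a reference type.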
Array a = new double[]{1,2,3,4};
// can't access element in the way? a[0]
var b = (a as double[])[0];
If Array works, we might not need NDStorage. The Array is the storage.
Ah, be careful. We still need the storage for casting. The Array cannot be indexed. (For an alternative see SetValue and GetValue at the end.)
It sounds strange and somehow actually is strange. Indexing is only possible through the IList interface.
Array just means: homogeneous area in memory.
double[] implements IList and is a child of Array.
But in my opinion that's fine. Usually I always have to cast the array to the true data type, even with generics. Operation methods, dot, inv, nearly all algebraic operations require a cast, since not all children of Array can use +-/*....
I am not sure how many NumSharp methods work with just indexing and without +-/*.
Anyway, an alternative is SetValue and GetValue. If I remember well.
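A small sketch of the two access paths mentioned above, using hypothetical helper names. Array.GetValue and Array.SetValue are real System.Array methods; they work with object, so every double passes through a box.

```csharp
using System;

// Hypothetical helpers comparing untyped and typed access to a System.Array.
public static class ArrayAccessDemo
{
    // Untyped path: GetValue returns object, so each element is boxed and then unboxed.
    public static double SumBoxed(Array a)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
            sum += (double)a.GetValue(i);
        return sum;
    }

    // Typed path: cast once, then use the normal double[] indexer.
    public static double SumTyped(Array a)
    {
        var typed = (double[])a;
        double sum = 0;
        for (int i = 0; i < typed.Length; i++)
            sum += typed[i];
        return sum;
    }

    // Untyped write: the value is boxed before it is stored.
    public static void WriteBoxed(Array a, int index, double value) => a.SetValue(value, index);
}
```

Both sums give the same result. The difference is only in cost, which is why a benchmark is the right way to decide between them.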
I'm testing Array. If there is no performance compromise, I'll use it instead of int[] and double[], those ugly things.
And one more point.
Since the Array class is quite abstract, the return value of GetValue is object.
At least for matrix multiplication etc. there should not be too many performance problems if we use the as keyword cast before all the for loops lol
Will also take a look, maybe in the next days.
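The "cast before all for loops" idea could look roughly like this. It is a sketch under assumptions: the class DotSketch, its signature, and the row-major flat layout are invented for the example, and only the double branch is shown.

```csharp
using System;

// Hypothetical sketch: cast the operands once, then run typed loops.
public static class DotSketch
{
    // Multiplies an n x m matrix by an m x p matrix, both stored row-major in flat arrays.
    public static Array Dot(Array left, Array right, int n, int m, int p)
    {
        // The "as" casts happen once, outside the loops.
        var a = left as double[];
        var b = right as double[];
        if (a != null && b != null)
        {
            if (a.Length != n * m || b.Length != m * p)
                throw new ArgumentException("array lengths do not match the given shapes");

            var result = new double[n * p];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < p; j++)
                {
                    double sum = 0;
                    for (int k = 0; k < m; k++)
                        sum += a[i * m + k] * b[k * p + j];
                    result[i * p + j] = sum;
                }
            return result;
        }

        // Further element types (int, float, Complex, ...) would each need their own branch.
        throw new NotSupportedException("only double[] operands are handled in this sketch");
    }
}
```

Inside the loops only typed double[] indexing is used, so there is no boxing per element.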
Array looks good in performance. You can do the abstraction for IStorage when you are available. I just switched to Array LOL.
I've tried the new NDArray, it's very awesome. In my use case, everything goes well. I can write functions with a unified interface which return the same NDArray, without T1, T2, Tn. Pandas.NET will be easier to build than before, because we have a true dynamic array.
That rocks. :)
I saw the tests and changes. Now I'm really satisfied. You did a great job.
I just wish C# had something like a numeric class. In Matlab (as an example) double etc. are children of numeric. So you can add an int to a double etc. since all of them implement the operations. In C# they are children of ValueType and do not always implement the operations. That sucks but can't be changed.
Oh, so the indexing output should be a value type and not object. Slightly different. :)
Anyway you did an awesome job. Looks very nice.
Hm, I still think the generic NDArray could be good in some situations like direct indexing (e.g. double value = np1[1,2,3]). For this reason I made pull request #128. We could deliver a generic and a non-generic version, where the generic version is a child of the non-generic one. In this way we can deliver a sweet indexing version but also go on with Pandas using the non-generic version. Our users can decide for themselves what fits their projects better, generic or non-generic.
P.S. the pull request just shows a draft - not the 100% implemented class.
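As a rough idea of how a generic child of a non-generic array might look, here is a sketch with hypothetical class names (UntypedNDArray, TypedNDArray<T>). It is not the draft from pull request #128, only an illustration of the inheritance idea with row-major indexing assumed.

```csharp
using System;

// Hypothetical non-generic base: stores any element type in a System.Array.
public class UntypedNDArray
{
    protected Array storage;
    protected int[] shape;

    public UntypedNDArray(Array storage, params int[] shape)
    {
        int size = 1;
        foreach (int dim in shape) size *= dim;
        if (storage.Length != size)
            throw new ArgumentException("shape does not match storage length");
        this.storage = storage;
        this.shape = shape;
    }

    public Type DType => storage.GetType().GetElementType();

    // Row-major flat offset for a full set of indices.
    protected int Offset(int[] indices)
    {
        if (indices.Length != shape.Length)
            throw new ArgumentException("wrong number of indices");
        int offset = 0;
        for (int d = 0; d < shape.Length; d++)
        {
            if (indices[d] < 0 || indices[d] >= shape[d])
                throw new IndexOutOfRangeException();
            offset = offset * shape[d] + indices[d];
        }
        return offset;
    }

    // Untyped element access: the result is boxed.
    public object GetValue(params int[] indices) => storage.GetValue(Offset(indices));
}

// Hypothetical generic child: adds a typed indexer on top of the same storage.
public class TypedNDArray<T> : UntypedNDArray
{
    private readonly T[] data;

    public TypedNDArray(T[] data, params int[] shape) : base(data, shape)
    {
        this.data = data;
    }

    public T this[params int[] indices] => data[Offset(indices)];
}
```

With this shape, double value = nd[1, 2, 3] works on a TypedNDArray<double>, and the same object can still be passed around as an UntypedNDArray where the element type is not known at compile time.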
@Oceania2018 we can close this here - right?
close now
We have to refactor the core NDArray class in order to support numpy.dtype. The key change will be the internal array storage. I'm going to retire the generic type design after v0.4 is released. Separating the internal storage will improve performance by about 25% and make NDArray a true NumPy in .NET. I've run some experiments and it seems feasible. The benchmark also looks improved. @dotChris90 @fdncred @AoaAquarius What do you think of it? Comment please.