DataFrame performance is relatively slow and can be improved.
Because this is a complex issue, it makes sense to split it into several independent steps. This Epic is a container for the related changes, keeping them accessible from one place. Here is the list of proposed changes:
Improve Performance of DataFrame Arithmetic Operations
[x] Improve the speed of binary arithmetic and comparison operations on columns with the same underlying data type.
This can be achieved by improving the PrimitiveDataFrameColumn.Clone method to use memory block copying, and by avoiding the CloneAs method, which involves type conversion, for columns with the same data type.
PR: https://github.com/dotnet/machinelearning/pull/6814
PR: https://github.com/dotnet/machinelearning/pull/6869
[ ] Reduce the number of copies in binary operations for columns with different underlying data types (for example, Int32DataFrameColumn + Int16DataFrameColumn).
We make copies of columns in the binary operation APIs mostly to reuse existing code. This is a well-known issue. There are already tasks for eliminating excessive copying and for the behavior of binary operations when types mismatch.
Issue: https://github.com/dotnet/machinelearning/issues/5663
Issue: https://github.com/dotnet/machinelearning/issues/5665
[x] Increase the speed of PrimitiveDataFrameColumn initialization by fixing the AppendMany(value, count) method, which is used in all PrimitiveDataFrameColumn constructors.
PR: https://github.com/dotnet/machinelearning/pull/6822
[x] Improve Nullable support during arithmetic operations
Issue: https://github.com/dotnet/machinelearning/issues/6825
[ ] Consider how to implement Nullable support in element-wise operations without any decrease in performance
Issue: https://github.com/dotnet/machinelearning/issues/6820
[ ] Use SIMD vectorization
Issue: https://github.com/dotnet/machinelearning/issues/5695
[x] Add performance benchmarks
Issue: https://github.com/dotnet/machinelearning/issues/6826
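
Two of the items above, block-copy cloning for same-type columns and SIMD vectorization, can be illustrated on raw buffers. The following is only a minimal sketch under stated assumptions: `ColumnBufferSketch`, `CloneBlockCopy`, and `AddInPlace` are hypothetical names that do not exist in Microsoft.Data.Analysis, the buffers are plain `int` spans rather than DataFrame columns, and null handling is ignored entirely.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, not part of Microsoft.Data.Analysis.
public static class ColumnBufferSketch
{
    // Same-type clone: one bulk copy of the whole block instead of a
    // per-element loop that converts each value.
    public static int[] CloneBlockCopy(ReadOnlySpan<int> source)
    {
        var copy = new int[source.Length];
        source.CopyTo(copy);
        return copy;
    }

    // Element-wise addition into 'left' using System.Numerics.Vector<T>,
    // followed by a scalar loop for the elements that do not fill a vector.
    public static void AddInPlace(Span<int> left, ReadOnlySpan<int> right)
    {
        if (left.Length != right.Length)
            throw new ArgumentException("Buffers must have the same length.");

        int width = Vector<int>.Count;
        int i = 0;
        for (; i <= left.Length - width; i += width)
        {
            var a = new Vector<int>(left.Slice(i, width));
            var b = new Vector<int>(right.Slice(i, width));
            (a + b).CopyTo(left.Slice(i, width));
        }

        // Scalar tail.
        for (; i < left.Length; i++)
            left[i] += right[i];
    }
}
```

A binary operation on two same-type buffers could then clone the left operand once with `CloneBlockCopy` and apply `AddInPlace` to the clone, which avoids any per-element type conversion. Whether the actual column implementation can be structured this way depends on how it stores validity information.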
Improve Performance of Filtering
[ ] Faster way to Filter
Issue: https://github.com/dotnet/machinelearning/issues/6164
Improve Performance of Indexing
[ ] Accessing PrimitiveDataFrameColumn elements by index involves converting Memory to Span on each operation, which is very slow. We can consider using unmanaged memory in DataFrameBuffer instead. This would also solve the issues with converting to/from Apache Arrow and the heavy load on the GC.
Issue: https://github.com/dotnet/machinelearning/issues/5966
Issue: https://github.com/dotnet/machinelearning/issues/6715
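
To make the indexing cost concrete, the sketch below contrasts reading `Memory.Span` on every element access with obtaining the span once and indexing into it. It is not the unmanaged-memory design proposed above; it only shows where the per-access conversion occurs. `IndexingSketch` and both method names are hypothetical and are not part of the DataFrame API.

```csharp
using System;

// Hypothetical helper, not part of Microsoft.Data.Analysis.
public static class IndexingSketch
{
    // Converts Memory to Span again on every element access.
    public static long SumViaMemory(ReadOnlyMemory<int> buffer)
    {
        long sum = 0;
        for (int i = 0; i < buffer.Length; i++)
            sum += buffer.Span[i];
        return sum;
    }

    // Converts Memory to Span once, then indexes the span directly.
    public static long SumViaSpan(ReadOnlyMemory<int> buffer)
    {
        ReadOnlySpan<int> span = buffer.Span;
        long sum = 0;
        for (int i = 0; i < span.Length; i++)
            sum += span[i];
        return sum;
    }
}
```

Both methods return the same result; the difference is only in how often the `Span` property is evaluated. A single-element indexer cannot hoist the conversion in this way, which is presumably why the item looks at changing the underlying storage instead.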