Closed: hiptmair closed this issue 5 years ago.
I think adding such a method has the potential to become a premature optimization. I believe there will be many other bottlenecks in our code that are more important to optimize than this heap-allocation issue. The issue is also quite a bit smaller here than in Issue #5, because here it is one allocation per method call and not one allocation per evaluation point.
If we find out later that we lose a lot of time waiting for heap allocation, we should of course come back and reconsider this. But I think ngsolve's approach of passing an additional allocator would be a much better fix for this problem.
The runtime test in lf.experiments.efficiency.runtime_test clearly shows that returning and subsequently accessing arrays through a RandomAccessRange incurs substantial overhead.
efficiency $ ./lf.experiments.efficiency.runtime_test
Runtime test for range access
I. Access through RandomAccessRange
55.645970s wall, 55.450000s user + 0.070000s system = 55.520000s CPU (99.8%)
II. Returning std::vector's
2.659267s wall, 2.650000s user + 0.000000s system = 2.650000s CPU (99.7%)
III. Returning result through reference
1.325443s wall, 1.320000s user + 0.000000s system = 1.320000s CPU (99.6%)
The code was compiled in 'release mode'.
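For reference, variants II and III correspond roughly to the following two call patterns (a condensed illustration, not the actual source of lf.experiments.efficiency.runtime_test; sizes and iteration counts are invented):

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// II: the callee returns a freshly heap-allocated std::vector on every call.
std::vector<double> ComputeByValue(std::size_t n) {
  std::vector<double> result(n);
  for (std::size_t i = 0; i < n; ++i) result[i] = static_cast<double>(i);
  return result;
}

// III: the caller owns the storage; the callee only writes into it.
void ComputeIntoReference(std::size_t n, std::vector<double>& result) {
  result.resize(n);  // a no-op after the first call if the size stays fixed
  for (std::size_t i = 0; i < n; ++i) result[i] = static_cast<double>(i);
}

int main() {
  constexpr std::size_t kRuns = 1000000, kSize = 64;
  double sink = 0.0;

  const auto t0 = std::chrono::steady_clock::now();
  for (std::size_t r = 0; r < kRuns; ++r) sink += ComputeByValue(kSize)[0];
  const auto t1 = std::chrono::steady_clock::now();

  std::vector<double> buffer;  // allocated once, reused in every iteration
  for (std::size_t r = 0; r < kRuns; ++r) {
    ComputeIntoReference(kSize, buffer);
    sink += buffer[0];
  }
  const auto t2 = std::chrono::steady_clock::now();

  std::cout << "II : " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
            << "III: " << std::chrono::duration<double>(t2 - t1).count() << " s\n"
            << "(sink = " << sink << ")\n";
}
```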
In light of this runtime overhead it is again worth considering returning vectors/matrices through references in the case of the methods Global(), Jacobian(), JacobianInverseGramian(), and IntegrationElement() of the Geometry interface.
I've also run the efficiency tests, and on my machine the difference between II and III was entirely due to the fact that in II the vector is allocated on every call, whereas in III it is allocated only once. If you move the allocation of the std::vector in III into the loop, you will see that II is at least as fast as III (with gcc even faster). So the difference is completely explained by the calls to malloc. If we want to avoid these calls, a "LocalHeap" object (cf. ngsolve) would be a much nicer solution than returning matrices through mutable references. Overall, I believe that the runtime of the assembly procedure will be very small compared to the solution phase (for sufficiently large systems, assuming that the assembly is also parallelized).
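To make the "LocalHeap" remark concrete, here is a minimal sketch of such an arena-style allocator and of a method variant that uses it; the names and signatures are illustrative only and do not match the actual ngsolve or LehrFEM++ APIs:

```cpp
#include <cstddef>
#include <vector>
#include <Eigen/Core>

// A minimal bump allocator: memory is carved out of one preallocated buffer,
// so repeated calls inside an assembly loop never hit malloc.
class LocalHeap {
 public:
  explicit LocalHeap(std::size_t n_doubles) : buffer_(n_doubles), top_(0) {}
  double* Alloc(std::size_t n) {
    double* p = buffer_.data() + top_;
    top_ += n;  // no bounds checking in this sketch
    return p;
  }
  void Reset() { top_ = 0; }  // recycle the whole buffer for the next cell
 private:
  std::vector<double> buffer_;
  std::size_t top_;
};

// A hypothetical Global() variant whose result lives in the scratch buffer.
Eigen::Map<Eigen::MatrixXd> Global(const Eigen::MatrixXd& local, LocalHeap& lh) {
  const Eigen::Index dim_global = 3;  // world dimension, fixed for illustration
  Eigen::Map<Eigen::MatrixXd> global(lh.Alloc(dim_global * local.cols()),
                                     dim_global, local.cols());
  global.setZero();  // stands in for the actual mapping of the points
  return global;
}

// Usage inside a loop over cells: call lh.Reset() at the top of each iteration
// and all temporaries of that iteration are reclaimed without any deallocation.
```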
Allocation outside the loop is relevant for FE computations: imagine local assembly with a fixed quadrature rule for every cell. Then the return values of the methods Global(), Jacobian(), etc. can be stored in the same vector for all the cells. I am going to devise a more elaborate test of efficiency involving a real assembly.
Yeah, but what you should not forget is that for more complicated geometry mappings (second order, NURBS) things are simplified a lot if temporary matrices can be allocated inside the Global() or Jacobian() methods. If you want to optimize for speed, you also want to allocate those temporary matrices on the stack. Passing mutable references into Global() and Jacobian() cannot help with that.
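For illustration, a second-order (P2) triangle mapping might look as follows; the intermediate gradient table is a fixed-size Eigen matrix, which Eigen keeps on the stack, so the method needs no externally supplied scratch storage (hypothetical code, not taken from LehrFEM++):

```cpp
#include <Eigen/Core>

// Jacobian of an isoparametric quadratic (P2) mapping of the reference triangle,
// evaluated at reference point xi; node_coords holds the 6 node positions.
Eigen::Matrix2d JacobianQuadratic(const Eigen::Matrix<double, 2, 6>& node_coords,
                                  const Eigen::Vector2d& xi) {
  const double l0 = 1.0 - xi(0) - xi(1), l1 = xi(0), l2 = xi(1);
  // Gradients of the six P2 reference shape functions: a fixed-size temporary
  // that lives on the stack, so no heap allocation happens inside the method.
  Eigen::Matrix<double, 6, 2> grad;
  grad << -(4 * l0 - 1), -(4 * l0 - 1),
           4 * l1 - 1,    0.0,
           0.0,           4 * l2 - 1,
           4 * (l0 - l1), -4 * l1,
           4 * l2,         4 * l1,
          -4 * l2,         4 * (l0 - l2);
  return node_coords * grad;  // 2x6 times 6x2, again a stack-allocated result
}
```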
Of course, for more complicated parametric entities, the cost of local computations might finally render the cost of allocation negligible. However, returning dynamic Eigen::Matrix types is now firmly established in LehrFEM++.
For the sake of efficiency it might be a good idea to provide alternative versions of some methods of the Geometry interface class, for instance:
void Global(const Eigen::MatrixXd& local, Eigen::MatrixXd& global) const;
Here, global is supposed to be a matrix of the right size for receiving the result of the function. This version of the method (and of some other methods) can avoid allocation on the heap: the allocation can be done only once in the calling code, outside the inner loops.
In fact, the implementation of
virtual Eigen::MatrixXd Global(const Eigen::MatrixXd& local) const
could be done at the level of geometry_interface already, simply calling the above method after defining a suitably sized matrix.
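A minimal sketch of that suggestion, with the value-returning overload implemented once in the interface class on top of the proposed reference-taking variant (the class skeleton and the DimGlobal() query are simplified placeholders for the actual interface):

```cpp
#include <Eigen/Core>

class Geometry {
 public:
  // Proposed allocation-free variant: 'global' must already have the right size.
  virtual void Global(const Eigen::MatrixXd& local, Eigen::MatrixXd& global) const = 0;

  // The existing value-returning variant can then be provided once, here in the
  // interface class, by allocating a suitable matrix and delegating.
  virtual Eigen::MatrixXd Global(const Eigen::MatrixXd& local) const {
    Eigen::MatrixXd global(DimGlobal(), local.cols());
    Global(local, global);
    return global;
  }

  virtual Eigen::Index DimGlobal() const = 0;  // dimension of the image space
  virtual ~Geometry() = default;
};
```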