ARM-software / CMSIS-DSP

CMSIS-DSP embedded compute library for Cortex-M and Cortex-A
https://arm-software.github.io/CMSIS-DSP
Apache License 2.0
454 stars 122 forks source link

There are several functions in the FilteringFunctions that should not have DSP extended macros. #193

Open JoleenSun opened 2 days ago

JoleenSun commented 2 days ago

P extension instruction does not work with floating point types. But function arm_conv_f32/arm_conv_partial_f32/arm_correlate_f32/arm_correlate_f16 has "#if defined(ARM_MATH_DSP)" ,which impacts on the performance when testing without any extension but the "ARM_MATH_LOOPUNROLL" is defined.

christophe0606 commented 2 days ago

@JoleenSun I agree that the test with ARM_MATH_DSP is not useful here and could be removed (What is tested in fact here is that we want a function that may be unrolled or not).

So the problem is : you'd like to have a function that is not impacted by ARM_MATH_LOOP_UNROLL. Because if I just disable the ARM_MATH_DSP cases when ARM_MATH_LOOP_UNROLL is defined, then it selects the C version of the function where the loop is never unrolled (whatever the value of ARM_MATH_LOOP_UNROLL).

Unfortunately, the effect of ARM_MATH_LOOP_UNROLL and the manual unrolling is highly compiler dependent. Most of the times it improves the performances but there are some cases where it does not.

And relying on the compiler to do the automatic unrolling has shown (in our tests) that most of the case it is even less predictable that doing the manual unrolling.

So, we will always have cases where enabling ARM_MATH_LOOP_UNROLL may decrease the performance of a specific function on a specific version of a compiler.

If it is a case where it always decrease the performance (whatever the compiler version) then the manually unrolled version should be removed. But it would require tests with several compiler versions to be really sure.

JoleenSun commented 2 days ago

@christophe0606 Thanks for your reply.

I think I get what you mean. But I have another question: why can't we remove the "ARM_MATH_DSP "macro but keep "ARM_MATH_LOOP_UNROLL", so that the C version have both a version with loop_unrolling and a version without.

In this case, maybe we only need to test the two C versions both without loop unrolling: based on the current fuction, a version is define the "ARM_MATH_DSP " but disable the "ARM_MATH_LOOP_UNROLL", the other is just disable the "ARM_MATH_DSP "(also disable RISCV_MATH_VECTOR).

christophe0606 commented 2 days ago

@JoleenSun I can remove the ARM_MATH_DSP in this case (and I'll tag this issue as enhancement so that this change is done at some point).

Now the problem is that `ARM_MATH_LOOP_UNROLL" applies to several functions. So if you unset it because it gives better performance for this function perhaps it will give worse performance for another function you use.

One should have a better granularity and be able to set this for each function. But I have not yet found a developer friendly way to do it.

JoleenSun commented 2 days ago

Thanks! Well, maybe I didn't express myself clearly, sorry for my poor English.

I just mean that the "ARM_MATH_DSP " maybe can be removed for functions with float-type(in which case "ARM_MATH_LOOP_UNROLL" also can be reserved), not desiring the granularity of "ARM_MATH_LOOP_UNROLL" needs to be small enough as can be set for each function separately.

And all I said is just related to functions arm_conv_f32/arm_conv_partial_f32/arm_correlate_f32/arm_correlate_f16.

christophe0606 commented 2 days ago

@JoleenSun I think I understand the issue. But i think it is not just restricted to arm_conv_f32/arm_conv_partial_f32/arm_correlate_f32/arm_correlate_f16 since the ARM_MATH_LOOP_UNROLL is everywhere.

But for the float function, I totally agree than ARM_MATH_DSP is useless and will have to be removed.

Anyway, your feedback is taken into account and I have tagged this issue as enhancement.

JoleenSun commented 2 days ago

Haha, thank you so much.

Actually, the point is that the “ARM_MATH_DSP ” but not "ARM_MATH_LOOP_UNROLL ", as long as the use of "ARM_MATH_LOOP_UNROLL "is not bound to the "ARM_MATHDSP ", and can be set independently, the current adaptation and support are already perfect(personal opinion ^^ ).