Which issue are you addressing?

Closes #108

How have you addressed the issue?

The approach previously used to optimize collapse(2) and collapse(3) has been extended to collapse(4) and higher.
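
For readers unfamiliar with collapsed-loop index calculations, here is a minimal sketch of one common way to make them cheaper. It is an assumption about the general technique, not a description of the code in this patch: instead of recovering every loop index from the flattened iteration number with division and modulo on each iteration, the flattened start of a chunk is decomposed once and the indices are then advanced like an odometer. The function names are hypothetical, and the sketch assumes zero-based loops with unit stride.

```python
def indices_divmod(flat, extents):
    """Recover (i0, ..., ik) from a flattened index using div/mod.

    `extents` lists the loop trip counts from outermost to innermost.
    """
    idx = []
    for extent in reversed(extents):
        flat, r = divmod(flat, extent)  # innermost index is the remainder
        idx.append(r)
    return tuple(reversed(idx))


def indices_incremental(start, end, extents):
    """Yield index tuples for the flattened indices in [start, end).

    Decomposes `start` once, then increments the innermost index and
    carries into the outer ones, so no div/mod is needed per iteration.
    """
    if start >= end:
        return
    idx = list(indices_divmod(start, extents))
    for _ in range(start, end):
        yield tuple(idx)
        for d in range(len(extents) - 1, -1, -1):
            idx[d] += 1
            if idx[d] < extents[d]:
                break       # no carry needed
            idx[d] = 0      # wrap this index and carry outward
```

Both functions produce the same index tuples for any chunk; the incremental version only pays the div/mod cost once per chunk, which is what matters for a kernel with a very small body.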

How have you tested your patch?

All unit tests pass, and performance improves substantially (results below). Benchmarks were run with BenchmarkDotNet on an NxNxNxN parallel kernel with an empty body, for N = 20, 40, 80, and 160. Static scheduling was used with the default chunk size.

Performance before the optimization:

Method	len	Mean	Error	StdDev	Median	Completed Work Items	Lock Contentions
Bench	20	3.398 ms	0.0177 ms	0.0166 ms	3.404 ms	-	8.4023
Bench	40	33.510 ms	0.3868 ms	0.3618 ms	33.577 ms	-	6.0625
Bench	80	502.138 ms	19.1668 ms	56.5139 ms	517.581 ms	-	4.0000
Bench	160	8,128.888 ms	328.1286 ms	967.4945 ms	8,267.153 ms	-	9.0000

Performance after the optimization:

Method	len	Mean	Error	StdDev	Completed Work Items	Lock Contentions
Bench	20	2.032 ms	0.0396 ms	0.0639 ms	-	6.2188
Bench	40	16.386 ms	0.3221 ms	0.4619 ms	-	6.5938
Bench	80	232.711 ms	4.5844 ms	4.9053 ms	-	8.0000
Bench	160	3,508.325 ms	11.5801 ms	9.6699 ms	-	9.0000

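The mean run time improves by about 1.7x at N = 20 and by more than 2x for N = 40 and above (about 2.3x at N = 160). For the larger sizes (N = 80 and 160), the error and standard deviation are also down significantly, making performance more predictable.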
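
The per-size speedup can be checked directly from the mean timings in the two tables above. A small sketch of that arithmetic (the variable names are only illustrative):

```python
# Mean timings in ms, copied from the two benchmark tables above.
before = {20: 3.398, 40: 33.510, 80: 502.138, 160: 8128.888}
after = {20: 2.032, 40: 16.386, 80: 232.711, 160: 3508.325}

# Speedup is the ratio of the old mean to the new mean for each N.
speedup = {n: before[n] / after[n] for n in before}

for n, s in speedup.items():
    print(f"N={n}: {s:.2f}x")
```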

computablee / DotMP

Implement optimized index calculations for collapse(4) and higher #114
