Simplify TensorDescriptor::GetElementSpace()

This is a small cleanup of needless overcomplication in TensorDescriptor::GetElementSpace(). It gets rid of two extra allocations and initializations and computes everything in a single pass.
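For readers who don't have the diff open, here is a minimal sketch of the idea, not the actual MIOpen code. It assumes the element space is 1 + sum((len[i] - 1) * stride[i]) and ignores vectorized layouts; the free functions `element_space_two_temps` and `element_space_single_pass` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the "before" shape: two temporary vectors are
// allocated and filled just to subtract 1 from every length.
std::size_t element_space_two_temps(const std::vector<std::size_t>& lens,
                                    const std::vector<std::size_t>& strides)
{
    std::vector<std::size_t> ones(lens.size(), 1);     // allocation #1
    std::vector<std::size_t> max_indices(lens.size()); // allocation #2
    std::transform(lens.begin(), lens.end(), ones.begin(), max_indices.begin(),
                   std::minus<std::size_t>());
    return std::inner_product(max_indices.begin(), max_indices.end(),
                              strides.begin(), std::size_t{0}) + 1;
}

// Hypothetical "after" shape: a single pass, no temporaries.
// Assumes every length is at least 1 and lens.size() == strides.size().
std::size_t element_space_single_pass(const std::vector<std::size_t>& lens,
                                      const std::vector<std::size_t>& strides)
{
    return std::inner_product(
        lens.begin(), lens.end(), strides.begin(), std::size_t{1},
        std::plus<std::size_t>(),
        [](std::size_t len, std::size_t stride) { return (len - 1) * stride; });
}
```

Both functions return the same value for the same input; the second one simply folds the "minus one" step into the accumulation, which is where the saved allocations come from.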

This is probably an insignificant performance improvement for the library overall, but the function itself became ~1.6 times faster:

- Number of tests: 134217728 (1d-5d cases)
- New function average time (ns): 20.4021
- Old function average time (ns): 32.94695
- Gain (times): 1.61488

In terms of dynamically executed instructions, the gap is even larger: ~35.2 per call for the new function vs ~431.2 per call for the old one (including the subsequent malloc/free).

I'll delete the test once CI passes; there is not much point in keeping a test that checks the function against the previous implementation.

ROCm / MIOpen

Simplify TensorDescriptor::GetElementSpace() #3380