Closed t-b closed 1 year ago
I cannot recreate this issue because the data in TUF_XOP is lost and not stored alongside the project file (for the purpose of debugging it could be better if we transfer the TUF_XOP waves as global waves after the test run).
I will look into the algorithm. Maybe I can find something.
I looked into the implementation and found some possible critical hot paths:
GetMaxFuncCount
with a time approximation of O(proc + proc * lbl)
CollectFunctions
with a time approximation of O(proc * func + proc * lines + proc * lbl)
SearchHighestWithMeta
with a time approximation of O(cnt * proc * func)
I used the following abbreviations:
proc
: Number of instrumented procedure fileslines
: The maximum number of lines of an instrumented procedure filefunc
: The maximum number of functions of an instrumented procedure filelbl
: The constant cost of resolving a wave dimension label. This is added when used inside a loop.cnt
: The number of results that should be returned.GetMaxFuncCount
does resolve a wave dimension label in a loop which can be optimized and the new time approximation would be O(proc)
.
CollectFunctions
is a bit tricky. I can't see any optimizations here.
SearchHighestWithMeta
sorts a wave by iteratively searching for the largest element and removing them from the wave. I think this can be improved if we use a sorting function in the first place and then take the first cnt
elements from it.
Using a sort function in the first place for SearchHighestWithMeta
would change the time complexity to O(log(proc * func) * proc * func + cnt)
. This would only be an improvement if cnt
is very large or proc
and func
are small. We need a way to eliminate unnecessary data pretty fast.
For the example experiment is proc
87 and func
317. This would mean SearchHighestWithMeta
must find the 50 largest items from a wave of 27,579 entries.
One idea of a filter would be the following algorithm:
proc
the largest value of any of its func
.proc
s according to their largest value.cnt
proc
s. Throw away the remaining. This would reduce the data to 50 proc
times 317 func
.proc
s. This is now our pivot element.proc
s and throw away all func
values that are smaller than the pivot element. This would reduce the data to 50 proc
times 50 func
.After filtering we can do a quick sort with a time approximation of O(cnt * log(cnt))
, which is faster than iteratively selecting the largest element (time approximation is O(cnt²)
). And then select the top cnt
elements.
Some steps can be done in parallel (using MultiThread
) and skipped if proc
(or func
) is smaller than cnt
.
The resulting time approximation would be O(proc * func + cnt² * log(cnt²))
.
Execution time is really close to a minute:
•print secs2time(datetime, 3, 3); ShowTopFunctions(50, mode = UTF_ANALYTICS_FUNCTIONS); print secs2time(datetime, 3, 3)
21:46:53.697
[...]
21:47:52.981
Edited the original comment with updated pxp.
I have done some function profiling and the results are ... interesting ...
Total time: 37.2419, Time in Function code: 37.2419 (100%)
Top function percentages:
Function igortest-tracing-analytics.ipf GetTotals: 100%
Annotated Top Functions:
*******************************************************************************************
Function: igortest-tracing-analytics.ipf igortest-tracing-analytics.ipf#GetTotals; Percent total 100%
*******************************************************************************************
[00] |static Function/WAVE GetTotals()
[00] | variable i, numWaves
[00] |
[00] | TUFXOP_GetStorage/N="IUTF_TestRun" storage
[00]* | numWaves = DimSize(storage, UTF_ROW)
[00] | WAVE/ZZ totals
[00]* | for(i = 0; i < numWaves; i++)
[00]* | WAVE/WAVE/Z entryOuter = storage[i]
[00]* | if(!WaveExists(entryOuter))
[00] | continue
[00] | endif
[00]* | WAVE entry = entryOuter[0]
[00]* | if(WaveExists(totals))
[100]********** | totals += entry
[00] | else
[00]* | Duplicate/FREE=1 entry, totals
[00] | endif
[00] | endfor
[00] |
[00] | if(!WaveExists(totals))
[00] | Make/N=0/FREE=1 totals
[00] | endif
[00] |
[00] | return totals
[00] |End
The storage wave "IUTF_TestRun" contains 449 rows out of which are 426 are not null. Each row points to another wave with 1000 x 3 x 87 entries. They are about 117,189,000 values to add...
If I remove the line totals += entry
, the function ShowTopFunctions
is finished in 0.0345498 seconds! We need to improve the performance of GetTotals
.
(I miss interpreted the GetTotals
function completely and just assigned O(lines)
to it. But the data is several times larger than that...)
I found a possible solution. We can copy all the data of the wave reference wave into a normal 4D (1000 x 3 x 87 x 449 entries) wave in a multithreaded operation. After that we need to call SumDimension/D=3
and have our totals.
@Garados007 Go for it!
I could reduce the time to 9.147 seconds but I am not really happy about it. Here is the function profiling result:
Total time: 9.14777, Time in Function code: 8.72328 (95.4%)
Top function percentages:
Function igortest-tracing-analytics.ipf GetTotals2: 100%
Annotated Top Functions:
*******************************************************************************************
Function: igortest-tracing-analytics.ipf igortest-tracing-analytics.ipf#GetTotals2; Percent total 100%
*******************************************************************************************
[00] |static Function/WAVE GetTotals2()
[00] | variable i, numWaves
[00] | variable bufX, bufY, bufZ
[00] |
[00]* | TUFXOP_GetStorage/N="IUTF_TestRun" storage
[00] | numWaves = DimSize(storage, UTF_ROW)
[00] |
[00] | for(i = 0; i < numWaves; i++)
[00] | WAVE/WAVE/Z entryOuter = storage[i]
[00] | if(!WaveExists(entryOuter))
[00] | continue
[00] | endif
[00]* | WAVE entry = entryOuter[0]
[00] |
[00] | bufX = DimSize(entry, UTF_ROW)
[00] | bufY = DimSize(entry, UTF_COLUMN)
[00] | bufZ = DimSize(entry, UTF_LAYER)
[00] |
[00] | break
[00] | endfor
[00] |
[10]* | Make/FREE=1/N=(bufX, bufY, bufZ, numWaves) buffer
[66]******* | MultiThread storage[] = CopyTotalsToBuffer(p, storage, buffer)
[00] |
[24]** | SumDimension/D=(UTF_CHUNK)/DEST=totals/Y=(-1) buffer
[00] |
[00]* | return totals
[00] |End
And the implementation of CopyTotalsToBuffer
:
threadsafe static Function/WAVE CopyTotalsToBuffer(i, storage, buffer)
variable i
Wave/WAVE storage
Wave buffer
WAVE/WAVE/Z entryOuter = storage[i]
if(!WaveExists(entryOuter))
return entryOuter
endif
WAVE entry = entryOuter[0]
buffer[][][][i] = entry[p][q][r]
return entryOuter
End
@t-b @MichaelHuth Do you have an idea to copy the entries to the buffer even faster than the above method? I had tried Concatenate { waveRefWave }, buffer
but the results are really bad (but slightly better than the original implementation).
There is a reason I am not really happy about the result. SumDimension
has to access all values from the buffer at least once and create a new wave with all the data. Copying the original data to the buffer 1:1 does take 3-times the time than SumDimension
needs and I do it multi-threaded!
Why not just use
FastOp totals += entry
in the orignal code? This makes it run in 4s here.
ShowTopFunctions(50, mode = UTF_ANALYTICS_FUNCTIONS)
takes around a minute with the attached experiment.Steps: