ShowTopFunctions is too slow

t-b commented 1 year ago

ShowTopFunctions(50, mode = UTF_ANALYTICS_FUNCTIONS) takes around a minute with the attached experiment.

Steps:

Download: 2023_08_09_211208.zip
Open in IP9
Load waves into TUF via

IgorToTUFStorage(IUTF_TestRun, "IUTF_TestRun")
IgorToTUFStorage(IUTF_Traced_ProcedureNames, "IUTF_Traced_ProcedureNames")

Now ShowTopFunctions should work again

Garados007 commented 1 year ago

I cannot recreate this issue because the data in TUF_XOP is lost and not stored alongside the project file (for the purpose of debugging it could be better if we transfer the TUF_XOP waves as global waves after the test run).

I will look into the algorithm. Maybe I can find something.

Garados007 commented 1 year ago

I looked into the implementation and found some possible critical hot paths:

GetMaxFuncCount with a time approximation of O(proc + proc * lbl)
CollectFunctions with a time approximation of O(proc * func + proc * lines + proc * lbl)
SearchHighestWithMeta with a time approximation of O(cnt * proc * func)

I used the following abbreviations:

proc: Number of instrumented procedure files
lines: The maximum number of lines of an instrumented procedure file
func: The maximum number of functions of an instrumented procedure file
lbl: The constant cost of resolving a wave dimension label. This is added when used inside a loop.
cnt: The number of results that should be returned.

GetMaxFuncCount does resolve a wave dimension label in a loop which can be optimized and the new time approximation would be O(proc).

CollectFunctions is a bit tricky. I can't see any optimizations here.

SearchHighestWithMeta sorts a wave by iteratively searching for the largest element and removing them from the wave. I think this can be improved if we use a sorting function in the first place and then take the first cnt elements from it.

Garados007 commented 1 year ago

Using a sort function in the first place for SearchHighestWithMeta would change the time complexity to O(log(proc * func) * proc * func + cnt). This would only be an improvement if cnt is very large or proc and func are small. We need a way to eliminate unnecessary data pretty fast.

For the example experiment is proc 87 and func 317. This would mean SearchHighestWithMeta must find the 50 largest items from a wave of 27,579 entries.

Garados007 commented 1 year ago

One idea of a filter would be the following algorithm:

Search for each proc the largest value of any of its func.
Sort the procs according to their largest value.
Select only the top cnt procs. Throw away the remaining. This would reduce the data to 50 proc times 317 func.
Pick the smallest highest number of all procs. This is now our pivot element.
Iterate through all procs and throw away all func values that are smaller than the pivot element. This would reduce the data to 50 proc times 50 func.

After filtering we can do a quick sort with a time approximation of O(cnt * log(cnt)), which is faster than iteratively selecting the largest element (time approximation is O(cnt²)). And then select the top cnt elements.

Some steps can be done in parallel (using MultiThread) and skipped if proc (or func) is smaller than cnt.

The resulting time approximation would be O(proc * func + cnt² * log(cnt²)).

t-b commented 1 year ago

Execution time is really close to a minute:

•print secs2time(datetime, 3, 3); ShowTopFunctions(50, mode = UTF_ANALYTICS_FUNCTIONS); print secs2time(datetime, 3, 3)
  21:46:53.697
[...]
  21:47:52.981

t-b commented 1 year ago

Edited the original comment with updated pxp.

Garados007 commented 1 year ago

I have done some function profiling and the results are ... interesting ...

Total time: 37.2419, Time in Function code: 37.2419 (100%)
Top function percentages:
Function igortest-tracing-analytics.ipf GetTotals: 100%

Annotated Top Functions:

*******************************************************************************************
Function: igortest-tracing-analytics.ipf igortest-tracing-analytics.ipf#GetTotals; Percent total 100%
*******************************************************************************************
[00]            |static Function/WAVE GetTotals()
[00]            |   variable i, numWaves
[00]            |
[00]            |   TUFXOP_GetStorage/N="IUTF_TestRun" storage
[00]*           |   numWaves = DimSize(storage, UTF_ROW)
[00]            |   WAVE/ZZ totals
[00]*           |   for(i = 0; i < numWaves; i++)
[00]*           |       WAVE/WAVE/Z entryOuter = storage[i]
[00]*           |       if(!WaveExists(entryOuter))
[00]            |           continue
[00]            |       endif
[00]*           |       WAVE entry = entryOuter[0]
[00]*           |       if(WaveExists(totals))
[100]********** |           totals += entry
[00]            |       else
[00]*           |           Duplicate/FREE=1 entry, totals
[00]            |       endif
[00]            |   endfor
[00]            |
[00]            |   if(!WaveExists(totals))
[00]            |       Make/N=0/FREE=1 totals
[00]            |   endif
[00]            |
[00]            |   return totals
[00]            |End

The storage wave "IUTF_TestRun" contains 449 rows out of which are 426 are not null. Each row points to another wave with 1000 x 3 x 87 entries. They are about 117,189,000 values to add...

If I remove the line totals += entry, the function ShowTopFunctions is finished in 0.0345498 seconds! We need to improve the performance of GetTotals.

(I miss interpreted the GetTotals function completely and just assigned O(lines) to it. But the data is several times larger than that...)

Garados007 commented 1 year ago

I found a possible solution. We can copy all the data of the wave reference wave into a normal 4D (1000 x 3 x 87 x 449 entries) wave in a multithreaded operation. After that we need to call SumDimension/D=3 and have our totals.

t-b commented 1 year ago

@Garados007 Go for it!

Garados007 commented 1 year ago

I could reduce the time to 9.147 seconds but I am not really happy about it. Here is the function profiling result:

Total time: 9.14777, Time in Function code: 8.72328 (95.4%)
Top function percentages:
Function igortest-tracing-analytics.ipf GetTotals2: 100%

Annotated Top Functions:

*******************************************************************************************
Function: igortest-tracing-analytics.ipf igortest-tracing-analytics.ipf#GetTotals2; Percent total 100%
*******************************************************************************************
[00]            |static Function/WAVE GetTotals2()
[00]            |   variable i, numWaves
[00]            |   variable bufX, bufY, bufZ
[00]            |
[00]*           |   TUFXOP_GetStorage/N="IUTF_TestRun" storage
[00]            |   numWaves = DimSize(storage, UTF_ROW)
[00]            |   
[00]            |   for(i = 0; i < numWaves; i++)
[00]            |       WAVE/WAVE/Z entryOuter = storage[i]
[00]            |       if(!WaveExists(entryOuter))
[00]            |           continue
[00]            |       endif
[00]*           |       WAVE entry = entryOuter[0]
[00]            |       
[00]            |       bufX = DimSize(entry, UTF_ROW)
[00]            |       bufY = DimSize(entry, UTF_COLUMN)
[00]            |       bufZ = DimSize(entry, UTF_LAYER)
[00]            |       
[00]            |       break
[00]            |   endfor
[00]            |   
[10]*           |   Make/FREE=1/N=(bufX, bufY, bufZ, numWaves) buffer
[66]*******     |   MultiThread storage[] = CopyTotalsToBuffer(p, storage, buffer)
[00]            |   
[24]**          |   SumDimension/D=(UTF_CHUNK)/DEST=totals/Y=(-1) buffer
[00]            |
[00]*           |   return totals
[00]            |End

And the implementation of CopyTotalsToBuffer:

threadsafe static Function/WAVE CopyTotalsToBuffer(i, storage, buffer)
    variable i
    Wave/WAVE storage
    Wave buffer

    WAVE/WAVE/Z entryOuter = storage[i]
    if(!WaveExists(entryOuter))
        return entryOuter
    endif
    WAVE entry = entryOuter[0]

    buffer[][][][i] = entry[p][q][r]

    return entryOuter
End

@t-b @MichaelHuth Do you have an idea to copy the entries to the buffer even faster than the above method? I had tried Concatenate { waveRefWave }, buffer but the results are really bad (but slightly better than the original implementation).

Garados007 commented 1 year ago

There is a reason I am not really happy about the result. SumDimension has to access all values from the buffer at least once and create a new wave with all the data. Copying the original data to the buffer 1:1 does take 3-times the time than SumDimension needs and I do it multi-threaded!

t-b commented 1 year ago

Why not just use

            FastOp totals += entry

in the orignal code? This makes it run in 4s here.

byte-physics / igortest

ShowTopFunctions is too slow #456