flux-framework / PerfFlowAspect

An Aspect Oriented Programming (AOP)-based tool to analyze cross-cutting performance concerns of composite science workflows.
https://perfflowaspect.readthedocs.io
GNU Lesser General Public License v3.0
2 stars 15 forks source link

C/C++: Read/Write to temporary file is adding Overhead to Duration in Compact format #159

Open alizalisan opened 4 months ago

tpatki commented 4 months ago

@alizalisan @loudsun1997

On the python side, we don't have this problem as we have an "around" decorator and we don't need to call the before/after functions to mimic around; as a result we don't have to keep track of values separately in a file (or in memory).

As a first cut, we're going to make sure that the start timestamp is updated here: https://github.com/flux-framework/PerfFlowAspect/blob/82cf516676dc2688070e218ec88f9fac95e75ea6/src/c/runtime/advice_chrome_tracing.cpp#L757. Essentially, either we create the event here, or update my_ts here, so we don't include the file creation time (PFA overhead) in our reporting (measurement of user/application function).

In the future, we need to think about how to handle this differently: is there a way to not use files that is convenient? Using file I/O here can add to overheads.

alizalisan commented 3 months ago

@tpatki @loudsun1997

  1. The ending timestamp is being recorded at the correct and same time for both compact and verbose formats. It does not need to be moved as the ending timestamp is recorded at the very beginning of the after function. It does not include durations for file opening and reading.

  2. The beginning timestamp can be updated in the before function up until the point where it is written to the file which makes a little difference. The issue is that the single file write operation is taking around 250-430 microseconds.

  3. The difference seems more significant with tiny functions like bas or bar which are not compute intensive. This can be ignored in compute intensive/actual functions i.e. becomes insignificant.

  4. Following are the durations for compact and verbose formats for the three functions from smoketest3.py. This data includes the change where the beginning timestamp is being recorded before writing it to the file.

    Function            Verbose             Compact
    foo                309ms 87us         309ms 651us
    bar                108us              2ms 209us
    bas                55us               439us
tpatki commented 1 month ago

@slabasan Tagging you on this issue.

One efficient solution for this is to probably have a per-thread "stack" (function values will be pushed in order of arrival and popped as they exit). We can push the timestamp in "before", so we capture these in the order of arrival, and we pop from the stack as we enter the "after". The ideal way to implement this is with a dynamically allocated "stack" (can be implemented with std::deque) that is per-thread. But this can be complex to track, as PFA will need to be aware of the number of threads (and the associated process IDs) to ensure that we create enough of these stacks. But such an approach should reduce our overhead drastically compared to the potentially hundreds of files we are creating right now.

As of right now, we're creating num_processes * num_threads * all_annotated_functions number of temp files (!!), which was our first cut at supporting this 2 years ago. This overwhelms the file system and causes a massive overhead when computation time is small. If you have a lot of function calls, this also can trigger ulimit problems with respect to open file descriptors and the file-sizes. Note that in an application that's very compute intensive, this overhead is very small, but if you have an app that has less compute, the I/O time from all the files we create dominates.