ReconfigureIO / reco-sdaccel


Seemingly large overhead running vivado on docker #247

Open pwaller opened 5 years ago

pwaller commented 5 years ago

Take this perf analysis, for example, recorded while a synthesis is running in vivado:

sudo yum install -y perf
sudo perf record -g -a sleep 10
sudo perf report
[snip]
   - 34.64% sys_read
      - 34.51% vfs_read
         - 34.26% seq_read
            - 34.01% proc_single_show
               - 33.81% proc_pid_status
                  + 11.45% render_sigset_t
                  + 10.17% cpuset_task_status_allowed
                  + 5.09% seq_printf
                  + 3.30% task_mem
                  + 2.15% render_cap_t
[snip]

It shows 34% of sampled cycles spent in sys_read (i.e. the read syscall), and within that, nearly all the time is going to proc_pid_status. This makes no sense to me. Why would we be spending so much of our CPU budget interrogating pid status and formatting it in seq_printf?

According to my mental model, if this measurement is correct, we're wasting a lot of time in the kernel when we could be using it for synthesis. The obvious culprit would be seccomp, which effectively runs a Berkeley Packet Filter on every syscall invocation. However, the above measurement was taken in a docker container which was invoked with --security-opt seccomp=unconfined, so that theory is questionable. I'd love to know why proc_single_show is appearing in our stack traces.
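
For reference, the container was launched along these lines (the image name and mount path here are placeholders, but --security-opt seccomp=unconfined is the flag referred to above):

docker run -it --security-opt seccomp=unconfined -v /path/to/examples:/mnt/examples <builder-image> bash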

pwaller commented 5 years ago

Another important measurement:

root@8f54deb133ef:/mnt/examples/histogram# time /opt/sdaccel-builder/sdaccel-builder image
[snip]

real    126m30.430s
user    216m2.451s
sys 91m6.848s

i.e. to build the histogram example we're spending 91 CPU-minutes in sys against a wall clock of 126 minutes. The build might go a lot faster if we could reduce that sys overhead...
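
Rough arithmetic on those numbers: sys / (user + sys) ≈ 91 / (216 + 91) ≈ 30% of all CPU time is going to the kernel, and (user + sys) / real ≈ 307 / 126.5 ≈ 2.4 cores busy on average.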

pwaller commented 5 years ago

Minor note: this stack trace shows up in an example trace in the Linux docs, so perhaps it isn't as uncommon as I thought:

https://github.com/torvalds/linux/blob/7566ec393f4161572ba6f11ad5171fd5d59b0fbd/Documentation/trace/ftrace.rst#latency-trace-format

pwaller commented 5 years ago

Oh, I think I've been totally misinterpreting the stack trace, and now I think I know what's going on. proc_single_show is actually part of the procfs implementation, i.e. these are read() syscalls on procfs... https://github.com/torvalds/linux/blob/7566ec393f4161572ba6f11ad5171fd5d59b0fbd/fs/proc/base.c#L725-L753

Yikes. This seems really bananas. Unfortunately perf doesn't seem to be capturing the stacks of the read syscalls. But this still smells quite bad.
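
One thing worth trying next (untested sketch, but these are standard perf/strace invocations): record the read syscall tracepoint directly with call graphs, which should attribute the reads to the processes issuing them:

sudo perf record -e 'syscalls:sys_enter_read' -a -g -- sleep 10
sudo perf report

Alternatively, attach strace to a suspect PID and look for /proc paths being opened (<pid> is a placeholder):

sudo strace -f -e trace=openat -p <pid> 2>&1 | grep '/proc/'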

pwaller commented 5 years ago

By the way, I left a second run going with seccomp disabled on the same machine. Here is the timing. It's only a 5% wall clock effect and the sys CPU% was even higher (!?). So I'm fairly perplexed.

real    119m13.907s
user    284m32.525s
sys 141m2.333s
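
Side by side with the first run: wall clock went from ~126.5 min down to ~119.2 min (roughly 5-6%), but the kernel's share of CPU time went from roughly 30% (91 / 307) up to about 33% (141 / 425.5).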

It definitely seems wrong that so much CPU time should be spent inspecting the proc filesystem, though.