bentsherman closed 7 years ago
Based on further analysis, I think the timer module consistently reports times that are roughly 10x too long, based on the time-stamps of log messages and intuition (what felt like 2-3 s was reported as 23 s). The root cause seems to be the clock() function, so maybe I'm interpreting the output of clock() incorrectly.
It turns out the issue was my understanding of CPU time. When an application runs multiple threads, CPU time includes the time from each thread and adds them together. Since OpenBLAS uses pthreads, our system uses multiple cores when they are available. I think that explains the discrepancy in time -- I thought the times were roughly 10x too long, but I was using a node with 8 CPUs.
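To make the distinction concrete: with 8 busy threads, 2-3 s of elapsed time corresponds to roughly 16-24 s of summed CPU time, which is in line with the numbers above. Below is a minimal sketch of measuring both quantities side by side, assuming a POSIX system where clock_gettime() and CLOCK_MONOTONIC are available. The helper names (cpu_seconds, wall_seconds, report_times) are hypothetical and are not taken from the project's timer module.

```c
#define _POSIX_C_SOURCE 199309L  /* exposes clock_gettime() under strict -std=c99 */
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* CPU time used by the whole process, in seconds.
 * On POSIX systems this is summed over all threads, so a
 * multithreaded BLAS call can make it exceed elapsed time. */
double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Elapsed (wall-clock) time from a monotonic clock, in seconds.
 * Differences between two calls should match log timestamps. */
double wall_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);  /* CLOCK_MONOTONIC is assumed to be supported */
    (void)rc;         /* silence unused warning when NDEBUG is set */
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical helper: run work() and print both measurements. */
void report_times(const char *label, void (*work)(void))
{
    double cpu0 = cpu_seconds();
    double wall0 = wall_seconds();
    work();
    printf("%s: cpu %.3f s, wall %.3f s\n",
           label, cpu_seconds() - cpu0, wall_seconds() - wall0);
}
```

For single-threaded work the two numbers should be close. When work() calls into a multithreaded library, the CPU figure can approach the wall figure multiplied by the number of busy cores, so a timer meant to report elapsed time would probably want the wall_seconds() style of measurement.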
I'm pretty sure this is happening now: sometimes ./face-rec will report that training time was 20-30 s when in fact it was only 2-3 s. You can see this error yourself by tracking the timestamps of log messages. It seems to be off by a factor of 10 consistently, so hopefully this issue won't be too difficult to trace.