Closed by csuter 4 years ago
Your take was also very reasonable! We could benchmark them against each other, but I'd be surprised to see a big perf diff. I'm still worried about the impact of XLA compilation on runtime. We'll keep pursuing this on our end.
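One way to compare the two versions while keeping XLA compilation cost separate from steady-state runtime is to time the first call on its own and then take the median of later calls. This is a minimal sketch using only the standard library. `benchmark` and `make_fake_step` are hypothetical names for illustration, and the fake step just simulates a one-off warm-up cost; it is not the model code from this thread.

```python
import statistics
import time


def benchmark(fn, *args, repeats=5):
    """Return (first_call_seconds, median_later_seconds) for fn(*args)."""
    start = time.perf_counter()
    fn(*args)  # the first call includes any one-off tracing/compilation cost
    first = time.perf_counter() - start

    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        later.append(time.perf_counter() - start)
    return first, statistics.median(later)


def make_fake_step(warmup_seconds=0.05):
    """Hypothetical stand-in: slow on the first call, fast afterwards."""
    state = {"warm": False}

    def step(n):
        if not state["warm"]:
            time.sleep(warmup_seconds)  # simulated compilation delay
            state["warm"] = True
        return sum(i * i for i in range(n))

    return step


if __name__ == "__main__":
    first, steady = benchmark(make_fake_step(), 10_000)
    print(f"first call: {first:.4f}s, steady state: {steady:.4f}s")
```

To use this on real code, pass the compiled and non-compiled versions of the same function to `benchmark` in turn. Frameworks that dispatch work asynchronously need the result forced inside `fn` (for example by converting the output to a NumPy array), otherwise the timings only measure dispatch.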
For some weird reason the GPU on my laptop still isn't playing ball, but FWIW on the CPU I get a 74s runtime. However, CPU usage is only about 103%, i.e. roughly one core. I would have expected more, given the scope for parallelising the Kroneckers in the rate calculation and the batched expm and sampling functions.
I was able to run on GPU (with XLA turned off) and got around the same runtime as on CPU. I didn't peek at utilization though.
This brings the runtime for 195 days down to about 60 sec. It's still really slow in XLA mode, and I'm not sure why. We can follow up with some performance analysis on our side.
I am reasonably confident that what I wrote is correct, but we should verify.