We were leaving the label-prediction tensors on the GPU,
which was forcing lots of expensive device-to-host
transfers to read the data. In particular, the profile revealed the following:
Scalene's GPU-time reporting is generally full of false attributions,
but in this case it was a hint in the right direction.
Based on my interpretation of the code, we were being forced
to access this tensor off the GPU three times for every input
word in the document.
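The fix amounts to paying for one bulk transfer instead of one per access. Here is a minimal sketch of the pattern, with hypothetical names (the real code differs); note that indexing a CUDA tensor element-by-element from Python forces a device-to-host sync on every read:

```python
import torch

def labels_for(words, label_preds):
    # Copy the prediction tensor to system memory ONCE, up front.
    # Indexing the CUDA tensor directly (label_preds[i].item()) would
    # instead trigger a GPU->CPU transfer for every single word.
    preds = label_preds.cpu().tolist()
    return [preds[i] for i in range(len(words))]

# Toy usage; the tensor stays on CPU here, but .cpu() is a no-op in
# that case, so the same code works either way.
words = ["the", "cat", "sat"]
label_preds = torch.tensor([2, 0, 1])
print(labels_for(words, label_preds))  # [2, 0, 1]
```

The key call is `.cpu().tolist()`: after it, the per-word loop touches only ordinary Python objects and never goes back to the device.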
After pulling the label preds into system memory:
I confirmed that the remaining high-GPU items reported by the
profiler have nothing to do with the GPU,
and the code is too scary to futz about with anyway.