Comparing the floating-point operation throughput or memory throughput (whichever makes more sense) of a particular kernel to the corresponding peak theoretical throughput of the device indicates how much room for improvement there is for the kernel.
[ ] select a number of typical kernels (for instance from typical CUDA examples)
[ ] choose a GPU to run these on
[ ] measure the corresponding throughput for each kernel
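
Once a kernel has been timed, the comparison against the device peak is simple arithmetic. The sketch below is a minimal illustration, not part of any CUDA tooling: the function names and all numbers are hypothetical. It assumes the kernel is memory-bound, that the bytes read and written are known from the problem size, and that peak memory throughput can be estimated as memory clock × bus width in bytes × transfers per clock (2 for double-data-rate memory).

```python
def effective_throughput_gbs(bytes_read, bytes_written, elapsed_s):
    """Effective memory throughput of one kernel launch, in GB/s."""
    return (bytes_read + bytes_written) / 1e9 / elapsed_s


def peak_memory_throughput_gbs(memory_clock_hz, bus_width_bits, transfers_per_clock=2):
    """Theoretical peak memory throughput of the device, in GB/s.

    transfers_per_clock=2 assumes double-data-rate memory.
    """
    return memory_clock_hz * (bus_width_bits / 8) * transfers_per_clock / 1e9


def fraction_of_peak(measured_gbs, peak_gbs):
    """Share of the theoretical peak that the kernel actually reaches."""
    return measured_gbs / peak_gbs


if __name__ == "__main__":
    # Hypothetical copy kernel: 16 MB read, 16 MB written, 1 ms runtime.
    measured = effective_throughput_gbs(16e6, 16e6, 1e-3)
    # Hypothetical device: 1 GHz memory clock, 256-bit bus, DDR memory.
    peak = peak_memory_throughput_gbs(1.0e9, 256)
    print(f"measured: {measured:.1f} GB/s, peak: {peak:.1f} GB/s, "
          f"fraction of peak: {fraction_of_peak(measured, peak):.2f}")
```

For a compute-bound kernel the same ratio applies with floating-point operations per second in place of bytes per second. The elapsed time would come from whatever timing method is chosen for the measurements, for instance CUDA events or a profiler.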
The CUDA performance guidelines state: