BNL-DAQ-LDRD / NeuralCompression


Test inference performance optimization on CPU #11

Open blackcathj opened 2 years ago

YHRen commented 2 years ago

Measure inference performance in units of "instances / sec" in four settings.

The measurement should exclude initial memory allocation time. Batch size will matter a lot for GPU, so maybe try two settings: one data point per batch, and as many as possible to fill the 48 GB of GPU memory.
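The points above (warmup iterations to exclude one-time allocation costs, then throughput in instances/sec at different batch sizes) could be sketched roughly like this; `dummy_model`, the batch sizes, and the iteration counts are placeholders, not anything from this repo:

```python
import time

def dummy_model(batch):
    # Stand-in for the real network: any callable that consumes a batch.
    return [x * 2 for x in batch]

def throughput(model, batch_size, n_warmup=3, n_iters=20):
    """Return inference throughput in instances/sec.

    Warmup iterations run first, so one-time costs (initial memory
    allocation, lazy initialization, cache fill) are excluded from
    the timed region.
    """
    batch = list(range(batch_size))
    for _ in range(n_warmup):
        model(batch)
    start = time.perf_counter()
    for _ in range(n_iters):
        model(batch)
    elapsed = time.perf_counter() - start
    return n_iters * batch_size / elapsed

if __name__ == "__main__":
    # Compare a batch size of 1 against a large batch; for a real GPU
    # run, the large batch would be sized to fill the available memory.
    for bs in (1, 256):
        print(f"batch={bs}: {throughput(dummy_model, bs):.0f} instances/sec")
```

For GPU timing the same structure applies, but a synchronization call (e.g. `torch.cuda.synchronize()` in PyTorch) would be needed before reading the clock, since kernel launches are asynchronous.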

@blackcathj can we group data together in a production system?

YHRen commented 2 years ago

CPU-optimized inference for sparse networks; consider DeepSparse in the future: