[LArbys/cafef] IO optimization

After benchmarking on a Pascal Titan X and comparing its speed against a Maxwell Titan X, I realized the speedup does not always carry over from one network to another. It turned out that, for one network I was testing, IO wait became significant on Pascal because of the increased compute power (i.e. GPU computation time went down while IO wait on the CPU stayed the same, so IO wait became a larger fraction of the total time).
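To make the effect concrete, here is a small arithmetic sketch. The timings are hypothetical numbers chosen only for illustration (they are not my benchmark results), and the function name `io_wait_fraction` is made up for this example. It assumes IO wait and GPU compute do not overlap.

```python
def io_wait_fraction(gpu_time_ms: float, io_wait_ms: float) -> float:
    """Fraction of one iteration spent waiting on IO, assuming no overlap."""
    return io_wait_ms / (gpu_time_ms + io_wait_ms)


# Hypothetical timings per iteration, for illustration only.
io_wait_ms = 20.0    # CPU-side IO wait, the same on both GPUs
older_gpu_ms = 80.0  # GPU compute time on the older card
newer_gpu_ms = 40.0  # GPU compute time on a card that is 2x faster

older = io_wait_fraction(older_gpu_ms, io_wait_ms)
newer = io_wait_fraction(newer_gpu_ms, io_wait_ms)
speedup = (older_gpu_ms + io_wait_ms) / (newer_gpu_ms + io_wait_ms)

print(f"IO wait fraction, older GPU: {older:.1%}")
print(f"IO wait fraction, newer GPU: {newer:.1%}")
print(f"End-to-end speedup: {speedup:.2f}x (GPU alone: 2.00x)")
```

With these made-up numbers the GPU is 2x faster but the iteration is only about 1.67x faster, because the fixed IO wait grows from 20% to about 33% of the iteration.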

This came down to 2 components: one is the memory copy from CPU to GPU memory, about which I cannot do anything simple, and the other is the IO itself, which can be optimized.

The IO optimization has 2 parts: a) memory allocation of the blob at each mini-batch data load, and b) copying data from the LArCV IO manager into the blob.

To optimize performance, both should be threaded. This requires a total of 3 threads:
0) the LArCV IO thread (already exists and is in use)
1) a blob memory allocation thread, to cover a)
2) a thread that copies data from LArCV into the blob once 0) and 1) have completed, to cover b), which means this thread manages the other 2 threads
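The thread layout above can be sketched roughly as follows. This is only an illustration of the structure in Python, not the actual layer code: `load_batch`, `allocate_blob`, `fill_blob` and `prefetch` are hypothetical names, the "IO" is a stand-in that generates numbers instead of calling LArCV, and the blob is a plain list. Python threads also do not give real CPU parallelism for this kind of work, so read it for the ordering of start/join only.

```python
import threading


def load_batch(out: dict, batch_size: int, n_values: int) -> None:
    """Thread 0 stand-in: pretend to read one mini-batch from disk."""
    out["data"] = [
        [float(i * n_values + j) for j in range(n_values)]
        for i in range(batch_size)
    ]


def allocate_blob(out: dict, batch_size: int, n_values: int) -> None:
    """Thread 1: allocate the flat buffer for the mini-batch."""
    out["blob"] = [0.0] * (batch_size * n_values)


def fill_blob(result: dict, batch_size: int, n_values: int) -> None:
    """Thread 2: start 0) and 1), wait for both, then copy data into the blob."""
    loaded, allocated = {}, {}
    io_thread = threading.Thread(
        target=load_batch, args=(loaded, batch_size, n_values))
    alloc_thread = threading.Thread(
        target=allocate_blob, args=(allocated, batch_size, n_values))
    io_thread.start()
    alloc_thread.start()
    io_thread.join()
    alloc_thread.join()

    blob = allocated["blob"]
    for i, row in enumerate(loaded["data"]):
        blob[i * n_values:(i + 1) * n_values] = row
    result["blob"] = blob


def prefetch(batch_size: int, n_values: int):
    """Called by the main thread: launch thread 2 and return immediately."""
    result = {}
    filler = threading.Thread(
        target=fill_blob, args=(result, batch_size, n_values))
    filler.start()
    return filler, result


filler, result = prefetch(batch_size=2, n_values=3)
# The main thread would run the GPU computation for the previous batch here.
filler.join()
print(result["blob"])
```

The main thread only ever waits on thread 2, so allocation, reading, and copying for the next mini-batch can all proceed while the current one is being processed.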

I will implement the above in the root data layer, where the main thread instantiates 2), which in turn instantiates 0) and 1). It would also be nice to add an option to measure the time spent at each stage and report it periodically, so that anyone can notice this kind of problem in the future.
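For the timing option, something along these lines could work. Again this is a sketch in Python under my own assumptions: the `StageTimer` class, its `report_every` parameter, and the stage names are hypothetical and do not correspond to anything that exists in the layer today.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock time per named stage and report every N iterations."""

    def __init__(self, report_every: int = 100):
        self.report_every = report_every
        self.totals = defaultdict(float)
        self.iterations = 0

    @contextmanager
    def stage(self, name: str):
        """Time the enclosed block and add it to the running total for `name`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def end_iteration(self):
        """Return mean seconds per stage every `report_every` iterations, else None."""
        self.iterations += 1
        if self.iterations % self.report_every != 0:
            return None
        report = {
            name: total / self.report_every
            for name, total in self.totals.items()
        }
        self.totals.clear()
        return report


timer = StageTimer(report_every=2)
reports = []
for _ in range(4):
    with timer.stage("alloc"):
        blob = [0.0] * 1000
    with timer.stage("copy"):
        blob[:] = [1.0] * 1000
    report = timer.end_iteration()
    if report is not None:
        reports.append(report)
        print({name: f"{secs * 1e6:.1f} us" for name, secs in report.items()})
```

Reporting a mean over a window, rather than every iteration, keeps the log short while still making a growing IO share visible.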


[LArbys/cafef] IO optimization #80