benjamin-kramer opened 7 months ago
Note - Running the model with CPU+ANE (instead of `.all` compute units, which defaults to GPU and ANE) improves the running time (at least on an iPhone 12), but it doesn't solve this issue.
Note 2 - There is a piece of code here that looks like it should allow points to be ignored, but it doesn't work (setting a label of -1 interferes with the detection of points with 0/1 labels).
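For reference, the upstream SAM code pads point prompts with label -1 to mark them as ignored. A minimal numpy sketch of that convention (the helper name `pad_points` is made up for illustration; this is not part of the exported Core ML model):

```python
import numpy as np

def pad_points(coords, labels, target=16):
    """Pad an (N, 2) point array and an (N,) label array out to a fixed
    number of points. Padding points get coordinate (0, 0) and label -1,
    the value SAM's prompt encoder treats as "ignore this point"."""
    n = coords.shape[0]
    if n > target:
        raise ValueError(f"more than {target} points")
    pad = target - n
    padded_coords = np.concatenate(
        [coords, np.zeros((pad, 2), dtype=coords.dtype)], axis=0)
    padded_labels = np.concatenate(
        [labels, -np.ones(pad, dtype=labels.dtype)], axis=0)
    return padded_coords, padded_labels

coords, labels = pad_points(
    np.array([[100.0, 200.0]]), np.array([1], dtype=np.int64))
print(coords.shape, labels)  # (16, 2), first label 1, rest -1
```

As the note above says, in the converted model the -1 labels interfere with the 0/1 labels, so this convention does not survive the Core ML export as-is.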
Unfortunately, we were unable to solve the problem either. I benchmarked with Instruments on a real iPhone 14 and got results similar to yours. It seems that CoreML does not support dynamic batching well. I also tried adding padding points, and that did not work well either.
When the decoder is run repeatedly with the same number of points, the running time may approach the number reported by Xcode's performance analysis. However, that is not the usual use case for this model. Usually one selects points iteratively, so the size of the model input changes on every call (first `256x64x64, 1x1x2, 1x1`, then `256x64x64, 1x2x2, 1x2`, and so on). Every time the model is run with a different input size, some internal CoreML state is discarded, and the running time is that of a first run (which is ~10x slower!). If the model could be designed so that it always runs with the same number of points (16), with some of the points being ignored, perhaps that would resolve this issue (but I really have no idea if that's possible).
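The fixed-size scheme suggested above could be driven from the caller's side: always pad the prompt to a constant point count before each prediction, so the decoder sees identical input shapes on every click. A sketch under that assumption (`run_decoder`, `MAX_POINTS`, and the lambda standing in for the Core ML predict call are all hypothetical):

```python
import numpy as np

MAX_POINTS = 16  # assumed fixed capacity baked into the exported model

def run_decoder(decoder, embedding, coords, labels):
    """Pad the prompt to MAX_POINTS with label -1 so the decoder always
    sees shapes (1, 16, 2) and (1, 16), no matter how many points the
    user has clicked. `decoder` stands in for the model's predict call."""
    pad = MAX_POINTS - len(coords)
    coords = np.concatenate([coords, np.zeros((pad, 2), np.float32)])
    labels = np.concatenate([labels, np.full(pad, -1, np.int64)])
    return decoder(embedding, coords[None], labels[None])

# Iterative use: input shapes never change between clicks, so CoreML
# should not have to discard its compiled state and re-specialize.
clicks, labels = [], []
for x, y in [(120, 80), (200, 150), (60, 300)]:
    clicks.append([x, y]); labels.append(1)
    mask = run_decoder(lambda e, c, l: (c.shape, l.shape),
                       None,
                       np.array(clicks, np.float32),
                       np.array(labels, np.int64))
print(mask)  # ((1, 16, 2), (1, 16)) on every iteration
```

Whether this actually avoids the first-run penalty depends on the -1 label issue from the notes above being fixed inside the exported model.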