benjamin-kramer opened 7 months ago
Note - Running the model with CPU+ANE (instead of `.all` compute units, which defaults to GPU and ANE) improves the running time (at least on an iPhone 12), but it doesn't solve this issue.
Note 2 - There is a piece of code here that looks like it should allow points to be ignored, but it doesn't work (setting a label of -1 interferes with the detection of points with 0/1 labels).
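For reference, the upstream SAM code pads point prompts with label -1 to mark them as ignored. A minimal numpy sketch of that convention (the helper name `pad_points` is made up for illustration; this is not part of the exported Core ML model):

```python
import numpy as np

def pad_points(coords, labels, target=16):
    """Pad an (N, 2) point array and an (N,) label array out to a fixed
    number of points. Padding points get coordinate (0, 0) and label -1,
    the value SAM's prompt encoder treats as "ignore this point"."""
    n = coords.shape[0]
    if n > target:
        raise ValueError(f"more than {target} points")
    pad = target - n
    padded_coords = np.concatenate(
        [coords, np.zeros((pad, 2), dtype=coords.dtype)], axis=0)
    padded_labels = np.concatenate(
        [labels, -np.ones(pad, dtype=labels.dtype)], axis=0)
    return padded_coords, padded_labels

coords, labels = pad_points(
    np.array([[100.0, 200.0]]), np.array([1], dtype=np.int64))
print(coords.shape, labels)  # (16, 2), first label 1, rest -1
```

As the note above says, in the converted model the -1 labels interfere with the 0/1 labels, so this convention does not survive the Core ML export as-is.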
Unfortunately, we were unable to solve the problem either. I benchmarked with Instruments on a real iPhone 14 and got results similar to yours. It seems that CoreML does not support dynamic batching well. I also tried adding padding points, and that did not work well either.
When the decoder is run repeatedly with the same number of points, the running time may approach the number reported by Xcode's performance analysis. However, that is not the usual use case for this model. Usually one selects points iteratively, so the size of the model input changes on every call (first `256x64x64, 1x1x2, 1x1`, then `256x64x64, 1x2x2, 1x2`, and so on). Every time the model is run with a different input size, some internal CoreML state is discarded, and the running time is that of a first run (which is ~10x slower!). If the model could be designed so that it always runs with the same number of points (16), with some of the points being ignored, perhaps that would resolve this issue (but I really have no idea if that's possible).
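The fixed-size scheme suggested above could be driven from the caller's side: always pad the prompt to a constant point count before each prediction, so the decoder sees identical input shapes on every click. A sketch under that assumption (`run_decoder`, `MAX_POINTS`, and the lambda standing in for the Core ML predict call are all hypothetical):

```python
import numpy as np

MAX_POINTS = 16  # assumed fixed capacity baked into the exported model

def run_decoder(decoder, embedding, coords, labels):
    """Pad the prompt to MAX_POINTS with label -1 so the decoder always
    sees shapes (1, 16, 2) and (1, 16), no matter how many points the
    user has clicked. `decoder` stands in for the model's predict call."""
    pad = MAX_POINTS - len(coords)
    coords = np.concatenate([coords, np.zeros((pad, 2), np.float32)])
    labels = np.concatenate([labels, np.full(pad, -1, np.int64)])
    return decoder(embedding, coords[None], labels[None])

# Iterative use: input shapes never change between clicks, so CoreML
# should not have to discard its compiled state and re-specialize.
clicks, labels = [], []
for x, y in [(120, 80), (200, 150), (60, 300)]:
    clicks.append([x, y]); labels.append(1)
    mask = run_decoder(lambda e, c, l: (c.shape, l.shape),
                       None,
                       np.array(clicks, np.float32),
                       np.array(labels, np.int64))
print(mask)  # ((1, 16, 2), (1, 16)) on every iteration
```

Whether this actually avoids the first-run penalty depends on the -1 label issue from the notes above being fixed inside the exported model.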