CenterPoint Backbone preprocessing optimization

The current implementation of scatter has some limitations.

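The GPU implementation hard-codes the iterator bindings, which might not work on certain devices. For example, with the OpenCL backend, the bindings below use two grid dimensions (blockIdx.x and blockIdx.y), so they cannot be mapped onto a GPU that supports only a one-dimensional global work size.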

    for j in T.thread_binding(0, 560, thread="blockIdx.x"):
        for k in T.thread_binding(0, 560, thread="blockIdx.y"):
            for i in T.thread_binding(0, 32, thread="threadIdx.x"):
                ...  # scatter body omitted
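
To illustrate what a device with a single work-size dimension would need, here is a minimal sketch in plain Python (not TVM code) of fusing the three loop variables into one global index and recovering them again. The extents 560, 560 and 32 come from the bindings above; the names `flatten`, `unflatten` and `GLOBAL_SIZE` are hypothetical and do not exist in the current implementation.

```python
# Extents taken from the hard-coded bindings (j, k, i).
J_EXTENT, K_EXTENT, I_EXTENT = 560, 560, 32

# Total number of work items if everything is launched along one dimension.
GLOBAL_SIZE = J_EXTENT * K_EXTENT * I_EXTENT


def flatten(j, k, i):
    """Map a (j, k, i) triple to a single row-major global index (hypothetical helper)."""
    return (j * K_EXTENT + k) * I_EXTENT + i


def unflatten(gid):
    """Recover (j, k, i) from a one-dimensional global index (hypothetical helper)."""
    jk, i = divmod(gid, I_EXTENT)
    j, k = divmod(jk, K_EXTENT)
    return j, k, i
```

A schedule-based implementation could probably express the same idea by fusing the three loops and splitting the result into a single block/thread pair, but that is an assumption and has not been tried here.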

Because the bindings are hard-coded, there is no room for optimization. Normally, we would create a schedule from the IRModule and define optimization strategies on it.
We need to create an optimized schedule and measure its performance.

autowarefoundation/modelzoo#83