Thanks for your awesome work on image inpainting and the great effort on the open-source project!
I'm currently working on applying the "pixel-query" design to medical image inpainting, and I found your work really beneficial for high-resolution medical image inpainting, thank you!
Here I'm curious about the design of the pixel-wise MLP used for decoding (referred to below as design A). From the paper and your code, my understanding (please correct me if anything is wrong) is that the parameter generation network generates MLP parameters pixel-wise: each of the low-resolution pixels (16×16) gets its own set of parameters, based on the feature map output by the attention FFC blocks.
After that, given a queried pixel at the high resolution, the code first finds the nearest low-resolution pixel and uses the corresponding pixel-wise MLP parameters, taking the coordinates and the image scale into account.
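To make sure I've understood design A correctly, here is a minimal PyTorch-style sketch of how I picture it; the names (`decode_pixel_query`, `param_gen`, `mlp_apply`) and shapes are my own assumptions, not taken from your code:

```python
import torch

# My rough sketch of design A: one set of MLP parameters per low-resolution pixel,
# and each high-resolution query uses the parameters of its nearest low-res pixel.
def decode_pixel_query(feat_lr, coords_hr, param_gen, mlp_apply):
    """
    feat_lr:   (B, C, 16, 16) feature map from the attention FFC blocks
    coords_hr: (B, N, 2) normalized coordinates of queried high-res pixels in [-1, 1]
    param_gen: layer mapping each low-res feature vector to its own MLP parameters
    mlp_apply: function running a tiny MLP given per-pixel params and the query input
    """
    B, C, H, W = feat_lr.shape
    # one parameter set per low-resolution pixel -> (B, H*W, P)
    params = param_gen(feat_lr.flatten(2).transpose(1, 2))

    # nearest low-resolution pixel for every high-resolution query
    ix = ((coords_hr[..., 0] + 1) / 2 * (W - 1)).round().long().clamp(0, W - 1)
    iy = ((coords_hr[..., 1] + 1) / 2 * (H - 1)).round().long().clamp(0, H - 1)
    idx = iy * W + ix                                                # (B, N)
    params_q = torch.gather(
        params, 1, idx.unsqueeze(-1).expand(-1, -1, params.shape[-1])
    )                                                                # (B, N, P)

    # decode each query from its coordinates/scale with its own parameters
    return mlp_apply(params_q, coords_hr)                           # (B, N, 1 or 3)
```

Is this roughly how the pixel-wise decoding works in your implementation?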
I would like to compare this design with another possible design, which is my guess at your implementation of the compared baseline named "D_mlp", i.e., using one shared MLP to decode all pixels.
The possible design (referred to as design B) first bilinearly interpolates the low-resolution feature map (produced by the attention FFC blocks) up to the high resolution (e.g., 1024×1024). Then, for each pixel at the high resolution, it takes the corresponding feature, concatenates it with the coordinates and coordinate embeddings into p (all the concatenated results for this pixel), and feeds p into a shared MLP (only one MLP is used across all pixels); the output is the predicted intensity (or RGB value for natural images).
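For clarity, here is a minimal sketch of what I mean by design B; again, the class name, layer sizes, and input shapes are my assumptions, not your implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# My sketch of design B: bilinearly upsample the low-res features, then decode every
# high-resolution pixel with a single shared MLP on [feature, coords, coord embedding].
class SharedMLPDecoder(nn.Module):
    def __init__(self, feat_dim, coord_dim, hidden=256, out_dim=1):
        super().__init__()
        # coord_dim = raw coordinate dim + coordinate embedding dim
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + coord_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, out_dim),   # out_dim=1 for intensity, 3 for RGB
        )

    def forward(self, feat_lr, coords_hr, coord_emb_hr, out_size):
        # upsample the 16x16 feature map to the target resolution, e.g. (1024, 1024)
        feat_hr = F.interpolate(feat_lr, size=out_size,
                                mode='bilinear', align_corners=False)
        B, C, H, W = feat_hr.shape
        feat_hr = feat_hr.flatten(2).transpose(1, 2)             # (B, H*W, C)
        # p = [feature, coordinates, coordinate embedding] per high-res pixel
        p = torch.cat([feat_hr, coords_hr, coord_emb_hr], dim=-1)
        return self.mlp(p)                                       # one shared MLP for all pixels
```

The only structural difference from design A, as I see it, is that all pixels share one set of MLP weights instead of each low-resolution pixel generating its own.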
I have the following questions:
1) Is design B the same as the baseline compared in Table 3, row 3, which uses "D_mlp" as the decoder? If not, what is the difference? For example, does D_mlp use the coordinates of the high-resolution pixels?
2) If design B is the same as D_mlp, why is design A much faster than design B? I would expect design B to be at least as fast as A (almost the same speed), since it only leaves out the parameter generation process (a linear layer), and its GPU memory consumption should be lower (no need to store h×w = 16×16 = 256 sets of MLP parameters).
3) If design B is not the same as D_mlp, then compared with design A, I wonder why pixel-wise MLP parameter generation is essential. As NeRF shows, a whole scene can be represented by a single MLP, so why pixel-wise MLPs? In other words, why does the pixel-wise MLP perform much better than sharing one set of parameters across all pixels? Is it due to representation ability? I don't quite understand.
4) If design B is not the same as D_mlp, what do you believe would be its biggest problem, and why did you not choose this design, if you have ever compared the two?
I'm currently diving deep into applying your design to my own work, hence these questions. Sorry to bother you with so many discussions, and thank you in advance for your reply!
If my work based on yours eventually becomes a paper, I will cite your work and acknowledge your detailed open-source project and great help!