Ysz2022 / NeRCo

[ICCV 2023] Implicit Neural Representation for Cooperative Low-light Image Enhancement
https://openaccess.thecvf.com/content/ICCV2023/html/Yang_Implicit_Neural_Representation_for_Cooperative_Low-light_Image_Enhancement_ICCV_2023_paper.html

A question about Why Neural Representation Works? #3

Open kaiq663 opened 1 year ago

kaiq663 commented 1 year ago

There is a paragraph in the paper:

> "With the trained FMLP, each feature map E can form a function FMLP(E, ·) : X → INR, which maps coordinates to its predicted RGB values. Without E, it is impossible for FMLP to depict various RGB values with the same coordinates X. Without X, we cannot normalize degradation by adjusting fitting capability, which is explained below."

I cannot quite get it. Could you explain it in plainer terms, with an example? Thanks!

Ysz2022 commented 1 year ago

Thanks for your interest in our work. This paragraph aims to explain that the feature map E is essential. Let me give you an example.

We assume that the resolution of the input image "A" is 256×128×3: the width of A is 256, the height is 128, and A contains 3 color channels, i.e., R, G, B.

As shown in the gray region of Fig. 2 in our paper, on the one hand, the encoder extracts a feature map "E" from A. Let's assume it has 64 channels, so the size of E is 256×128×64. On the other hand, we generate a coordinate map "X" of size 256×128×2, which assigns a unique 2D coordinate to each pixel in A. With FMLP, we can express the output as INR = FMLP(E, X). If we only input the coordinate map X without the feature map E, the equation becomes INR = FMLP(X), which is not feasible, as explained below.
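For concreteness, here is a minimal sketch of this setup (the layer widths, coordinate normalization, and fusion by concatenation are my assumptions, not the exact code in this repo):

```python
# Minimal sketch: build the coordinate map X for a 256x128 image and fuse it
# with a feature map E via a pixel-wise MLP (the FMLP in the paper).
import torch
import torch.nn as nn

H, W = 128, 256                        # height, width of image A
E = torch.randn(H, W, 64)              # feature map from the encoder (64 channels assumed)

# Coordinate map X: one unique (y, x) pair per pixel, normalized to [-1, 1]
ys = torch.linspace(-1, 1, H)
xs = torch.linspace(-1, 1, W)
X = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)   # (H, W, 2)

# FMLP maps each pixel's (feature, coordinate) pair to an RGB value
fmlp = nn.Sequential(
    nn.Linear(64 + 2, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 3),                 # predicted R, G, B per pixel
)
INR = fmlp(torch.cat([E, X], dim=-1))  # (H, W, 3)
```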

Now suppose we have another image "B" of the same size as A, i.e., 256×128×3, but with very different content. What happens? We also generate a coordinate map "X1", whose size is again 256×128×2. Since we use the same algorithm to generate the coordinates, the content of X1 is identical to that of X, i.e., X1 = X. But the content of B is different from A, which means our target output is very different.

Therefore, if we employed X as the sole input, i.e., INR = FMLP(X), then for two images of the same size (such as A and B) the input X would remain unchanged while the target output changes a lot. A fixed trained model FMLP cannot handle that. Hence, we need to extract a unique feature map E for each input, describing the content of the image itself. In this way, when we input A and B we obtain different features E, and since the input E has changed, FMLP can output different results.
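A quick toy check makes this concrete (reusing the coordinate construction from the sketch above):

```python
# Two same-sized images share the identical coordinate map, so X alone
# cannot tell them apart.
import torch

def coord_map(H, W):
    ys = torch.linspace(-1, 1, H)
    xs = torch.linspace(-1, 1, W)
    return torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)

X_A = coord_map(128, 256)     # coordinate map for image A
X_B = coord_map(128, 256)     # coordinate map for image B (same size)
assert torch.equal(X_A, X_B)  # same input, but A and B demand different outputs
```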

zhuyr97 commented 11 months ago

I am still confused about the 'norm function' of the neural representation. The above answer still does not seem to clarify the reason.

Ysz2022 commented 11 months ago

Thanks for your interest in our work :). If possible, please specify which statement you do not understand and I will provide a detailed answer. Here, I guess you're asking about the necessity of X and E.

Let me briefly explain. We need E to convey the specific information of the input image to FMLP for reproduction (the reasons are given above). We also need X to adjust the representation capacity of FMLP, as discussed in Sec. 3.2 of our paper.

zhuyr97 commented 11 months ago

Thanks! I have read Sec. 3.2 again. Is the 'norm function' of NRN actually observed during the experiments? Moreover, can you explain how the hyperparameter L achieves the trade-off between degradation normalization and content fidelity?

Ysz2022 commented 11 months ago

Admittedly, at first we only attempted to use neural representation to represent low-light images, and during the experiments we found that it has the property of normalizing brightness. We then tried to find the reason and attributed it to the value of L.

The representation capacity of the MLP is heavily impacted by the positional encoding, which has also been shown by [1]. We further conducted an ablation study on the value of L and found that when L is smaller than 8, the reproduced low-light images contain obvious noise and lose many details, while as L increases, the reproduced images become increasingly similar to the input images. In other words, as L changes from small to large, the MLP initially does not faithfully reproduce the brightness information of each image; we believe this is because learning a unified brightness has already reached convergence at that stage. Only when L is large enough does the MLP begin to reproduce the unique brightness of each image. Hence, L achieves the trade-off between degradation normalization and content fidelity.
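As a rough illustration of how L gates the capacity (a sketch of the standard NeRF-style sine-cosine encoding; the exact scaling in Eq. 3 of the paper may differ):

```python
# gamma() with L frequency bands: small L keeps only low frequencies, so the
# MLP can only fit smooth, "normalized" brightness; larger L adds high
# frequencies and lets it fit per-image brightness and detail.
import math
import torch

def gamma(x, L):
    """x: (..., 2) coordinates in [-1, 1] -> (..., 2 * 2 * L) encoding."""
    feats = []
    for k in range(L):
        feats.append(torch.sin((2.0 ** k) * math.pi * x))
        feats.append(torch.cos((2.0 ** k) * math.pi * x))
    return torch.cat(feats, dim=-1)

x = torch.tensor([0.25, -0.5])   # one pixel's 2D coordinate
print(gamma(x, L=4).shape)       # torch.Size([16]): 2 coords * 2 (sin, cos) * 4
print(gamma(x, L=16).shape)      # torch.Size([64]): higher L, richer encoding
```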

[1] Rahaman, N., Baratin, A., Arpit, D., Dräxler, F., Lin, M., Hamprecht, F.A., Bengio, Y., Courville, A.C.: On the spectral bias of neural networks. In: ICML (2019)

zhuyr97 commented 11 months ago

Thanks! Actually, such a brightness-normalizing effect of NR is surprising and interesting.

QiuJueqin commented 6 months ago

Same confusion here.

As stated in Eq. (2) (the L1 loss), the combination of encoder + decoder (MLP) will likely degenerate to an identity mapping, and the positional embedding will degenerate to a dummy mapping (meaning it contributes nothing). In that case, the L1 loss will be zero, since the output is exactly the input.

I cannot understand why the combination of encoder + positional embedding + MLP decoder has a "normalization" effect. Is the encoder elaborately designed to avoid degenerating to an identity mapping?
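To spell out the worry (a toy sketch, under my reading that Eq. (2) is a per-pixel L1 between the NRN output and its input):

```python
# If NRN collapses to an identity mapping, the self-reconstruction L1 loss
# is exactly zero and provides no training signal.
import torch

img = torch.rand(3, 128, 256)    # any input image
nrn_out = img.clone()            # suppose NRN degenerates to identity
l1 = (nrn_out - img).abs().mean()
print(l1)                        # tensor(0.)
```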


btw your first comment on Apr 18 only answered why the feature map E is crucial for reproducing the input, but didn't answer why the positional embedding X is necessary :)

Ysz2022 commented 6 months ago

Fine, let me address your questions one by one.

First, about the identity mapping: the positional embedding will never degenerate to a dummy mapping, since it is produced by the function γ() mentioned in Eq. 3 of our paper, which is a sine-cosine transformation, not a dummy mapping.

Second, note that we train the combination of encoder + decoder (MLP) collaboratively with the other modules, including the Enhance module, the Degrade module, etc., rather than training only a fitting function NRN. These additional losses disturb the updating of NRN. Besides, if you train only the NRN and ask it to reproduce the same image as the input, it is admittedly possible for NRN to become an identity mapping; in practice, however, that is very difficult. I have done such an experiment on low-light images and found that its outputs are meaningless dark images. You can also conduct this experiment yourself, and you are welcome to discuss the results with me :)
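As a conceptual sketch of this point (not the repo's actual training loop; `enhance_loss` and `degrade_loss` below are dummy stand-ins for the real terms):

```python
# NRN is optimized jointly with the other modules, so its weights also
# receive gradients from their losses, pushing it away from a plain identity.
import torch

img = torch.rand(3, 32, 32)
scale = torch.rand(1, requires_grad=True)    # placeholder NRN parameter
nrn_out = img * scale                        # placeholder NRN output

recon_l1 = (nrn_out - img).abs().mean()      # Eq. (2)-style reconstruction term
enhance_loss = nrn_out.var()                 # dummy stand-in
degrade_loss = (1.0 - nrn_out).abs().mean()  # dummy stand-in

total = recon_l1 + enhance_loss + degrade_loss
total.backward()   # even a perfect reconstruction is disturbed by the other terms
```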

Finally, "why positional embedding X is necessary" has been explained in Sec. 3.2 of our paper. If you still have further questions, feel free to state them specifically and I will try my best to answer :)

QiuJueqin commented 6 months ago

thanks for such a quick reply!

Have you tried to train NRN solely from scratch? Would it still learn the normalization effect? If yes, it would be a very attractive self-supervised approach that "rectifies" varying input domains. If no, I would presume that the normalization effect is just a by-product, because the subsequent operations tend to accept inputs from a narrower domain --- in that case, perhaps the choice of architecture for NRN doesn't matter?

> Finally, "why positional embedding X is necessary" has been explained in Sec. 3.2 of our paper.

I read this section again and still cannot see the necessity of X --- all the information required to reproduce, or normalize, the input has already been provided by E, so why would the decoder be encouraged to extract extra information from X?

btw, when I said the positional embedding becomes a dummy mapping, I meant that the decoder is prone to being insensitive to X; in that case X becomes a useless input and could be safely removed without affecting the final performance.

Ysz2022 commented 6 months ago

I have tried training NRN solely on low-light images from scratch, but found that its results are meaningless distributions. And if you remove the MLP and use only the encoder, this reduced NRN cannot perform such lightness normalization, so I suppose the MLP is also necessary.

Although the encoder has extracted enough information, X and the MLP are still necessary, since the combination of X and the MLP is the key to adjusting the fitting capability of NRN; this phenomenon has been studied by [1]. All pixels are produced by the MLP, which predicts RGB values pixel by pixel based on the feature E. In this process, the dimension of X greatly affects the sensitivity of the MLP to content changes in the images.
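To make the "dimension of X" point concrete (reusing the gamma() sketch from earlier in the thread; all widths are assumptions):

```python
# How L changes the per-pixel input width of the MLP when gamma(X) is
# concatenated with the 64-channel feature E from the example above.
feat_channels = 64
for L in (2, 8, 16):
    pe_dim = 2 * 2 * L                # 2 coords * (sin, cos) * L bands
    print(L, feat_channels + pe_dim)  # MLP input width: 72, 96, 128
```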

[1] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: Representing scenes as neural radiance fields for view synthesis. In: ECCV (2020)