resurgo97 opened 1 year ago
Thank you very much for your attention to our paper. I will do my best to answer all of your questions.
Our experiments on image and video representation are only intended to show the capability of our method for signal representation. Data compression, as you mentioned, is indeed a downstream task for implicit neural representations, but our algorithm is not designed to optimise compression efficiency.
Your explanation of why our algorithm performs better than others is correct. We also analysed some defects of InstantNGP's multi-resolution grids, such as hash collisions and interpolation instability. For higher-dimensional signals, our algorithm does face growing space complexity. However, for common applications such as neural radiance fields, we can still use a dense grid in three dimensions and train on a single GPU (detailed experimental results will be published in a forthcoming journal paper). Reducing the storage consumption of the grid representation will also be one of our key research directions in the future.
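For a rough sense of that scaling (my own back-of-envelope arithmetic with illustrative numbers, not figures from the paper): a one-entry-per-sample table over a dense d-dimensional grid of side R needs R^d rows.

```python
# Back-of-envelope scaling (illustrative numbers): a one-entry-per-sample
# table over a dense d-dimensional grid of side R needs R ** d rows.
R, FEAT = 128, 2
sizes = {d: R ** d * FEAT for d in (1, 2, 3)}
print(sizes)   # {1: 256, 2: 32768, 3: 4194304}
```

The jump from 2D to 3D already multiplies the stored feature count by the grid side, which is the space-complexity concern raised above.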
The hash table is a way to express our coordinate mapping. Indeed, "hash table" is not a perfect name, and "linear lookup table" might be an alternative; either way, it represents a one-to-one coordinate mapping. If L were decreased, hash collisions and interpolation problems would be inevitable, which degrades representation performance as discussed in the paper.
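Since the table is really an injective lookup, a minimal sketch of the idea (my own toy with illustrative shapes and names, not the authors' implementation) could look like:

```python
import numpy as np

# Toy sketch of the one-to-one mapping: one learnable feature row per pixel.
H, W, FEAT = 4, 4, 2                     # FEAT = 2 as in the image experiments
rng = np.random.default_rng(0)
table = rng.normal(size=(H * W, FEAT))   # L = H * W rows, one per pixel

def mapped_coordinate(pixel_index):
    """Injective lookup: pixel index -> learned low-frequency coordinate."""
    return table[pixel_index]

idx = np.arange(H * W)
feats = mapped_coordinate(idx)           # every pixel gets its own row
# Because L equals the number of signal elements, no two pixels ever
# share a row, so there are no hash collisions by construction.
assert feats.shape == (H * W, FEAT)
```

In an actual model the rows of `table` would be optimised jointly with the MLP that decodes them.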
Finally, I would like to thank you for your valuable feedback. I hope my answers are satisfactory.
Thanks for your great work! I also have a question about your paper. As you described, the table you define is a one-to-one coordinate mapping. I am curious how a single representation is able to represent disordered targets, as shown in the paper.
Hi, I appreciate your work.
I would be happy if you could answer my questions.
1. What is the advantage of DINER in learning 2D images, when the model size (hash table + MLP) gives only a little compression? For example, for a 2D RGB image, the table alone is already 2/3 the size of the original image.
2. I think the performance gain over InstantNGP in the 2D image fitting task comes from each pixel learning its own unique feature vector. How can this be applied to higher-dimensional signals? Does the complexity (= L) grow exponentially with the dimension?
3. Why is the hash table called a "hash table" when there are no hash collisions? Have you considered a design where _L_ is smaller than the number of elements of the discrete signal?
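The 2/3 figure in question 1 follows from simple counting (my own back-of-envelope, ignoring the small MLP and storage precision):

```python
# Counting stored values only (ignores the small MLP and precision):
H, W = 512, 512              # illustrative image resolution
image_values = H * W * 3     # RGB image: 3 values per pixel
table_values = H * W * 2     # full-resolution table: 2 feature values per pixel
ratio = table_values / image_values
print(ratio)                 # 0.666... -> "2/3 of the original image"
```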
I have a confusion similar to @resurgo97's. It seems that each pixel is learning its own unique feature vector. Suppose the feature vector had 3 dimensions and were equal to the RGB values at the corresponding pixel position; in that case, we would only need to learn a trivial identity MLP (whose input and output are identical) to obtain quite good results. So I think that, rather than the "hash table" itself, one contribution of your work is revealing the coordinate-related pattern distribution (like the RGB distribution over the 2D coordinate plane) of a given learned INR network.
Thanks for your comments! The proposed "hash table" is a way to enlarge the representational capacity of INR models. The experiments on fitting images and other signals are designed simply to prove this point; what matters more is how we apply it to solve inverse problems.
In fact, a signal can be represented directly by a dense grid without any neural network. But when solving inverse problems, such a representation easily falls into local optima because of the strong correlation between the parameters being optimised. Our method, DINER, achieves a good balance between representational capacity and ease of optimisation.
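The contrast can be sketched with a toy example (my own, not the paper's experiments): with direct supervision a dense grid fits trivially, while observing the grid only through a forward operator couples the gradients of neighbouring parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.random(16)                    # toy 1-D signal

# Direct representation: the optimal dense grid is simply the signal itself.
grid = target.copy()
assert np.allclose(grid, target)

# Inverse problem: observe only a blurred measurement y = A @ x.
A = np.eye(16)
A[np.arange(15), np.arange(1, 16)] = 1.0   # simple 2-tap blur operator
y = A @ target

grid = np.zeros(16)                        # re-fit the grid from y alone
loss0 = float(np.sum((A @ grid - y) ** 2))
for _ in range(500):
    # The gradient mixes neighbouring grid entries through A.T,
    # so the parameters are no longer independent.
    grid -= 0.05 * (A.T @ (A @ grid - y))
loss = float(np.sum((A @ grid - y) ** 2))
assert loss < loss0
```

This toy operator is linear (so the problem stays convex); with the nonlinear forward models of real inverse problems, the same parameter coupling is what makes a raw dense grid prone to poor local optima.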
I hope my answers above are useful, and I'm looking forward to more conversation afterwards.
I also have similar questions about your work to be published in the journal. Why are your NeRF experiments conducted only on downsampled scenes, i.e. with Blender Dataset input images at 400×400 resolution? The standard comparison should be conducted at 800×800 resolution.
Thanks for your attention to our work. We have conducted additional experiments at a resolution of 800×800, which are currently under review.
Let me say thanks for your great work! I really like your motivation and your analysis of why transforming the input coordinates so that the mapped signal becomes low-frequency leads to better reconstruction, by essentially avoiding spectral bias. However, I am having trouble understanding your method:
> The proposal of 'hash table' is a way to enlarge the capacity of INR models' representation. Experiments on fitting an image or other signal forms are just designed to prove the point above. What is more important is how we apply this to solve inverse problems.
The way I understand it, the "hash table" is more or less "saving" the actual output to some degree. Since its feature dimension per pixel is two, we can think of it as saving the first two channels of the image, while the third channel is computed by the remaining MLP. I agree with @Albert2X on this one. In fact, after running your image-fitting code, I saved the "hash table" as an image and got the following results (which confirm my suspicion):
[Original image]
The hash table has dimensions (H×W)×2, so I made two separate images showing the first and second elements: [First element] [Second element]
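The visualisation described above can be sketched roughly as follows (my own code with illustrative names, assuming per-channel min-max normalisation before saving):

```python
import numpy as np

# Sketch of the visualisation: reshape each channel of the learned
# (H*W) x 2 table into an H x W grayscale image.
H, W = 4, 4                             # toy resolution
rng = np.random.default_rng(0)
table = rng.normal(size=(H * W, 2))     # stand-in for the learned table

def channel_to_image(t, c):
    """Reshape channel c to H x W and min-max normalise it to [0, 255]."""
    ch = t[:, c].reshape(H, W)
    ch = (ch - ch.min()) / (ch.max() - ch.min())
    return (ch * 255).astype(np.uint8)

first = channel_to_image(table, 0)      # "First element" image
second = channel_to_image(table, 1)     # "Second element" image
assert first.shape == (H, W)
```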
If I used the original coordinates without the hash table and fed them to the network, I got:
> In fact, a signal can be represented directly by a dense grid without neural networks. But for solving inverse problems, such a representation can easily fall into local optimality due to the strong correlation between the parameters to be optimised. Our method DINER achieves a good balance between the capacity of representation and the ability of optimization.
Could you please elaborate more on this? Thanks