resurgo97 opened 1 year ago
Thank you very much for your attention to our paper. I will do my best to answer all of your questions.
Our experiments on image and video representation are only intended to show the capability of our method for signal representation. Data compression, as you mentioned, is indeed a downstream task for implicit neural representations, but our algorithm is not designed to optimise compression efficiency.
Your explanation of why our algorithm performs better than others is correct. We also analysed some defects of InstantNGP's multi-resolution grids, such as hash collisions and interpolation instability. For higher-dimensional signals, our algorithm does face growing space complexity. However, for common applications such as neural radiance fields, we can still use a dense grid in three dimensions and train on a single GPU (detailed experimental results will be published in a forthcoming journal paper). Reducing the storage consumption of the grid representation will also be one of our key research directions in the future.
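For a rough sense of that scaling (my own back-of-envelope arithmetic with illustrative numbers, not figures from the paper): a one-entry-per-sample table over a dense d-dimensional grid of side R needs R^d rows.

```python
# Back-of-envelope scaling (illustrative numbers): a one-entry-per-sample
# table over a dense d-dimensional grid of side R needs R ** d rows.
R, FEAT = 128, 2
sizes = {d: R ** d * FEAT for d in (1, 2, 3)}
print(sizes)   # {1: 256, 2: 32768, 3: 4194304}
```

The jump from 2D to 3D already multiplies the stored feature count by the grid side, which is the space-complexity concern raised above.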
The hash table is a way to express our coordinate mapping. Indeed, "hash table" is not a perfect name, and "linear lookup table" might be an alternative; either way, it represents a one-to-one coordinate mapping. If L were decreased, hash collisions and interpolation problems would be inevitable, which degrades representation performance as discussed in the paper.
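Since the table is really an injective lookup, a minimal sketch of the idea (my own toy with illustrative shapes and names, not the authors' implementation) could look like:

```python
import numpy as np

# Toy sketch of the one-to-one mapping: one learnable feature row per pixel.
H, W, FEAT = 4, 4, 2                     # FEAT = 2 as in the image experiments
rng = np.random.default_rng(0)
table = rng.normal(size=(H * W, FEAT))   # L = H * W rows, one per pixel

def mapped_coordinate(pixel_index):
    """Injective lookup: pixel index -> learned low-frequency coordinate."""
    return table[pixel_index]

idx = np.arange(H * W)
feats = mapped_coordinate(idx)           # every pixel gets its own row
# Because L equals the number of signal elements, no two pixels ever
# share a row, so there are no hash collisions by construction.
assert feats.shape == (H * W, FEAT)
```

In an actual model the rows of `table` would be optimised jointly with the MLP that decodes them.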
Finally, I would like to thank you for your valuable feedback. I hope my answers are satisfactory.
Thanks for your great work! I also have a question about your paper. As you described, the table you define is a one-to-one coordinate mapping. I am curious how a single representation is able to represent disordered targets, as shown in the paper.
Hi, I appreciate your work.
I would be happy if you could answer my questions.
1. What is the advantage of DINER in learning 2D images, when the model size (hash table + MLP) gives only a little compression? For example, for a 2D RGB image, the table alone is already 2/3 the size of the original image.
2. I think the performance gain over InstantNGP in the 2D image fitting task comes from each pixel learning its own unique feature vector. How can this be applied to higher-dimensional signals? Does the complexity (= L) grow exponentially with the dimension?
3. Why is the hash table called a "hash table" when there are no hash collisions? Have you considered a design where _L_ is smaller than the number of elements of the discrete signal?
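The 2/3 figure in question 1 follows from simple counting (my own back-of-envelope, ignoring the small MLP and storage precision):

```python
# Counting stored values only (ignores the small MLP and precision):
H, W = 512, 512              # illustrative image resolution
image_values = H * W * 3     # RGB image: 3 values per pixel
table_values = H * W * 2     # full-resolution table: 2 feature values per pixel
ratio = table_values / image_values
print(ratio)                 # 0.666... -> "2/3 of the original image"
```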
I have a confusion similar to @resurgo97's. It seems that each pixel is learning its own unique feature vector. Suppose the feature vector had 3 dimensions and were equal to the RGB values at the corresponding pixel position; in that case, we would only need to learn a trivial identity MLP (whose input and output are identical) to obtain quite good results. So I think that, rather than the "hash table" itself, one contribution of your work is revealing the coordinate-related pattern distribution (like the RGB distribution over the 2D coordinate plane) of a given learned INR network.
Thanks for your comments! The proposed "hash table" is a way to enlarge the representational capacity of INR models. The experiments on fitting images and other signals are designed simply to prove this point; what matters more is how we apply it to solve inverse problems.
In fact, a signal can be represented directly by a dense grid without any neural network. But when solving inverse problems, such a representation easily falls into local optima because of the strong correlation between the parameters being optimised. Our method, DINER, achieves a good balance between representational capacity and ease of optimisation.
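The contrast can be sketched with a toy example (my own, not the paper's experiments): with direct supervision a dense grid fits trivially, while observing the grid only through a forward operator couples the gradients of neighbouring parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.random(16)                    # toy 1-D signal

# Direct representation: the optimal dense grid is simply the signal itself.
grid = target.copy()
assert np.allclose(grid, target)

# Inverse problem: observe only a blurred measurement y = A @ x.
A = np.eye(16)
A[np.arange(15), np.arange(1, 16)] = 1.0   # simple 2-tap blur operator
y = A @ target

grid = np.zeros(16)                        # re-fit the grid from y alone
loss0 = float(np.sum((A @ grid - y) ** 2))
for _ in range(500):
    # The gradient mixes neighbouring grid entries through A.T,
    # so the parameters are no longer independent.
    grid -= 0.05 * (A.T @ (A @ grid - y))
loss = float(np.sum((A @ grid - y) ** 2))
assert loss < loss0
```

This toy operator is linear (so the problem stays convex); with the nonlinear forward models of real inverse problems, the same parameter coupling is what makes a raw dense grid prone to poor local optima.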
I hope my answers above are useful, and I'm looking forward to more conversation afterwards.
I also have similar questions about your work to be published in the journal. Why are your NeRF experiments conducted only on downsampled scenes, i.e. with Blender Dataset input images at 400×400 resolution? The standard comparison should be conducted at 800×800 resolution.
Thanks for your attention to our work. We have conducted additional experiments at a resolution of 800×800, which are currently under review.
Let me say thanks for your great work! I really like your motivation and your analysis of why transforming the input coordinates so that the mapped signal becomes low-frequency leads to better reconstruction, by essentially avoiding spectral bias. However, I am having trouble understanding your method:
> The proposal of 'hash table' is a way to enlarge the capacity of INR models' representation. Experiments on fitting an image or other signal forms are just designed to prove the point above. What is more important is how we apply this to solve inverse problems.
The way I understand it, the "hash table" is more or less "saving" the actual output to some degree. Since its feature dimension per pixel is two, we can think of it as saving the first two channels of the image, while the third channel is computed by the remaining MLP. I agree with @Albert2X on this one. In fact, after running your image-fitting code, I saved the "hash table" as an image and got the following results (which confirm my suspicion):
[Original image]
The hash table has dimensions (H×W)×2, so I made two separate images showing the first and second elements: [First element] [Second element]
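The visualisation described above can be sketched roughly as follows (my own code with illustrative names, assuming per-channel min-max normalisation before saving):

```python
import numpy as np

# Sketch of the visualisation: reshape each channel of the learned
# (H*W) x 2 table into an H x W grayscale image.
H, W = 4, 4                             # toy resolution
rng = np.random.default_rng(0)
table = rng.normal(size=(H * W, 2))     # stand-in for the learned table

def channel_to_image(t, c):
    """Reshape channel c to H x W and min-max normalise it to [0, 255]."""
    ch = t[:, c].reshape(H, W)
    ch = (ch - ch.min()) / (ch.max() - ch.min())
    return (ch * 255).astype(np.uint8)

first = channel_to_image(table, 0)      # "First element" image
second = channel_to_image(table, 1)     # "Second element" image
assert first.shape == (H, W)
```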
If I used the original coordinates without the hash table and fed them to the network, I got:
> In fact, a signal can be represented directly by a dense grid without neural networks. But for solving inverse problems, such a representation can easily fall into local optimality due to the strong correlation between the parameters to be optimised. Our method DINER achieves a good balance between the capacity of representation and the ability of optimization.
Could you please elaborate more on this? Thanks