Closed: qsobad closed this issue 3 years ago
We have not yet tested with the sin activation functions proposed in the SIREN paper. Please let us know if you find them beneficial.
If it works as claimed in the paper, the sin activations could replace positional encoding entirely, possibly reducing the number of input channels here.
@qsobad they recently also researched on other encodings, the results are compelling, you can compare to siren: https://github.com/tancik/fourier-feature-networks
@kwea123 thanks! I read it, and the Gaussian encoding seems fine so far, but I still couldn't get SIREN to work and reproduce the results it claims. It does not converge all the time, and even when it converges it is no better than p.e. I am still investigating what goes wrong.
@qsobad What is your model structure? I also tried SIREN in NeRF and experienced the same thing as you: its loss is higher and it seems to overfit (high validation loss). I tried 7 layers for xyz -> sigma and 7 layers for xyz+dir -> rgb. Maybe I should try a model similar to NeRF's (using intermediate inputs).
I tried a lot of things, with different initializations as well. From the results of my trials, I guess it's nothing to do with the number of layers or the skip connections.
Actually SIREN and the Gaussian encoding share a lot in common: the normal initialization and the Fourier features. My understanding is that in SIREN the weights are the frequencies, while the Gaussian encoding picks the frequencies explicitly; that's why SIREN is harder to control (correct me if I am wrong). That's the difference I see; I just can't make the best use of SIREN's freedom so far.
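For reference, a minimal NumPy sketch of the Gaussian Fourier feature mapping from the fourier-feature-networks repo linked above; the `sigma` value and feature count here are illustrative, not the repo's defaults:

```python
import numpy as np

def gaussian_fourier_features(x, num_freqs=64, sigma=16.0, rng=None):
    """Map points to [sin(2*pi*Bx), cos(2*pi*Bx)] with B ~ N(0, sigma^2).

    In SIREN the (learned) first-layer weights play the role of B,
    while here the frequencies are sampled once and then kept fixed.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    B = rng.normal(0.0, sigma, size=(num_freqs, x.shape[-1]))  # random directions and scales
    proj = 2.0 * np.pi * x @ B.T                               # (N, num_freqs)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

pts = np.random.rand(5, 3)           # batch of xyz points
feat = gaussian_fourier_features(pts)
print(feat.shape)  # (5, 128)
```

Because the frequencies are fixed rather than trained, tuning reduces to choosing `sigma` and `num_freqs`, which matches the observation above that the Gaussian encoding is easier to control than SIREN.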
Have you tried the Gaussian embedding? I don't find it better either... I modify only the embedding in this repo, so the model structure is a lot more complicated than the experiments in the paper, which are oversimplified (no coarse-to-fine, no direction embedding...). I tuned lots of parameters, like the standard deviation and the embedding size, but none of them helps.
It makes me wonder if the Gaussian only works in simple settings like those in the paper. There it improves the PSNR from 24 to 25... but in reality 25 is really blurry too. Another possibility is that the embedding is not "friendly" to NeRF's volume rendering; in plain image reconstruction the scores are a lot higher than in NeRF.
In my case, I use 64 frequencies in the Gaussian embedding, and I observe clearer edges (less blur); however, there is a lot of noise (black dots).
What scene do you use? And what is the PSNR score?
I tested on a forward-facing real scene. My current finding is that instead of using `sin(ax+by+cz)` as in the paper (which degrades the result considerably), using the combination `sin(ax)`, `sin(by)`, `sin(cz)` gives me a better PSNR than positional encoding. But I have only tested on one scene, so I don't know if it generalizes.
I use my own scene (a 360° circular real scene); the score is worse than PE. Mine is:

```python
for freq in self.freq_bands_gaussian:
    for func in self.funcs:
        out += [func(torch.einsum("ij,j->ij", x, freq))]
```

I guess it is the same as `sin(ax)`, `sin(by)`, `sin(cz)`, or not?
Is your scene forward facing? I'm wondering if it's the NDC part that poses a problem. NDC maps the whole space into [-1, 1]^3, but the actual scale is different for x, y, and z (especially z), so it would be reasonable to separate the components. Their experiment is on lego, which doesn't have this problem.
I guess it's the same, if your `freq` is a vector3 with different components (a, b, c). The `einsum` would be just `x * freq`, by the way.
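A quick NumPy check (same semantics as the `torch.einsum` call above) that `einsum("ij,j->ij", x, freq)` is plain broadcasting, and that it yields the separated per-axis features rather than the mixed one; shapes and frequency values are illustrative:

```python
import numpy as np

x = np.random.rand(4, 3)          # batch of (x, y, z) points
freq = np.array([2.0, 5.0, 9.0])  # one frequency triplet (a, b, c)

# "ij,j->ij" scales each column j of x by freq[j] -- identical to broadcasting
per_axis = np.einsum("ij,j->ij", x, freq)
assert np.allclose(per_axis, x * freq)

separated = np.sin(per_axis)          # sin(a*x), sin(b*y), sin(c*z): three features
mixed = np.sin(per_axis.sum(axis=-1)) # sin(a*x + b*y + c*z): one mixed feature
print(separated.shape, mixed.shape)   # (4, 3) (4,)
```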
Oh, I said it wrong: the PSNR score is better with the Gaussian embedding, but it uses 64 frequencies while PE uses 10.
My x is a vec3 and freq is a vec3, so the features are sin(a1·x), sin(b1·y), sin(c1·z), sin(a2·x), sin(b2·y), sin(c2·z), ...
Yes, it's the same as what I find to be better. Have you tried `sin(ax+by+cz)` before? That is their original implementation.
I haven't, but I am more worried about the noise. At first I thought it was due to overfitting; however, I did a grid search, and even with a small number of frequencies or a small scale, it is still noisy.
Maybe implementing an extra denoiser would solve the problem; I don't know yet.
What are your optimal parameters? By the way, I also find uniform sampling in [0, M] better... since large frequencies are indispensable (but not too many, that degrades performance too), while the Gaussian samples more low frequencies around 0. I use uniform sampling in [0, 128] for the frequencies, and I find 32 samples already sufficient (better than PE), although that makes 3 × 2 × 32 = 192 total features.
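A small sketch of the uniform-in-[0, M] variant described above, using the quoted numbers (M = 128, 32 samples, sin and cos per axis, so 3 × 2 × 32 = 192 features); the function name and defaults are my own:

```python
import numpy as np

def uniform_freq_embedding(x, num_freqs=32, max_freq=128.0, rng=None):
    # Sample frequencies uniformly in [0, max_freq] instead of N(0, sigma^2),
    # so high frequencies are as likely as low ones.
    rng = np.random.default_rng(0) if rng is None else rng
    freqs = rng.uniform(0.0, max_freq, size=(num_freqs, x.shape[-1]))
    scaled = x[..., None, :] * freqs        # (N, num_freqs, 3), per-axis scaling
    feats = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return feats.reshape(x.shape[0], -1)    # (N, 3 * 2 * num_freqs)

pts = np.random.rand(5, 3)
emb = uniform_freq_embedding(pts)
print(emb.shape)  # (5, 192)
```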
64 and 128 frequencies (L_xyz, i.e. `3 + 3*64*2` features) give similar results, so I picked 64. The scale works best around 16.
You're right. It turns out that the Gaussian embedding is very similar to uniform posenc, and better than log-scale posenc, when an equal number of frequencies is used.
I guess it is due to the Fourier series: f(x) = a0/2 + Σ_n (a_n cos(nx) + b_n sin(nx)).
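That intuition can be checked numerically: with enough sine terms a partial Fourier sum reconstructs a signal arbitrarily well away from discontinuities. A quick illustration with the classic square-wave series (my own example, not from the thread):

```python
import numpy as np

# Square wave equal to 1 on (0, pi), with Fourier series 4/pi * sum_{odd n} sin(n*x)/n.
x = np.linspace(0.3, np.pi - 0.3, 100)  # stay away from the jump discontinuities
approx = np.zeros_like(x)
for n in range(1, 400, 2):              # more frequencies -> better reconstruction
    approx += 4.0 / np.pi * np.sin(n * x) / n

err = np.max(np.abs(approx - 1.0))
print(err)  # small: the partial sum is close to 1 on (0, pi)
```

Truncating the loop at a small `n` mimics using too few frequencies in the embedding: the low-frequency shape is right but fine detail is lost.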
It is awesome to see all of the experimentation; there are a number of great observations. Our current view is that using random Fourier features is the best approach in general (no priors on the scene). However, if the scene is well aligned to an axis, the positional encoding presented in NeRF will often perform better, since its frequency components are all axis aligned. This bias can be observed in scenes like "lego", which is aligned to the axes and performs better with positional encoding. We also observe that for both methods it is beneficial to increase the number of features for improved accuracy (note that this also increases the computational cost).
@tancik Thanks for the reply. How do you tell whether a scene is "axis-aligned"? For the lego scene it is somewhat easy, but for a real scene like "fern", how can you tell in advance? Or do we still need to run experiments before we can decide which encoding is better?
Your intuition is correct: the axis-alignment bias is almost exclusively limited to the synthetic scenes. For this reason, random Fourier features are a good default for real scenes. In the ideal case you would sample the frequencies and directions to match the target scene. We leave these extensions to future work.
I did some more experiments, and I have some observations that don't fit the explanation. I ran them on a single real scene, so I guess it's not axis aligned.
Score: posenc > prime > gaussian (the curves fit perfectly, posenc just around 0.25 higher). (The noise I mentioned previously was due to too high a noise scale; I turned it down and it does not affect the experiments here.)
By the way, my trials with SIREN show a higher score but a lot more visible artifacts and instability, so I am not going to test it further.
@qsobad Can you share your final SIREN structure? I tried many different parameters (number of features, layers, frequencies) but none of them gives me any better results... The scores are all 5 points below PE...
I feel that it might require some additional losses to converge. In the SDF experiment in the paper, I tried disabling some of the losses and the result doesn't look good anymore.
It's very similar to posenc:

```
xyz -> 8 siren layers -> 1 linear + relu -> sigma
xyz -> 8 siren layers -> 1 linear -> 1 siren layer -> 1 linear + sigmoid -> rgb
```

The score is not much higher though.
What do you mean by disabling some losses?
@kwea123 How do you know if the lego dataset is axis aligned?
@qsobad Disabling some losses makes it worse, so I think maybe the reason it doesn't work for NeRF is that the L2 reconstruction loss alone is not enough for it to learn; we need some other losses. @sally-chen Look at the image to see whether the lego car is parallel to the axes; this is very subjective and not applicable to real scenes.
@kwea123 Thanks for answering! Just wondering: if many frequencies are included in positional encoding, why is it still limited to axis-aligned scenes? (i.e., don't different frequencies represent different axes? Where does the limitation come from?)
PE has only features like `sin(ax)`, so the effects of x, y, z are independent. In order to capture non-axis-aligned features, one would need features like `sin(ax+by+cz)`, which mix x, y, z together.
So that would be like a Fourier feature with another direction (something like [a, b, c])?
I see, I did not think from this perspective at all! This is random vectors in frequency space vs. on-axis vectors. Thanks a lot!
@qsobad Yes, exactly; it would have the direction (a, b, c).
@qsobad How did you initialize the weights for the SIREN layers? According to the paper, the weights are drawn from the uniform distribution U(-sqrt(6/n), sqrt(6/n)) (where n is the number of neurons in the previous layer), and for the first layer they use a modified activation where the argument Wx of the sin function is multiplied by 30.
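To make the scheme described above concrete, a minimal NumPy sketch (layer sizes are arbitrary; the paper's reference implementation has additional per-layer details not reproduced here):

```python
import numpy as np

def siren_init(fan_in, fan_out, rng=None):
    # W ~ U(-sqrt(6/n), sqrt(6/n)) with n = fan_in, as described above.
    rng = np.random.default_rng(0) if rng is None else rng
    bound = np.sqrt(6.0 / fan_in)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

def first_siren_layer(x, W, b, omega_0=30.0):
    # First layer: sin(omega_0 * (x W + b)), with the omega_0 = 30 factor from the paper.
    return np.sin(omega_0 * (x @ W + b))

W = siren_init(3, 256)                 # xyz input -> 256 hidden units
b = np.zeros(256)
h = first_siren_layer(np.random.rand(4, 3), W, b)
print(h.shape)  # (4, 256)
```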
If you use the `i_embed -1` option to skip input encoding, in what range will the inputs to the network lie?
Will it get better if we use the new SIREN activation function?