bmild / nerf

Code release for NeRF (Neural Radiance Fields)
http://tancik.com/nerf
MIT License

siren #60

Closed qsobad closed 3 years ago

qsobad commented 4 years ago

Will it get better if we use the new SIREN activation function?

tancik commented 4 years ago

We have not yet tested with the sin activation functions proposed in the SIREN paper. Please let us know if you find them beneficial.

qsobad commented 4 years ago

If it works as claimed in the paper, it should be possible to replace the positional encoding + ReLU with sine activations, and so possibly reduce the number of input channels here.

kwea123 commented 4 years ago

@qsobad They also recently researched other encodings, and the results are compelling; you can compare them with SIREN: https://github.com/tancik/fourier-feature-networks

qsobad commented 4 years ago

@kwea123 Thanks! I read it, and the Gaussian encoding seems fine so far, but I still couldn't get SIREN to work and reproduce the results it claims. It does not always converge, and even when it converges it is no better than p.e. (positional encoding). I am still investigating what goes wrong.

kwea123 commented 4 years ago

@qsobad What is your model structure? I also tried SIREN in NeRF and experienced the same thing as you: its loss is higher and it seems to overfit (high validation loss). I tried 7 layers for xyz -> sigma and 7 layers for xyz+dir -> rgb. Maybe I should try a model similar to NeRF's (using intermediate inputs).

qsobad commented 4 years ago

I tried a lot of things, with different initializations as well. From the results of my trials, I guess it's not about the number of layers or the skip connections.

Actually SIREN and the Gaussian encoding share a lot in common: the normal initialization and the Fourier features. My understanding is that in SIREN the weights are the frequencies, while the Gaussian encoding picks the frequencies explicitly; that's why SIREN is harder to control (correct me if I am wrong). That's the difference I see; so far I just can't make the best use of the freedom SIREN offers.

kwea123 commented 4 years ago

Have you tried the Gaussian embedding? I don't find it better either... I modify only the embedding in this repo, so the model structure is a lot more complicated than in the experiments of the paper, which are oversimplified (no coarse-to-fine, no direction embedding...). I tuned lots of parameters, like the standard deviation and the embedding size, but none of them helps.

It makes me wonder whether the Gaussian encoding only works in simple settings like those in the paper. There it improves the PSNR from 24 to 25... but in reality 25 is really blurry too. Another possibility is that the embedding is not "friendly" to NeRF's volume rendering; in plain image reconstruction the scores are a lot higher than in NeRF.
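For readers following along, here is a minimal sketch of the Gaussian Fourier-feature embedding being discussed (my paraphrase of the fourier-feature-networks idea, not code from this repo); `sigma` and `n_freqs` are the knobs being tuned above:

```python
import math
import torch

def gaussian_fourier_features(x, B):
    """Map points x (N, 3) to features [sin(2*pi*x@B), cos(2*pi*x@B)] of size (N, 2*n_freqs)."""
    proj = 2 * math.pi * x @ B          # each column of B mixes x, y, z
    return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

# B is sampled once and then kept fixed; sigma controls the frequency bandwidth
n_freqs, sigma = 64, 16.0               # illustrative values, not tuned
B = torch.randn(3, n_freqs) * sigma
features = gaussian_fourier_features(torch.rand(1024, 3), B)  # (1024, 128)
```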

qsobad commented 3 years ago

In my case, I use 64 frequencies in the Gaussian encoding, and I observe clearer edges (no blur); however it has a lot of noise (black dots).

kwea123 commented 3 years ago

What scene do you use? And what is the PSNR score?

I tested on a forward-facing real scene. My current finding is that instead of using sin(ax+by+cz) as in the paper (which degrades the result considerably), the combination sin(ax), sin(by), sin(cz) gives me a better PSNR (better than the positional encoding). But I have only tested on one scene, so I don't know if it generalizes.
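To make the two variants concrete, here is a minimal sketch (my reading, not code from either repo); the scale 16.0 is just an illustrative value:

```python
import math
import torch

x = torch.rand(1024, 3)                      # sample points
n_freqs = 64

# mixed: sin(a*x + b*y + c*z), one random direction per feature (as in the paper)
B = torch.randn(3, n_freqs) * 16.0           # columns are (a, b, c) triples
mixed = torch.sin(2 * math.pi * x @ B)       # (1024, 64)

# per-axis: sin(a*x), sin(b*y), sin(c*z), coordinates kept separate
freqs = torch.randn(n_freqs, 3) * 16.0       # one (a, b, c) triple per frequency
per_axis = torch.sin(2 * math.pi * x.unsqueeze(1) * freqs).flatten(1)  # (1024, 192)
```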

qsobad commented 3 years ago

I use my own scene (a 360° circular real scene); the score is worse than PE. Mine is:

```python
for freq in self.freq_bands_gaussian:   # each freq is a length-3 vector
    for func in self.funcs:             # e.g. (torch.sin, torch.cos)
        out += [func(torch.einsum("ij,j->ij", x, freq))]
```

I guess it is the same as sin(ax), sin(by), sin(cz). Or not?

kwea123 commented 3 years ago

Is your scene forward facing? I'm wondering if it's the NDC part that poses a problem. NDC maps the whole space into [-1, 1]^3, but the actual scale is different for x, y and z (especially z), so it would be reasonable to separate the components. Their experiment is on lego, which doesn't have this problem.

I guess it's the same, if your freq is a vec3 with different components (a, b, c). By the way, the einsum is just x * freq.

qsobad commented 3 years ago

Oh, I said it wrong: the PSNR score is better with the Gaussian encoding, but the Gaussian uses 64 frequencies while PE uses 10.

My x is a vec3 and each freq is a vec3, so the features are sin(a1·x), sin(b1·y), sin(c1·z), sin(a2·x), sin(b2·y), sin(c2·z), ...

kwea123 commented 3 years ago

Yes, it's the same as what I find to be better. Have you tried sin(ax+by+cz) before, which is their original implementation?

qsobad commented 3 years ago

I haven't, but I am more worried about the noise. At first I thought it was due to overfitting; however, I did a grid search, and even with a small number of frequencies or a small scale it is still noisy.

Maybe implementing an extra denoiser would solve the problem; I don't know yet.

kwea123 commented 3 years ago

What are your optimal parameters? By the way, I also find uniform sampling in [0, M] better... Large frequencies are indispensable (but not too many of them, that degrades the performance too), whereas the Gaussian samples more low frequencies around 0. I use uniform sampling in [0, 128] for the frequencies and find that 32 samples are already sufficient (better than PE), though that makes 3x2x32 = 192 total features.
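In code, the uniform variant described above might look like this (a sketch of my reading, with per-axis sin/cos features and frequencies drawn uniformly from [0, 128]):

```python
import math
import torch

n_freqs, M = 32, 128.0
freqs = torch.rand(n_freqs, 3) * M    # uniform in [0, M], sampled once then fixed

def uniform_fourier_features(x):
    """x: (N, 3) -> (N, 192) features (3 axes x sin/cos x 32 frequencies)."""
    proj = 2 * math.pi * x.unsqueeze(1) * freqs            # (N, 32, 3)
    return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1).flatten(1)
```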

qsobad commented 3 years ago

64 and 128 frequencies (L_xyz, i.e. 3 + 3 x 64 x 2 input channels for 64) give similar results, so I picked 64. The scale is better at around 16.

qsobad commented 3 years ago

> By the way, I also find uniform sampling in [0, M] better... I use uniform sampling in [0, 128] for the frequencies and find that 32 samples are already sufficient (better than PE).

You're right. It turns out that the Gaussian encoding is very similar to a uniform posenc, and better than the log-scale posenc, when an equal number of frequencies is used.

qsobad commented 3 years ago

I guess it is because of the Fourier series: f(x) = a_0/2 + sum_n (a_n sin(nx) + b_n cos(nx)). The basis frequencies n = 1, 2, 3, ... are evenly spaced, which is exactly what a uniform sampling approximates.
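Written out (the standard identity, stated here for reference):

```latex
f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \left( a_n \sin(nx) + b_n \cos(nx) \right)
```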

tancik commented 3 years ago

It is awesome to see all of the experimentation; there are a number of great observations. Our current view is that using random Fourier features is the best approach in general (no priors on the scene). However, if the scene is well aligned to an axis, the positional encoding presented in NeRF will often perform better, since its frequency components are all axis-aligned. This bias can be observed in scenes like "lego", which is aligned to the axes and performs better with positional encoding. We also observe that for both methods it is beneficial to increase the number of features for improved accuracy (note that this also increases the computational cost).

kwea123 commented 3 years ago

@tancik Thanks for reaching out. How do you tell whether a scene is "axis-aligned"? For the lego scene it is somewhat easy, but for a real scene like "fern", how can you tell in advance whether it's axis aligned? Or do we still need to run experiments before we can decide which encoding is better?

tancik commented 3 years ago

Your intuition is correct that the axis-alignment bias is almost exclusively limited to the synthetic scenes. For this reason, random Fourier features are a good default for real scenes. In the ideal case you would sample the frequencies and directions to match the target scene. We leave these extensions to future work.

qsobad commented 3 years ago

I did some more experiments, and I have some observations that don't fit the explanation. I ran them on a real scene (only one), so I guess it's not axis aligned.

  1. I found that the Gaussian encoding has more observable defects even though it looks smoother, and it actually has a lower score too.
  2. I used prime numbers as frequencies as well; they show details more quickly, but the score is lower than posenc's (I guess because of the missing frequencies; the effect is somewhat similar to a scaled posenc, e.g. 2, 4, 6, 8...).

Score: posenc > prime > gaussian (the curves fit perfectly, just around 0.25 apart). (The noise I mentioned previously was due to a too-high noise scale; I tuned it down, and it does not affect the experiments here.)

By the way, my trials with SIREN show a higher score but a lot more visible defects and instability, so I am not going to test it further.

kwea123 commented 3 years ago

@qsobad Can you share your final SIREN structure? I tried many different parameters (different numbers of features, layers, and frequencies), but none of them gives me any better results... The scores are all 5 points lower than PE's...

I feel that it might require some additional losses to make it converge. For example, in the SDF experiment in the paper, when I tried disabling some of the losses it didn't look good anymore.

qsobad commented 3 years ago

It's very similar to posenc:

xyz -> 8 SIREN layers -> 1 linear + ReLU -> sigma
xyz -> 8 SIREN layers -> 1 linear -> 1 SIREN layer -> 1 linear + sigmoid -> rgb

The score is not much higher though.
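A minimal sketch of one reading of this structure (hypothetical names; assuming the two branches share the 8-layer trunk as in NeRF's posenc model, and leaving out the SIREN-specific initialization, which is discussed further below):

```python
import torch
import torch.nn as nn

class Sine(nn.Module):
    """Sine activation: sin(w0 * x), with w0 = 30 as in the SIREN paper."""
    def __init__(self, w0=30.0):
        super().__init__()
        self.w0 = w0
    def forward(self, x):
        return torch.sin(self.w0 * x)

def siren_layer(n_in, n_out):
    return nn.Sequential(nn.Linear(n_in, n_out), Sine())

W = 256
trunk = nn.Sequential(siren_layer(3, W), *[siren_layer(W, W) for _ in range(7)])
sigma_head = nn.Sequential(nn.Linear(W, 1), nn.ReLU())       # 1 linear + ReLU -> sigma
rgb_head = nn.Sequential(nn.Linear(W, W), siren_layer(W, W),
                         nn.Linear(W, 3), nn.Sigmoid())      # -> rgb

x = torch.rand(1024, 3)       # raw xyz, no positional encoding
h = trunk(x)
sigma, rgb = sigma_head(h), rgb_head(h)
```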

What do you mean by disabling some losses?

sally-chen commented 3 years ago

@kwea123 How do you know if the lego dataset is axis aligned?

kwea123 commented 3 years ago

@qsobad Disabling some losses makes it worse, so I think the reason it doesn't work for NeRF may be that the reconstruction L2 loss alone is not enough to make it learn; we need some other losses.

@sally-chen Look at the image to see whether the lego car is parallel to the axes; this is very subjective and not applicable to real scenes.

sally-chen commented 3 years ago

@kwea123 Thanks for answering! Just wondering: if many frequencies are included in the positional encoding, why is it still limited to axis-aligned scenes? (i.e., don't different frequencies represent different axes? Where does the limitation come from?)

kwea123 commented 3 years ago

PE only has features like sin(ax), so the effects of x, y and z are independent. To capture non-axis-aligned features, one would need features like sin(ax + by + cz), which mix x, y and z together.

qsobad commented 3 years ago

> PE only has features like sin(ax), so the effects of x, y and z are independent. To capture non-axis-aligned features, one would need features like sin(ax + by + cz), which mix x, y and z together.

So that would be like a Fourier feature along another direction (something like [a, b, c])?

sally-chen commented 3 years ago

I see, I did not think about it from this perspective at all! This is random vectors in frequency space vs. on-axis vectors. Thanks a lot!

krikru commented 3 years ago

@qsobad Yes, exactly; it would have the direction (a, b, c).

krikru commented 3 years ago

@qsobad How did you initialize the weights for the SIREN layers? According to the paper, they propose to draw the weights from the distribution U(-sqrt(6/n), sqrt(6/n)) (where n is the number of neurons in the previous layer), and for the first layer they use a modified activation function where they multiply Wx by 30 inside the sin.
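For reference, a minimal sketch of how the official SIREN implementation realizes this (my paraphrase, not code from this repo): the hidden-layer bound is divided by w0 so that the pre-activation w0·Wx keeps the U(-sqrt(6/n), sqrt(6/n)) spread described in the paper.

```python
import math
import torch
import torch.nn as nn

class SirenLinear(nn.Module):
    """Linear layer followed by sin(w0 * (Wx + b)), with the SIREN init."""
    def __init__(self, n_in, n_out, w0=30.0, is_first=False):
        super().__init__()
        self.linear = nn.Linear(n_in, n_out)
        self.w0 = w0
        with torch.no_grad():
            if is_first:
                # first layer: U(-1/n, 1/n); the w0 factor is applied in forward
                bound = 1.0 / n_in
            else:
                # hidden layers: U(-sqrt(6/n)/w0, sqrt(6/n)/w0), so that
                # w0 * Wx has the U(-sqrt(6/n), sqrt(6/n)) spread from the paper
                bound = math.sqrt(6.0 / n_in) / w0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))
```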

krikru commented 3 years ago

If you use the `i_embed = -1` option to skip the input encoding, in what range will the inputs to the network lie?