Open nogu-atsu opened 2 years ago
@nogu-atsu Hi, I think this is caused by the default setting force_all_rays = False in training, which may ignore some rays for acceleration (that is what triggers this). You should set this option to True if you want to make sure all rays are calculated.
@ashawkey Thank you for the prompt response. I tried force_all_rays = True, but the problem remained. The force_all_rays handling happens after the rays buffer has been written out, so I guess it does not affect the contents of rays.
@ashawkey I wonder why force_all_rays exists in the first place. Some rays are ignored for acceleration (less data is certainly faster), but why do this at all? In my experiments it makes training unstable (the loss curve zigzags), and the final result is worse than when no rays are omitted.
@nogu-atsu Oh, I got it wrong. The real reason may be a race condition between the two atomicAdds, e.g.,

[thread1] get point_index = 0, num_steps = 10
[thread2] get point_index = 10, num_steps = 3 # race happens
[thread2] get ray_index = 0
[thread1] get ray_index = 1

and this leads to rays:

[1, 10, 3]
[0, 0, 10]

which leads to the problem you described. So the inconsistency you observed is likely caused by this race. It seems there is no universal way to make sure the two atomicAdds always run consecutively. Maybe you could elaborate on why you need this?
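For illustration, the pattern that races is roughly the following (a minimal sketch, not the actual kernel_march_rays_train; the counter and buffer names are simplified):

```cuda
// Minimal sketch of the racy pattern; NOT the real torch-ngp kernel.
// counter[0] counts total sample points, counter[1] counts compacted rays.
__global__ void march_rays_sketch(const int n_rays, int* counter, int* rays /* [n_rays, 3] */) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_rays) return;

    int num_steps = 0;  // in the real kernel: march ray i and count its samples

    // first reservation: space for this ray's sample points
    const int point_index = atomicAdd(&counter[0], num_steps);

    // <-- another thread can execute BOTH of its atomicAdds right here -->

    // second reservation: this ray's row in the compacted rays buffer
    const int ray_index = atomicAdd(&counter[1], 1);

    // Because the two reservations are separate atomics, the order of the
    // point_index values does not have to match the order of the ray_index
    // values, so rays[:, 1] is not guaranteed to be cumulative row by row.
    rays[ray_index * 3 + 0] = i;
    rays[ray_index * 3 + 1] = point_index;
    rays[ray_index * 3 + 2] = num_steps;
}
```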
@kwea123 I remember that I wanted to imitate what instant-ngp does around here. Since the total number of samples over all rays is usually much smaller than the maximum (n_rays * max_steps), it estimates the average number of samples from past training steps, so we can save both memory and time. It does make the final performance a little worse, though.
@ashawkey I see. I wonder whether this problem causes performance degradation in NeRF training.

I tried modifying kernel_march_rays_train to store more per-ray information in rays using additional atomicAdds, and that is how I found this problem. The more atomicAdds there are, the more serious the race condition seems to become. I finally solved it by splitting the function into two parts, before and after the atomicAdd, and replacing the atomicAdd with torch.cumsum outside the CUDA kernel. However, this increases execution time, so I wanted to know if there is a smarter solution. Thanks.
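In case it helps, the two-pass idea looks roughly like this (a sketch only; I actually do the prefix sum with torch.cumsum from Python rather than with Thrust, and the kernel/buffer names here are illustrative):

```cuda
// Two-pass sketch of the cumsum workaround (illustrative names only).
#include <thrust/device_ptr.h>
#include <thrust/scan.h>

// Pass 1: only count how many samples each ray needs.
__global__ void count_steps(const int n_rays, int* num_steps /* [n_rays] */) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_rays) return;
    num_steps[i] = 0;  // in the real kernel: march ray i and count its samples
}

// Between the passes: an exclusive prefix sum over the counts gives each
// ray's point offset (this is what torch.cumsum does when called from Python).
void compute_offsets(const int n_rays, int* num_steps, int* offsets) {
    thrust::device_ptr<int> steps(num_steps), offs(offsets);
    thrust::exclusive_scan(steps, steps + n_rays, offs);
}

// Pass 2: write the samples using the precomputed offsets, no atomicAdd needed.
__global__ void write_samples(const int n_rays, const int* num_steps,
                              const int* offsets, float* xyzs /* [M, 3] */) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_rays) return;
    const int point_index = offsets[i];
    // march ray i again and write num_steps[i] points starting at xyzs[point_index * 3]
}
```

The downside is the extra kernel launch and marching each ray twice (or caching intermediate state), which matches the extra execution time I mentioned above.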
"I wanted to imitate what instant-ngp does"

Okay, but in my experience using M = n_rays * max_steps doesn't cost too much time (and maybe saves memory, I don't remember). And ignoring these rays might actually cause the performance gap between your method and instant-ngp (not yet confirmed, but possibly).

Also, the return condition is actually wrong for the last ray; someone pointed this out in my issue: https://github.com/kwea123/ngp_pl/issues/23
Imagine a trivial example with 1 ray and 1024 sample points. In this case M = 1024, point_index = 0 and num_steps = 1024, so point_index + num_steps >= M is always true and the kernel returns zero rgb (which is of course wrong).

@kwea123 In fact, the zero RGB of the last ray is intended if I understand correctly; it is also in the ray-marching training code of instant-ngp. This ray (and maybe more rays) will be ignored in both the forward and backward passes of the current training step, as if we are dynamically adjusting the batch size of rays. But I did find something wrong: instant-ngp puts this return between the two atomicAdds, while I put it after... I'll run more experiments later to test its influence on performance, memory and time. Thanks for bringing this up!
His code is > while yours is >=, so I believe it's not the same. Using my example, his condition is 0 + 1024 > 1024, which is false, so the last ray will not end up with zero rgb.
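To make the arithmetic concrete, a tiny standalone check (illustration only, not the exact code from either repo; plain host code, but it compiles as a .cu file too):

```cuda
#include <cstdio>

// Boundary condition from the example above: 1 ray, 1024 sample points, M = 1024.
int main() {
    const int M = 1024, point_index = 0, num_steps = 1024;
    // '>=' drops the ray even though its 1024 points fit exactly into the buffer:
    printf(">= drops the ray: %d\n", point_index + num_steps >= M);  // prints 1 (true)
    // '>' keeps it, only rejecting rays that would actually overflow the buffer:
    printf(">  drops the ray: %d\n", point_index + num_steps >  M);  // prints 0 (false)
    return 0;
}
```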
Yes, this is indeed a bug. And there must be more 😂.
Hi,

For the 2 atomicAdds, can't we replace the int32[2] with a uint64 and do a single atomicAdd? That way we don't need a lock to deal with the critical section; we only need to pack/unpack the 2 indices to/from the uint64, which is straightforward.

EDIT: since PyTorch does not support uint64, some casting is necessary.
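Something like the following rough sketch (illustrative names; it assumes both counters stay below 2^32, so the low field never carries into the high field):

```cuda
// Sketch of packing the two counters into one 64-bit atomicAdd.
// packed_counter: low 32 bits = point counter, high 32 bits = ray counter.
__global__ void march_rays_packed(const int n_rays, unsigned long long* packed_counter,
                                  int* rays /* [n_rays, 3] */) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_rays) return;

    unsigned int num_steps = 0;  // in the real kernel: march ray i and count its samples

    // Add num_steps to the low field and 1 to the high field in a single atomic.
    const unsigned long long delta =
        ((unsigned long long)1 << 32) | (unsigned long long)num_steps;
    const unsigned long long old = atomicAdd(packed_counter, delta);

    const unsigned int point_index = (unsigned int)(old & 0xffffffffULL);
    const unsigned int ray_index   = (unsigned int)(old >> 32);

    // point_index and ray_index now come from the SAME atomic update,
    // so their ordering is consistent across threads.
    rays[ray_index * 3 + 0] = i;
    rays[ray_index * 3 + 1] = (int)point_index;
    rays[ray_index * 3 + 2] = (int)num_steps;
}
```

On the PyTorch side, the packed counter could presumably live in an int64 tensor and be reinterpreted as unsigned long long inside the kernel, which is the casting mentioned in the EDIT above.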
Thank you for the great work!
I think the rays tensor computed by kernel_march_rays_train stores the cumulative number of sampled points (point_index) in its second column, and the number of sampled points on each ray (num_steps) in its third column. So rays should satisfy the following condition:

(rays[i + 1, 1] - rays[i, 1]) == rays[i, 2]

However, a small percentage of rays do not meet this condition. If I train D-NeRF on jumpingjacks and compute

((rays[1:, 1] - rays[:-1, 1]) == rays[:-1, 2]).float().mean().item()

the result is not 1 (it is in the range 0.987~0.997). I think this is because the two atomicAdds in https://github.com/ashawkey/torch-ngp/blob/main/raymarching/src/raymarching.cu#L405 are not always executed in the same order.

Is there a workaround? I am implementing code that uses more than 3 atomicAdds and I am facing the same problem.
Thanks.