Open nogu-atsu opened 2 years ago
@nogu-atsu Hi, I think this is caused by the default setting force_all_rays = False in training, which may ignore some rays for acceleration (that is what triggers this). You should set this option to True if you want to make sure all rays are calculated.
@ashawkey Thank you for the prompt response. I tried force_all_rays = True, but the problem remained. The force_all_rays handling happens after the rays buffer has been written out, so I guess it does not affect the contents of rays.
@ashawkey I wonder why force_all_rays exists in the first place. Some rays are ignored for acceleration (less data is certainly faster), but why do this at all? In my experiments it makes training unstable (the loss curve zigzags), and the final result is worse than when no rays are omitted.
@nogu-atsu Oh, I got it wrong. The real reason may be a race condition between the two atomicAdds, e.g.,

[thread1] get point_index = 0, num_steps = 10
[thread2] get point_index = 10, num_steps = 3 # race happens
[thread2] get ray_index = 0
[thread1] get ray_index = 1

and this leads to rays:

[1, 10, 3]
[0, 0, 10]

which leads to the problem you described. So the inconsistency you observed is likely caused by this race. It seems there is no universal way to make sure the two atomicAdds always run consecutively. Maybe you could elaborate on why you need this?
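For illustration, the pattern that races is roughly the following (a minimal sketch, not the actual kernel_march_rays_train; the counter and buffer names are simplified):

```cuda
// Minimal sketch of the racy pattern; NOT the real torch-ngp kernel.
// counter[0] counts total sample points, counter[1] counts compacted rays.
__global__ void march_rays_sketch(const int n_rays, int* counter, int* rays /* [n_rays, 3] */) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_rays) return;

    int num_steps = 0;  // in the real kernel: march ray i and count its samples

    // first reservation: space for this ray's sample points
    const int point_index = atomicAdd(&counter[0], num_steps);

    // <-- another thread can execute BOTH of its atomicAdds right here -->

    // second reservation: this ray's row in the compacted rays buffer
    const int ray_index = atomicAdd(&counter[1], 1);

    // Because the two reservations are separate atomics, the order of the
    // point_index values does not have to match the order of the ray_index
    // values, so rays[:, 1] is not guaranteed to be cumulative row by row.
    rays[ray_index * 3 + 0] = i;
    rays[ray_index * 3 + 1] = point_index;
    rays[ray_index * 3 + 2] = num_steps;
}
```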
@kwea123 I remember that I wanted to imitate what instant-ngp does around here. Since the total number of samples over all rays is usually much smaller than the maximum (n_rays * max_steps), it estimates the average number of samples from past training steps, so we can save both memory and time. It does make the final performance a little worse, though.
@ashawkey I see. I wonder whether this problem causes performance degradation in NeRF training.

I tried modifying kernel_march_rays_train to store more per-ray information in rays using additional atomicAdds, and that is how I found this problem. The more atomicAdds there are, the more serious the race condition seems to become. I finally solved it by splitting the function into two parts, before and after the atomicAdd, and replacing the atomicAdd with torch.cumsum outside the CUDA kernel. However, this increases execution time, so I wanted to know if there is a smarter solution. Thanks.
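In case it helps, the two-pass idea looks roughly like this (a sketch only; I actually do the prefix sum with torch.cumsum from Python rather than with Thrust, and the kernel/buffer names here are illustrative):

```cuda
// Two-pass sketch of the cumsum workaround (illustrative names only).
#include <thrust/device_ptr.h>
#include <thrust/scan.h>

// Pass 1: only count how many samples each ray needs.
__global__ void count_steps(const int n_rays, int* num_steps /* [n_rays] */) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_rays) return;
    num_steps[i] = 0;  // in the real kernel: march ray i and count its samples
}

// Between the passes: an exclusive prefix sum over the counts gives each
// ray's point offset (this is what torch.cumsum does when called from Python).
void compute_offsets(const int n_rays, int* num_steps, int* offsets) {
    thrust::device_ptr<int> steps(num_steps), offs(offsets);
    thrust::exclusive_scan(steps, steps + n_rays, offs);
}

// Pass 2: write the samples using the precomputed offsets, no atomicAdd needed.
__global__ void write_samples(const int n_rays, const int* num_steps,
                              const int* offsets, float* xyzs /* [M, 3] */) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_rays) return;
    const int point_index = offsets[i];
    // march ray i again and write num_steps[i] points starting at xyzs[point_index * 3]
}
```

The downside is the extra kernel launch and marching each ray twice (or caching intermediate state), which matches the extra execution time I mentioned above.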
"I wanted to imitate what instant-ngp does"

Okay, but in my experience using M = n_rays * max_steps doesn't cost too much time (and maybe saves memory, I don't remember). And ignoring these rays might actually cause the performance gap between your method and instant-ngp (not yet confirmed, but possibly).

Also, the return condition is actually wrong for the last ray; someone pointed this out in my issue: https://github.com/kwea123/ngp_pl/issues/23
Imagine a trivial example with 1 ray and 1024 sample points. In this case M = 1024, point_index = 0 and num_steps = 1024, so point_index + num_steps >= M is always true and the kernel returns zero rgb (which is of course wrong).

@kwea123 In fact, the zero RGB of the last ray is intended if I understand correctly; it is also in the ray-marching training code of instant-ngp. This ray (and maybe more rays) will be ignored in both the forward and backward passes of the current training step, as if we are dynamically adjusting the batch size of rays. But I did find something wrong: instant-ngp puts this return between the two atomicAdds, while I put it after... I'll run more experiments later to test its influence on performance, memory and time. Thanks for bringing this up!
His code is > while yours is >=, so I believe it's not the same. Using my example, his condition is 0 + 1024 > 1024, which is false, so the last ray will not end up with zero rgb.
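To make the arithmetic concrete, a tiny standalone check (illustration only, not the exact code from either repo; plain host code, but it compiles as a .cu file too):

```cuda
#include <cstdio>

// Boundary condition from the example above: 1 ray, 1024 sample points, M = 1024.
int main() {
    const int M = 1024, point_index = 0, num_steps = 1024;
    // '>=' drops the ray even though its 1024 points fit exactly into the buffer:
    printf(">= drops the ray: %d\n", point_index + num_steps >= M);  // prints 1 (true)
    // '>' keeps it, only rejecting rays that would actually overflow the buffer:
    printf(">  drops the ray: %d\n", point_index + num_steps >  M);  // prints 0 (false)
    return 0;
}
```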
Yes, this is indeed a bug. And there must be more 😂.
Hi,

For the 2 atomicAdds, can't we replace the int32[2] with a uint64 and do a single atomicAdd? That way we don't need a lock to deal with the critical section; we only need to pack/unpack the 2 indices to/from the uint64, which is straightforward.

EDIT: since PyTorch does not support uint64, some casting is necessary.
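Something like the following rough sketch (illustrative names; it assumes both counters stay below 2^32, so the low field never carries into the high field):

```cuda
// Sketch of packing the two counters into one 64-bit atomicAdd.
// packed_counter: low 32 bits = point counter, high 32 bits = ray counter.
__global__ void march_rays_packed(const int n_rays, unsigned long long* packed_counter,
                                  int* rays /* [n_rays, 3] */) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_rays) return;

    unsigned int num_steps = 0;  // in the real kernel: march ray i and count its samples

    // Add num_steps to the low field and 1 to the high field in a single atomic.
    const unsigned long long delta =
        ((unsigned long long)1 << 32) | (unsigned long long)num_steps;
    const unsigned long long old = atomicAdd(packed_counter, delta);

    const unsigned int point_index = (unsigned int)(old & 0xffffffffULL);
    const unsigned int ray_index   = (unsigned int)(old >> 32);

    // point_index and ray_index now come from the SAME atomic update,
    // so their ordering is consistent across threads.
    rays[ray_index * 3 + 0] = i;
    rays[ray_index * 3 + 1] = (int)point_index;
    rays[ray_index * 3 + 2] = (int)num_steps;
}
```

On the PyTorch side, the packed counter could presumably live in an int64 tensor and be reinterpreted as unsigned long long inside the kernel, which is the casting mentioned in the EDIT above.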
Thank you for the great work!
I think the rays tensor computed by kernel_march_rays_train stores the cumulative number of sampled points (point_index) in its second column, and the number of sampled points on each ray (num_steps) in its third column. So rays should satisfy the following condition:

(rays[i + 1, 1] - rays[i, 1]) == rays[i, 2]

However, a small percentage of rays do not meet this condition. If I train D-NeRF on jumpingjacks and compute

((rays[1:, 1] - rays[:-1, 1]) == rays[:-1, 2]).float().mean().item()

the result is not 1 (it is in the range 0.987~0.997). I think this is because the two atomicAdds in https://github.com/ashawkey/torch-ngp/blob/main/raymarching/src/raymarching.cu#L405 are not always executed in the same order.

Is there a workaround? I am implementing code that uses more than 3 atomicAdds and I am facing the same problem.
Thanks.