NVIDIAGameWorks / kaolin

A PyTorch Library for Accelerating 3D Deep Learning Research
Apache License 2.0

About randomness in kaolin DIBR #638

Open lai-pf opened 1 year ago

lai-pf commented 1 year ago

About randomness: are there any sources of randomness in Kaolin's rendering? I am following a paper that uses Kaolin as the renderer. Although I have fixed all the seeds, the gradients are not the same between runs. It seems that differing gradients in the backward pass lead to different optimization results. Does this randomness come from the Kaolin renderer? During optimization, a little randomness in iteration one can make the final results really different. We use CUDA as the rasterization backend in the DIBR function. I found an issue about randomness in nvdiffrast (https://github.com/NVlabs/nvdiffrast/issues/13#issuecomment-767484493), but my code uses the CUDA rasterization backend, not nvdiffrast. So I just want to know: does this randomness come from Kaolin, and is it inevitable? Or does it come from my part of the code? Thanks very much; I would appreciate it if you could answer my question, it would be really helpful.

lai-pf commented 1 year ago

Hi, can someone help me? I've checked my code again, and I still see randomness in the gradient of my network's last layer. I think it is Kaolin that leads to this non-determinism. I don't know where I'm wrong; requesting help again.

Caenorst commented 1 year ago

Hi @lai-pf , to my knowledge there is no source of non-determinism in Kaolin's rendering; you can easily check by doing something like this:

# Run the same forward/backward twice on identical inputs and compare gradients
input1 = input1.detach().clone()
input1.requires_grad = True
input2 = input1.detach().clone()
input2.requires_grad = True
output1 = function_to_test(input1)
grad_output = torch.rand_like(output1)
output1.backward(grad_output)
output2 = function_to_test(input2)
output2.backward(grad_output)
print(torch.equal(input1.grad, input2.grad))
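For reference, a self-contained version of this check, with a toy differentiable function standing in for the Kaolin rendering call (`function_to_test` here is just a placeholder, not the library API):

```python
import torch

def function_to_test(x):
    # Toy stand-in for the rendering call; any differentiable function works.
    return (x * x).sum(dim=1)

input1 = torch.randn(4, 3)
input1.requires_grad = True
input2 = input1.detach().clone()
input2.requires_grad = True

output1 = function_to_test(input1)
grad_output = torch.rand_like(output1)
output1.backward(grad_output)

output2 = function_to_test(input2)
output2.backward(grad_output)

# Deterministic ops yield bit-identical gradients across the two passes.
print(torch.equal(input1.grad, input2.grad))  # True
```

On CPU this prints True; a nondeterministic CUDA kernel in `function_to_test` would make it print False on some runs.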
lai-pf commented 1 year ago

Hi Caenorst, thanks for your help, it helped me a lot. I followed the test demo you suggested: the gradient is the same when I use a single mesh. When I use DIBR to render two meshes in one scene, I see the randomness. It seems that when we concatenate the faces and vertices of two meshes into one mesh, something goes wrong when determining the weights of the vertices that compose a pixel. Have you seen this problem before? To reproduce the problem, you can use DIBR and concatenate two meshes like this:

vertices_num_A = mesh_A.vertices.shape[0]
mesh_B.faces[:, :] += vertices_num_A
tmp_vertices = concat([mesh_A.vertices, mesh_B.vertices])
tmp_faces = concat([mesh_A.faces, mesh_B.faces])
scene_Mesh.vertices = tmp_vertices
scene_Mesh.faces = tmp_faces

Then simply build a network with only one MLP layer. Looking at this layer's gradient with the test demo from https://github.com/NVIDIAGameWorks/kaolin/issues/638#issuecomment-1290792293, the problem can be reproduced. If there is any resolution to this problem, please let me know. Thanks again; my work has been stuck here for a long time, and your help makes me feel hopeful again.

Caenorst commented 1 year ago

Hi @lai-pf , you need to add an offset to mesh_B.faces:

tmp_faces = concat([mesh_A.faces, mesh_B.faces + mesh_A.vertices.shape[0]])
Caenorst commented 1 year ago

Do you have this gradient difference if you just render a single mesh? (mesh_A?)

lai-pf commented 1 year ago

I used the test demo you suggested in https://github.com/NVIDIAGameWorks/kaolin/issues/638#issuecomment-1290792293. In single-mesh rendering I haven't seen a difference in the gradient; it seems right.

Caenorst commented 1 year ago

Also in mesh_B?

lai-pf commented 1 year ago

I didn't try B. In Kaolin's DIBR, should the mesh be watertight?

Caenorst commented 1 year ago

Rasterization should work with non-watertight meshes.

lai-pf commented 1 year ago

Hi Caenorst, I've tried single mesh_B (a hat); the single hat mesh gives a different gradient between runs. And when I use single mesh_A the gradient is right.

Caenorst commented 1 year ago

So the single mesh_B leads to non-deterministic gradient? Can you share the model?

lai-pf commented 1 year ago

Can you send me an email? My Gmail is dr.henrylai@gmail.com; I can send the mesh with the code through email.

Caenorst commented 1 year ago

There is indeed a source of non-determinism, which probably comes from the atomicAdd here: https://github.com/NVIDIAGameWorks/kaolin/blob/master/kaolin/csrc/render/mesh/rasterization_cuda.cu#L391-L398

The differences in values are on the order of 1e-6, which should be negligible. I would argue that if that leads to failure vs. success in an optimization pipeline, then probably something else is wrong.

Unfortunately, making an efficient deterministic version is not that straightforward. One thing you can do (though it would strongly affect the speed of the kernel) is the following: 1) change the number of blocks to 1; 2) put the atomicAdd in a for loop, as follows:

// Serialize the atomicAdds: only one thread in the block writes at a time,
// so the floating-point additions always happen in the same order.
for (int i = 0; i < blockDim.x; i++) {
    if (threadIdx.x == i) {
        atomicAdd(grad_face_vertices_image + start_image_idx + 0, dldI * dIdax);
        atomicAdd(grad_face_vertices_image + start_image_idx + 1, dldI * dIday);

        atomicAdd(grad_face_vertices_image + start_image_idx + 2, dldI * dIdbx);
        atomicAdd(grad_face_vertices_image + start_image_idx + 3, dldI * dIdby);

        atomicAdd(grad_face_vertices_image + start_image_idx + 4, dldI * dIdcx);
        atomicAdd(grad_face_vertices_image + start_image_idx + 5, dldI * dIdcy);
    }
    __syncthreads();
}
Caenorst commented 1 year ago

You want to do the same thing here.

That has resolved most of the non-determinism (for some reason I still have some rare occurrences, but I can't pin down where they are coming from now).
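As a side note (not from this thread): PyTorch itself can flag nondeterministic built-in ops. This will not catch custom CUDA extensions like Kaolin's rasterizer kernel above, but it helps rule out the rest of a pipeline:

```python
import torch

# Raise an error whenever a built-in op without a deterministic
# implementation is used (custom CUDA extensions are not covered).
torch.use_deterministic_algorithms(True)
print(torch.are_deterministic_algorithms_enabled())  # True
```

On CUDA, some ops additionally require setting the CUBLAS_WORKSPACE_CONFIG environment variable for this mode to work.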

lai-pf commented 1 year ago

> You want to do the same thing here.
>
> That has resolved most of the non-determinism (for some reason I still have some rare occurrences, but I can't pin down where they are coming from now).

I followed the suggestion and made the three changes (screenshots omitted), but the gradient in the case of the hat mesh is still different, like before. Is there any change I forgot? In my task, running speed is not important to me, but reproducibility matters, so is there any possible solution to make DIBR stable and deterministic? Or do we have a CPU version of Kaolin's DIBR that can make sure we get the same result each time?

lai-pf commented 1 year ago

And I found something strange: when I use other meshes (person.obj and shoe.obj) as a scene to render, the gradient is the same. It seems like only the hat mesh is special? Is the hat different in some way? In my tests today, I found that some meshes make the gradient different and some don't; I think those meshes belong to one special class. Do you know anything about this?

Caenorst commented 1 year ago

I found another source of non-determinism; replace kaolin.ops.mesh.index_vertices_by_faces with the following:

def index_vertices_by_faces(vertices_features, faces):
    r"""Index vertex features to convert per vertex tensor to per vertex per face tensor.
    Args:
        vertices_features (torch.FloatTensor):
            vertices features, of shape
            :math:`(\text{batch_size}, \text{num_points}, \text{knum})`,
            ``knum`` is feature dimension, the features could be xyz position,
            rgb color, or even neural network features.
        faces (torch.LongTensor):
            face index, of shape :math:`(\text{num_faces}, \text{num_vertices})`.
    Returns:
        (torch.FloatTensor):
            the face features, of shape
            :math:`(\text{batch_size}, \text{num_faces}, \text{num_vertices}, \text{knum})`.
    """
    assert vertices_features.ndim == 3, \
        "vertices_features must have 3 dimensions of shape (batch_size, num_points, knum)"
    assert faces.ndim == 2, "faces must have 2 dimensions of shape (num_faces, num_vertices)"
    return vertices_features[:, faces]
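For illustration, a quick shape check of this replacement on toy tensors (values and shapes here are arbitrary):

```python
import torch

def index_vertices_by_faces(vertices_features, faces):
    # Gather per-vertex features into a per-face, per-vertex layout
    # via plain advanced indexing.
    return vertices_features[:, faces]

vertices = torch.rand(2, 5, 3)                 # (batch_size, num_points, knum)
faces = torch.tensor([[0, 1, 2], [2, 3, 4]])   # (num_faces, num_vertices)

out = index_vertices_by_faces(vertices, faces)
print(out.shape)  # torch.Size([2, 2, 3, 3])
```

Each output entry out[b, f, v] is simply vertices[b, faces[f, v]], so the result matches the documented (batch_size, num_faces, num_vertices, knum) shape.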