CERN / TIGRE

TIGRE: Tomographic Iterative GPU-based Reconstruction Toolbox
BSD 3-Clause "New" or "Revised" License

Artifacts in 3D-TV Denoised Images #594

Closed: zezisme closed this issue 2 weeks ago

zezisme commented 1 month ago

Hello, I found that when using im3DDenoise to denoise 3D images, the edge slices develop artifacts as the number of iterations increases, and these artifacts seem to come from other slices. Is there any reference document for this algorithm? Can you provide the corresponding mathematical derivation? I want to know whether the issue comes from the mathematical principle or from its implementation.

Actual Behavior

TV_lambda = 200;

[images: FDK; FDK+TVdenoise (50 iterations); FDK+TVdenoise (100 iterations); FDK+TVdenoise (200 iterations)]

The artifact seems to come from the 267th slice (or near it; the total slice count is 400). It seems related to a ratio of 1/3: (400-267)/400 ≈ 1/3. [image]

Code to reproduce the problem (If applicable)

x = FDK(rawdata_proj, geo, angles, 'filter', filter_type);
% TV denoising
TV_niter  = 200;  % 50, 100, 200, ...
TV_lambda = 200;  % 10, 20, ..., 200
x = im3DDenoise(x, 'TV', TV_niter, TV_lambda, 'gpuids', GpuIds());

Specifications

AnderBiguri commented 1 month ago

Interesting. Is the ghost you are getting the first slice? This was coded 10 years ago, so I don't exactly remember the implementation, but it could be caused by periodic boundary conditions.

AnderBiguri commented 1 month ago

In any case, you can find the math and the relevant paper in my PhD thesis (in my GitHub profile), and the code is in Common/CUDA/tvdenoising.cu. It does not seem to have periodic boundary conditions. Can you identify which slice this ghost is coming from? Can you also tell me the size of the image and the number of GPUs (including models)? Can you reproduce this on a small image, on a single GPU, using TIGRE's default data?
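
For quick reference, and to the best of my memory of the thesis, the denoiser minimizes the standard ROF total-variation model (up to the exact scaling of the hyperparameter):

    $$ \min_{u}\; \tfrac{1}{2}\,\lVert u - f \rVert_2^2 \;+\; \lambda \sum_{i} \lVert (\nabla u)_i \rVert_2 $$

where f is the noisy input volume, u the denoised volume, and lambda the strength passed to im3DDenoise. It is solved with a primal-dual iteration whose dual variables correspond to the d_px, d_py, d_pz buffers in tvdenoising.cu.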

zezisme commented 1 month ago

CASE 1: input image size = (1529, 1529, 400), and I have just one GPU. The input image (slice 400, the last slice) has no ghost. [image] Using the default parameters (iter = 50; hyper = 15; single GPU), the output denoised image shows an obvious ghost in the 400th slice. [image] In the 399th and 398th slices the ghost becomes progressively weaker. [images] The ghost seems to come from the 267th slice. [images]

CASE 2: input image size = (1529, 1529, 100), default parameters (iter = 50; hyper = 15; single GPU). The output image (last slice) has no ghost (great!). [image]

CASE 3: input image size = (1529, 1529, 200), default parameters (iter = 50; hyper = 15; single GPU). The output image (last slice) has a ghost again, [image] and the ghost seems to come from the 100th slice. [image]

Final test: I find that if the GPU memory is not enough, a warning is printed and the algorithm falls back to CPU memory, and the ghost appears. If the GPU memory is enough, the ghost does not occur. [images]

Conclusion: this bug is most likely caused by a problem with the data handling during communication between the CPU and GPU.
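
To illustrate what I suspect (hypothetical numbers and split logic, not necessarily what TIGRE actually does): when the volume does not fit on the GPU, it is processed in z-chunks plus a few buffer slices on each side, and a stale, never-cleared device buffer can leak slices from one chunk into another:

    // Sketch only: a z-split with buffer slices, hypothetical sizes.
    // For a 400-slice volume where roughly 2/3 fits on the GPU, the split
    // point lands near slice 267, matching where the ghost comes from.
    #include <cstdio>
    #include <algorithm>

    int main() {
        const long nz = 400, maxFit = 267, buffer = 2;
        for (long z0 = 0; z0 < nz; z0 += maxFit) {
            long z1 = std::min(z0 + maxFit, nz) - 1;  // last valid slice
            long s0 = std::max(z0 - buffer, 0L);      // staged range also
            long s1 = std::min(z1 + buffer, nz - 1);  // includes buffer slices
            std::printf("chunk: valid %ld..%ld, staged %ld..%ld\n", z0, z1, s0, s1);
            // If the staged device array is longer than (s1-s0+1) slices and is
            // not zeroed between chunks/iterations, its tail keeps slices from
            // the previous chunk: a ghost appears at the volume edge.
        }
        return 0;
    }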

AnderBiguri commented 1 month ago

Fantastic experiment; indeed, this is why I asked about GPU size etc. It seems that something is broken in the memory in/out, where either the memory doesn't get appropriately reset or the chunk of memory being copied is wrong. I will investigate.

zezisme commented 1 month ago

Great! Looking forward to your solution.

AnderBiguri commented 1 month ago

A possibility:

https://github.com/CERN/TIGRE/blob/master/Common/CUDA/tvdenoising.cu#L378-L389

may need to be changed to:

    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]);
        // New line: zero the whole staged buffer before copying the chunk in
        cudaMemsetAsync(d_src[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+1]);
        cudaMemcpyAsync(d_src[dev]+offset_device[dev], src+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
    }
    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]);
        // All these are async
        // New line:
        cudaMemsetAsync(d_u[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+1]);
        cudaMemcpyAsync(d_u[dev]+offset_device[dev], d_src[dev]+offset_device[dev], bytes_device[dev]*sizeof(float), cudaMemcpyDeviceToDevice, stream[dev*nStream_device+1]);
        cudaMemsetAsync(d_px[dev], 0, mem_img_each_GPU, stream[dev*nStream_device]);
        cudaMemsetAsync(d_py[dev], 0, mem_img_each_GPU, stream[dev*nStream_device]);
        cudaMemsetAsync(d_pz[dev], 0, mem_img_each_GPU, stream[dev*nStream_device]);
    }

I cannot try it now, but if you want to try that, great! If not, give me a bit of time and I'll give it a shot myself.

zezisme commented 1 month ago

I have tested your new code, but the result is the same as before; the same ghost is still there.

AnderBiguri commented 1 month ago

Okay, I will investigate further, but this may be hard to fix: in all my years I have never seen this happen, which means it's a relatively niche case, and I am not sure I will be able to reproduce it. I'll try!

AnderBiguri commented 1 month ago

Can you tell me exactly which GPU you have? I can try to simulate your machine if I know exactly how much memory your GPU has.

zezisme commented 1 month ago

Here is my GPU information. [image] Thank you very much; I hope you can fix this bug as soon as possible. In addition, I am also learning CUDA programming, so I may be able to fix the problem myself if possible.

zezisme commented 1 month ago

Hello @AnderBiguri: I seem to have found the bugs. After reading the source code carefully, I found two problems: (1) in iterations > 1, the device memory is not reset to zero, which is very important for the first and last chunks (because of the buffer_pixels, and because the last chunk is not large enough to fill the allocation); (2) after each chunk iteration sp, cudaMemcpyDeviceToHost updates h_u, h_px, h_py, and h_pz; the next chunk iteration (sp+1) then does a cudaMemcpyHostToDevice, but since the copied size bytes_device is curr_pixels+buffer_pixels (https://github.com/CERN/TIGRE/blob/f0e91005ba2881e1a5e63eb744740ca0c01af6c8/Common/CUDA/tvdenoising.cu#L375), host data already updated by the previous chunk is partially copied into the sp+1 chunk, causing data aliasing.

The first problem is probably the main cause of the ghost, and I tried inserting the following new code at https://github.com/CERN/TIGRE/blob/f0e91005ba2881e1a5e63eb744740ca0c01af6c8/Common/CUDA/tvdenoising.cu#L402-L434. The new code:

    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]);
        cudaStreamSynchronize(stream[dev*nStream_device+1]);
        // New line: zero d_u before copying the chunk in
        cudaMemsetAsync(d_u[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+1]);
        cudaMemcpyAsync(d_u[dev]+offset_device[dev], h_u+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
    }
    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]);
        cudaStreamSynchronize(stream[dev*nStream_device+2]);
        // New line:
        cudaMemsetAsync(d_px[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+2]);
        cudaMemcpyAsync(d_px[dev]+offset_device[dev], h_px+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+2]);
    }
    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]);
        cudaStreamSynchronize(stream[dev*nStream_device+3]);
        // New line:
        cudaMemsetAsync(d_py[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+3]);
        cudaMemcpyAsync(d_py[dev]+offset_device[dev], h_py+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+3]);
    }
    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]);
        cudaStreamSynchronize(stream[dev*nStream_device+4]);
        // New line:
        cudaMemsetAsync(d_pz[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+4]);
        cudaMemcpyAsync(d_pz[dev]+offset_device[dev], h_pz+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+4]);
    }
    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]); // added: needed before the per-device calls below
        cudaStreamSynchronize(stream[dev*nStream_device+1]);
        // New line:
        cudaMemsetAsync(d_src[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+1]);
        cudaMemcpyAsync(d_src[dev]+offset_device[dev], src+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
    }

It just adds a line zeroing d_u, d_px, d_py, d_pz, and d_src before the new values are copied in. And finally this attempt works: there is no ghost whatever parameters I use. [image]

Of course, there should be a more efficient solution, but I haven't tried one yet; and I haven't thought of a good solution for problem 2.

zezisme commented 1 month ago

Here comes a new problem. For the first and last slices, since 0 is used as the buffer data, there is obvious noise. The first and last slices (with noise): [images] The middle slice (smooth, without noise): [image]

Could we use the adjacent slices, flipped along the z-axis, as the buffer data, so that the data gradient stays smooth? I am not good at CUDA programming yet, so I don't know how to implement it.

zezisme commented 1 month ago

I have tried using flipped copies of the first and last slices as the buffer data; here is the new code (note: multi-GPU is not considered here): https://github.com/CERN/TIGRE/blob/f0e91005ba2881e1a5e63eb744740ca0c01af6c8/Common/CUDA/tvdenoising.cu#L378-L389

    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]);
        // New lines (single GPU only): mirror-fill the buffer slices of d_src
        if (is_first_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_src[dev]+pixels_per_slice*j, src+pixels_per_slice*(buffer_length-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
            }
        }
        if (is_last_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_src[dev]+bytes_device[dev]+pixels_per_slice*j, src+pixels_per_slice*(image_size[2]-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
            }
        }
        cudaMemcpyAsync(d_src[dev]+offset_device[dev], src+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
    }
    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]);
        // All these are async
        // New line:
        cudaMemcpyAsync(d_u[dev], d_src[dev], mem_img_each_GPU, cudaMemcpyDeviceToDevice, stream[dev*nStream_device+1]);
        cudaMemsetAsync(d_px[dev], 0, mem_img_each_GPU, stream[dev*nStream_device]);
        cudaMemsetAsync(d_py[dev], 0, mem_img_each_GPU, stream[dev*nStream_device]);
        cudaMemsetAsync(d_pz[dev], 0, mem_img_each_GPU, stream[dev*nStream_device]);
    }

https://github.com/CERN/TIGRE/blob/f0e91005ba2881e1a5e63eb744740ca0c01af6c8/Common/CUDA/tvdenoising.cu#L402-L434

    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]);
        cudaStreamSynchronize(stream[dev*nStream_device+1]);
        // New lines (single GPU only): mirror-fill the buffer slices of d_u
        if (is_first_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_u[dev]+pixels_per_slice*j, h_u+pixels_per_slice*(buffer_length-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
            }
        }
        if (is_last_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_u[dev]+bytes_device[dev]+pixels_per_slice*j, h_u+pixels_per_slice*(image_size[2]-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
            }
        }
        //cudaMemsetAsync(d_u[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+1]);//zhouenze
        cudaMemcpyAsync(d_u[dev]+offset_device[dev], h_u+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
    }
    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]);
        cudaStreamSynchronize(stream[dev*nStream_device+2]);
        // New lines: same mirror fill for d_px
        if (is_first_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_px[dev]+pixels_per_slice*j, h_px+pixels_per_slice*(buffer_length-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+2]);
            }
        }
        if (is_last_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_px[dev]+bytes_device[dev]+pixels_per_slice*j, h_px+pixels_per_slice*(image_size[2]-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+2]);
            }
        }
        // cudaMemsetAsync(d_px[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+2]);//zhouenze
        cudaMemcpyAsync(d_px[dev]+offset_device[dev], h_px+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+2]);
    }
    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]);
        cudaStreamSynchronize(stream[dev*nStream_device+3]);
        // New lines: same mirror fill for d_py
        if (is_first_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_py[dev]+pixels_per_slice*j, h_py+pixels_per_slice*(buffer_length-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+3]);
            }
        }
        if (is_last_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_py[dev]+bytes_device[dev]+pixels_per_slice*j, h_py+pixels_per_slice*(image_size[2]-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+3]);
            }
        }
        // cudaMemsetAsync(d_py[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+3]);//zhouenze
        cudaMemcpyAsync(d_py[dev]+offset_device[dev], h_py+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+3]);
    }
    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]);
        cudaStreamSynchronize(stream[dev*nStream_device+4]);
        // New lines: same mirror fill for d_pz
        if (is_first_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_pz[dev]+pixels_per_slice*j, h_pz+pixels_per_slice*(buffer_length-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+4]);
            }
        }
        if (is_last_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_pz[dev]+bytes_device[dev]+pixels_per_slice*j, h_pz+pixels_per_slice*(image_size[2]-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+4]);
            }
        }
        // cudaMemsetAsync(d_pz[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+4]);//zhouenze
        cudaMemcpyAsync(d_pz[dev]+offset_device[dev], h_pz+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+4]);
    }
    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]);
        cudaStreamSynchronize(stream[dev*nStream_device+1]);
        // New lines: same mirror fill for d_src
        if (is_first_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_src[dev]+pixels_per_slice*j, src+pixels_per_slice*(buffer_length-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
            }
        }
        if (is_last_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_src[dev]+bytes_device[dev]+pixels_per_slice*j, src+pixels_per_slice*(image_size[2]-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
            }
        }
        //cudaMemsetAsync(d_src[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+1]);//zhouenze
        cudaMemcpyAsync(d_src[dev]+offset_device[dev], src+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
    }

Here is the result. Left image: initial denoised end slice (has a ghost). Middle image: denoised end slice using zeros as buffer data (no ghost, but severe noise). Right image: denoised end slice using flipped data as buffer data (no ghost, but a slight noise enhancement compared to the initial image; why?). Note that these methods differ only in the slices at the two ends; all other slices are identical. [image]

AnderBiguri commented 1 month ago

Hi @zezisme, I am not sure I understood the second issue or these last images very well. Can you edit the text to explain them a bit more?

zezisme commented 1 month ago

This is the flow chart of the second problem, but I am not sure whether this problem has a big impact on denoising. [image]

AnderBiguri commented 1 month ago

Right, let me rephrase it to see if I fully understood:

The issue you are highlighting is that after the first split of the image is processed, the second one will use updated data as a buffer, rather than the original data.

Yes, you are correct then; this is what happens. Looking at my notes, I seem to have tested this when I coded the multi-GPU code, and I decided that even if this is wrong, it has a limited impact on the denoising, so I was not going to do much about it. The alternative requires a full extra copy of the image in memory, which I decided was too costly for the little effect it had.
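
Roughly, that alternative would look like the sketch below (untested; h_u_snap is an illustrative name, and the duals h_px/h_py/h_pz would need the same treatment): snapshot the host arrays once per outer iteration and stage every chunk from the snapshot, so chunk sp+1 never reads host data that chunk sp has already updated.

    #include <cstring>

    // Untested sketch: stage chunks from a per-iteration snapshot.
    void snapshot_host(const float* h_u, float* h_u_snap, size_t total_pixels)
    {
        std::memcpy(h_u_snap, h_u, total_pixels * sizeof(float)); // once per iteration
        // Then, for each chunk sp, the HostToDevice copies would read h_u_snap:
        //   cudaMemcpyAsync(d_u[dev]+offset_device[dev], h_u_snap+offset_host[dev],
        //                   bytes_device[dev]*sizeof(float),
        //                   cudaMemcpyHostToDevice, stream[...]);
        // while results are still written back into the live h_u.
    }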

However, this should have no effect on the end slices, I think. I believe the end slices are a problem of the boundary condition. This code assumes Neumann boundary conditions (i.e. the gradient is zero in the slice direction at the first/last slice). This means that the gradient at the edges will always be smaller than everywhere else, and so will the denoising strength. Perhaps using periodic boundary conditions would fix the issue (if this is the issue). Critically, it would mean changing these lines to access the next (rather than the previous) pixel value in the else case:

https://github.com/CERN/TIGRE/blob/master/Common/CUDA/tvdenoising.cu#L70-L123
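
Concretely (my notation; Delta z is the slice spacing and N_z the number of slices), the Neumann assumption makes the z component of the gradient

    $$ (\nabla u)_z(x,y,z) = \begin{cases} \dfrac{u(x,y,z+1) - u(x,y,z)}{\Delta z}, & z+1 < N_z,\\[4pt] 0, & z+1 = N_z, \end{cases} $$

so the gradient magnitude is systematically underestimated on the first/last slice, and the TV term smooths those slices less.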

zezisme commented 1 month ago

Yes, I agree that the second problem is not the cause of the ghost! The main reason is the first problem mentioned before; here is the flow chart of the main problem. [image] I have also tested the new code and have some results; you can have a look at the GitHub issue page (https://github.com/CERN/TIGRE/issues/594).

AnderBiguri commented 1 month ago

I just updated TIGRE to solve the ghosting issue. Does the change (it's from your code) solve it?

zezisme commented 1 month ago

As for the boundary buffer expansion method, I have a new approach to solve the noise enhancement problem. Here is the new code (it just copies the boundary slice as the buffer slice, to avoid the negative z-axis gradient caused by the mirrored slice) for https://github.com/CERN/TIGRE/blob/cf427822287de64f6201bd70ad9d689f02616813/Common/CUDA/tvdenoising.cu#L402-L443. The following change may only work on a single GPU; with multiple GPUs it may go wrong!

    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]);
        cudaStreamSynchronize(stream[dev*nStream_device+1]);
        // New lines (single GPU only): replicate the boundary slice into the buffer of d_u
        if (is_first_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_u[dev]+pixels_per_slice*j, h_u+pixels_per_slice*j, pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
            }
        }
        if (is_last_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_u[dev]+bytes_device[dev]+pixels_per_slice*j, h_u+pixels_per_slice*(image_size[2]-buffer_length+j), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
            }
        }
        cudaMemcpyAsync(d_u[dev]+offset_device[dev], h_u+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
    }
    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]);
        cudaStreamSynchronize(stream[dev*nStream_device+2]);
        // New lines: same replication for d_px
        if (is_first_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_px[dev]+pixels_per_slice*j, h_px+pixels_per_slice*j, pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+2]);
            }
        }
        if (is_last_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_px[dev]+bytes_device[dev]+pixels_per_slice*j, h_px+pixels_per_slice*(image_size[2]-buffer_length+j), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+2]);
            }
        }
        cudaMemcpyAsync(d_px[dev]+offset_device[dev], h_px+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+2]);
    }
    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]);
        cudaStreamSynchronize(stream[dev*nStream_device+3]);
        // New lines: same replication for d_py
        if (is_first_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_py[dev]+pixels_per_slice*j, h_py+pixels_per_slice*j, pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+3]);
            }
        }
        if (is_last_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_py[dev]+bytes_device[dev]+pixels_per_slice*j, h_py+pixels_per_slice*(image_size[2]-buffer_length+j), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+3]);
            }
        }
        cudaMemcpyAsync(d_py[dev]+offset_device[dev], h_py+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+3]);
    }
    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]);
        cudaStreamSynchronize(stream[dev*nStream_device+4]);
        // New lines: same replication for d_pz
        if (is_first_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_pz[dev]+pixels_per_slice*j, h_pz+pixels_per_slice*j, pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+4]);
            }
        }
        if (is_last_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_pz[dev]+bytes_device[dev]+pixels_per_slice*j, h_pz+pixels_per_slice*(image_size[2]-buffer_length+j), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+4]);
            }
        }
        cudaMemcpyAsync(d_pz[dev]+offset_device[dev], h_pz+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+4]);
    }
    for (dev = 0; dev < deviceCount; dev++){
        cudaSetDevice(gpuids[dev]);
        cudaStreamSynchronize(stream[dev*nStream_device+1]);
        // New lines: same replication for d_src
        if (is_first_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_src[dev]+pixels_per_slice*j, src+pixels_per_slice*j, pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
            }
        }
        if (is_last_chunk){
            for (unsigned int j=0; j<buffer_length; j++){
                cudaMemcpyAsync(d_src[dev]+bytes_device[dev]+pixels_per_slice*j, src+pixels_per_slice*(image_size[2]-buffer_length+j), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
            }
        }
        cudaMemcpyAsync(d_src[dev]+offset_device[dev], src+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
    }

zezisme commented 1 month ago

I have read your modification, and it does indeed solve the ghost problem. However, the assumption that the boundary gradient is 0 (Neumann boundary conditions) makes the boundary slice noise larger. But this may not be a very important problem.

AnderBiguri commented 1 month ago

Hi @zezisme. I only added the memory zeroing part, not the part that you mentioned works only on one GPU.

For now, you shared 3 pieces of code:

So, questions for you:

  1. Given just the first change, does this fix your ghosting?
  2. The second piece of code, as we discussed, would indeed fix the fact that different splits don't access exactly the same data. However, do you see a big difference if you don't do anything about it?
  3. Is the third piece of code needed?

zezisme commented 1 month ago

Answers to the three questions:

  1. The first change can indeed solve the ghost problem!
  2. Sorry for my unclear expression. All the other code is there to solve the problem of boundary slice noise (due to the Neumann boundary conditions), not problem 2 in https://github.com/CERN/TIGRE/issues/594#issuecomment-2423776348. (I don't think that is an important issue either.)
  3. The third piece of code is not needed under Neumann boundary conditions.

AnderBiguri commented 1 month ago

@zezisme Thanks a lot for the explanation!

I think the best way to tackle the fact that the edge slices have different strength is to change the math, not the memory.

Can you see if changing the lines I linked before to the following fixes the problem? Essentially this changes the boundary conditions from Neumann to mirror.

    __device__ __inline__
    float divergence(const float* pz, const float* py, const float* px,
                     long z, long y, long x, long depth, long rows, long cols,
                     float dz, float dy, float dx)
    {
        long size2d = rows*cols;
        long idx = z * size2d + y * cols + x;
        float _div = 0.0f;

        if ( z - 1 >= 0 ) {
            _div += (pz[idx] - pz[(z-1)*size2d + y*cols + x]) / dz;
        } else {
            _div += (pz[idx] - pz[(z+1)*size2d + y*cols + x]) / dz;
        }

        if ( y - 1 >= 0 ) {
            _div += (py[idx] - py[z*size2d + (y-1)*cols + x]) / dy;
        } else {
            _div += (py[idx] - py[z*size2d + (y+1)*cols + x]) / dy;
        }

        if ( x - 1 >= 0 ) {
            _div += (px[idx] - px[z*size2d + y*cols + (x-1)]) / dx;
        } else {
            _div += (px[idx] - px[z*size2d + y*cols + (x+1)]) / dx;
        }

        return _div;
    }

    __device__ __inline__
    void gradient(const float* u, float* grad,
                  long z, long y, long x,
                  long depth, long rows, long cols,
                  float dz, float dy, float dx)
    {
        long size2d = rows*cols;
        long idx = z * size2d + y * cols + x;

        float uidx = u[idx];

        if ( z + 1 < depth ) {
            grad[0] = (u[(z+1)*size2d + y*cols + x] - uidx) / dz;
        } else {
            grad[0] = (u[(z-1)*size2d + y*cols + x] - uidx) / dz;
        }

        if ( y + 1 < rows ) {
            grad[1] = (u[z*size2d + (y+1)*cols + x] - uidx) / dy;
        } else {
            grad[1] = (u[z*size2d + (y-1)*cols + x] - uidx) / dy;
        }

        if ( x + 1 < cols ) {
            grad[2] = (u[z*size2d + y*cols + (x+1)] - uidx) / dx;
        } else {
            grad[2] = (u[z*size2d + y*cols + (x-1)] - uidx) / dx;
        }
    }
zezisme commented 1 month ago

Hello @AnderBiguri: this doesn't seem to work; there is still strong noise at the boundary slices. I guess it is because, with the buffer present, the actual current boundary slice is not detected.

AnderBiguri commented 1 month ago

Ah, you may be right... Let me think about it a little more. Indeed, I think we always call the kernels with a z size of size_slice + buffer*2, which perhaps we should not do for the edge chunks (first and last).
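
A rough sketch of what that could look like (untested; grad_z_bc and lastValidZ are illustrative names, not the actual code): tell the kernel where the last valid slice of the whole volume is, so the boundary condition is applied at the true volume edge rather than at the end of the staged chunk.

    // Untested sketch: apply the zero-gradient (Neumann) condition only at
    // the true volume edge; buffer slices keep the forward difference.
    __device__ __inline__
    float grad_z_bc(const float* u, long z, long y, long x,
                    long rows, long cols, float dz, long lastValidZ)
    {
        long size2d = rows * cols;
        long idx = z * size2d + y * cols + x;
        if (z < lastValidZ)
            return (u[idx + size2d] - u[idx]) / dz; // forward difference in z
        return 0.0f;                                // Neumann at the real edge
    }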

zezisme commented 1 month ago

Yes. Do you have a better solution now? I have tried passing a parameter to the kernels to identify the real boundary slice, but I ran into a new problem and I am a little confused now.

AnderBiguri commented 1 month ago

@zezisme Maybe your solution is the best one, but I need some time to test it and play with multi-GPU cases to make sure it's right.

An alternative for now is to add some slices to the input image of im3DDenoise() and remove them after the GPU call. It is computationally more expensive, but it should do the trick if you need results soon. Otherwise, it's possible that what you are doing (filling the buffer with edge slices) is the right way to go.
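
At the array level, that padding amounts to replicating the edge slices before denoising and cropping afterwards, something like this sketch (illustrative, untested; pad_z and nPad are my names):

    #include <vector>
    #include <cstring>

    // Untested sketch: replicate the first and last z-slices nPad times.
    // vol is nz slices of nxy pixels each, slice-major in z.
    std::vector<float> pad_z(const float* vol, size_t nxy, size_t nz, size_t nPad)
    {
        std::vector<float> out((nz + 2*nPad) * nxy);
        for (size_t j = 0; j < nPad; ++j)  // leading copies of slice 0
            std::memcpy(&out[j*nxy], vol, nxy*sizeof(float));
        std::memcpy(&out[nPad*nxy], vol, nz*nxy*sizeof(float)); // original body
        for (size_t j = 0; j < nPad; ++j)  // trailing copies of slice nz-1
            std::memcpy(&out[(nPad+nz+j)*nxy], &vol[(nz-1)*nxy], nxy*sizeof(float));
        return out; // denoise this, then crop nPad slices from each end
    }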

zezisme commented 1 month ago

OK, thank you very much! I will go with the solution of adding some slices to the image before TV denoising, and I look forward to your subsequent, more effective solutions!

AnderBiguri commented 4 weeks ago

Hi @zezisme, I will be quite busy over the next few weeks, so if you want to try to produce a nice piece of code, also multi-GPU compatible, that fixes the last error, feel free to give it a shot and make a Pull Request.

If not, no worries, I'll try to tackle this later on. Just that I won't have time for it in the next few weeks.

AnderBiguri commented 2 weeks ago

I made a PR; can you test it, @zezisme, to see if it fixes the issue? I don't have a computer with multiple GPUs and enough RAM to test it right now. https://github.com/CERN/TIGRE/pull/609