CERN / TIGRE

TIGRE: Tomographic Iterative GPU-based Reconstruction Toolbox
BSD 3-Clause "New" or "Revised" License

SART crashes when running on more than 2 GPU nodes #132

Closed xliu68 closed 3 years ago

xliu68 commented 5 years ago

Expected Behavior

I was running a SART reconstruction on an HPC cluster (CentOS). I can specify the number of CPU and GPU nodes that I use.

Actual Behavior

When I run SART with more than 2 GPU nodes (e.g. 4 or 8), MATLAB crashes with either a "segmentation fault" (invalid next size) or a "double free" error. It also shows "Picked up _JAVA_OPTIONS: -Xmx256m". I have not observed this with FDK or OS-SART. Reducing the number of GPU nodes to 2 avoids the crashes for now. I am not sure whether this behavior is expected.

Code to reproduce the problem (If applicable)

geo=defaultGeometry('nVoxel',[128;128;128]);                     
angles=linspace(0,2*pi,100);
head=headPhantom(geo.nVoxel);
projections=Ax(head,geo,angles,'interpolated');
img = SART(projections,geo,angles,1);

Specifications

AnderBiguri commented 5 years ago

@xliu68 Thanks for the report! I have limited access to multi-GPU machines with more than 2 nodes; I will however put this on my priority list. Does this only happen with SART?

xliu68 commented 5 years ago

I have only observed this in SART so far; I haven't tested all algorithms.

bnel1201 commented 4 years ago

I was experiencing this same issue with my 4-GPU box, and I think I can narrow the problem down a bit more. The issue seems to be with the forward projector Ax.m and the corresponding CUDA calls to interpolation_projection in ray_interpolated_projection.cu. I get a memory access error any time I try to forward-project fewer angles than my GPU count minus 1. I'm not sure where the minus 1 comes from; with my 4-GPU setup I get access errors when forward-projecting 1 or 2 angles, but not with 3 or more.

SART brings this problem up because it forward-projects a single projection angle at a time, so my workaround for the multi-GPU case is to use OS-SART, where a batch of angles is used.
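The batching difference described above can be sketched in Python (a hypothetical helper, not TIGRE code): SART ends up calling the forward projector with one angle per call, while OS-SART always passes a block, which is why only SART drops below the GPU count.

```python
def projection_batches(nangles, blocksize):
    """Angle indices passed to each forward-projection call."""
    return [list(range(i, min(i + blocksize, nangles)))
            for i in range(0, nangles, blocksize)]

# SART: 100 calls with a single angle each -> each call has fewer
# angles than a 4-GPU box and triggers the crash described above
sart_batches = projection_batches(100, 1)

# OS-SART with a blocksize of 20: every call carries >= 4 angles
ossart_batches = projection_batches(100, 20)

print(len(sart_batches), len(sart_batches[0]))      # 100 1
print(len(ossart_batches), len(ossart_batches[0]))  # 5 20
```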

The code below reproduces the source of the problem for my 4-GPU box:

angles=linspace(0,2*pi, gpuDeviceCount-2);
head=headPhantom(geo.nVoxel);
projections=Ax(head,geo,angles,'interpolated');

Some solutions I could see: adding a warning to SART if gpuDeviceCount > 2 that points the user to OS_SART, or digging into the CUDA file to handle the rare case where the number of projection angles is less than the number of GPU nodes.

AnderBiguri commented 4 years ago

@bnel1201 The problem must be that I am freeing a non-allocated pointer in the forward projector... Runtime errors make the code crash. The logic behind the partitioning is a bit obfuscated, so it's hard even for me to sit down and have a look at it. I may have some free time in August 2020, so I will try to see if I can spot the mistake. Your debugging example will certainly help, thanks!

AnderBiguri commented 4 years ago

The error must be either here:

https://github.com/CERN/TIGRE/blob/master/MATLAB/Source/ray_interpolated_projection.cu#L459-L462

or here

https://github.com/CERN/TIGRE/blob/master/MATLAB/Source/ray_interpolated_projection.cu#L478-L487

Or both.

There is logic to handle the case where the number of angles is not divisible by the number of GPUs: it basically assigns the last GPU fewer angles (or zero) to project, and that GPU is treated as a special case. This works fine when nangles=100, but if there are very few angles, few enough to leave 2 GPUs without a job, then it crashes, because only the last GPU is checked as a special case. The entire logic for dealing with this must be changed... I need to think about how.
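A short Python sketch (hypothetical names, assuming the scheme described above: ceil(nangles/ngpus) angles per GPU, with only the last GPU special-cased) shows how small angle counts produce an invalid assignment:

```python
import math

def naive_partition(nangles, ngpus):
    """Mimic the described scheme: every GPU gets ceil(nangles/ngpus)
    angles, and only the LAST GPU is treated as a special case."""
    per_gpu = math.ceil(nangles / ngpus)
    counts = [per_gpu] * ngpus
    # special-case the last GPU only: it gets whatever remains
    counts[-1] = nangles - per_gpu * (ngpus - 1)
    return counts

# nangles=100, ngpus=4: fine, the remainder logic never misbehaves
print(naive_partition(100, 4))  # [25, 25, 25, 25]

# nangles=2, ngpus=4: ceil(2/4)=1 per GPU, but the last GPU is
# assigned 2 - 1*3 = -1 angles. A negative count like this is the
# kind of invalid state that ends in freeing/copying buffers that
# were never allocated.
print(naive_partition(2, 4))  # [1, 1, 1, -1]
```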

bnel1201 commented 4 years ago

Those do seem like likely culprits. Thanks for looking into this!

AnderBiguri commented 4 years ago

@bnel1201 https://github.com/CERN/TIGRE/pull/175 tries to solve this. I have only changed the interpolated version, but I suspect the ray-voxel projection has the same issue. I only have access to a 1 GPU machine, can you test this?

bnel1201 commented 4 years ago

> @bnel1201 https://github.com/CERN/TIGRE/pull/175 tries to solve this. I have only changed the interpolated version, but I suspect the ray-voxel projection has the same issue. I only have access to a 1 GPU machine, can you test this?

Yes, I'll have access to the multi-gpu machine this weekend. I'll let you know how things go!

bnel1201 commented 4 years ago

Progress! Yep, Ax_mex with the 'interpolated' ptype now works on my 4-GPU box! You are also correct that the ray-voxel projection has the same issue.

To summarize for my 4-GPU box:

Single interpolated projection now works:

geo=defaultGeometry('nVoxel',[128;128;128]);                     
angles=linspace(0,2*pi, 1);
head=headPhantom(geo.nVoxel);
projections=Ax(head,geo,angles,'interpolated');

However, a single ray-voxel projection still hits a memory access violation as before:

geo=defaultGeometry('nVoxel',[128;128;128]);                     
angles=linspace(0,2*pi,1);
head=headPhantom(geo.nVoxel);
projections=Ax(head,geo,angles,'ray-voxel');

Thus if you can make a similar fix to siddon_ray_projection in Siddon_projection.cu, that should do it, and I'll be happy to retest to confirm!

AnderBiguri commented 4 years ago

Ah, fantastic news! I hope I haven't introduced new bugs for another set of GPUs, but it doesn't seem so. I'll fix this ASAP!

AnderBiguri commented 4 years ago

@bnel1201 done (in the branch)!

Can you also try something like geo=defaultGeometry('nVoxel',[2000;2000;2000]); if your RAM allows?

bnel1201 commented 4 years ago

Everything works! Great job! I ran demo d07_Algorithms02 from start to finish with both

  1. geo=defaultGeometry('nVoxel',[128;128;128]); and
  2. geo=defaultGeometry('nVoxel',[1000;1000;1000]);

and both ran start to end on my 4-GPU box with no problems!

I couldn't generate head=headPhantom([2000; 2000; 2000]); because MATLAB's interp3 ran out of memory, but nVoxel = 1250^3 seemed to be the memory limit for my GPUs, and that worked as well! I'm working on getting access to a machine with more memory, but as far as I can tell this fix gets the job done!

AnderBiguri commented 4 years ago

Ah, great news! Just to make sure: instead of running 2000^3 with head, can you run it by generating a white image with ones? That would skip the interp3 memory issue and properly test the upper bound. I just want to be sure the image is split into multiple pieces per GPU.

bnel1201 commented 4 years ago

I'm sorry I couldn't get back to you on this sooner! Our multi-GPU machine was tied up last week. I ran your test and got the following when using a 2000^3 array of single-type ones:

Stream creation fail 
Error using Ax_mex
out of memory

Error in Ax (line 59)
projections=Ax_mex(img,geo,angles,ptype);

Error in d07_Algorithms02 (line 38)
projections=Ax(head,geo,angles,'interpolated');

The machine uses 4 Quadro M6000 GPUs with 12 GB each.

I don't know how many memory allocations are made in the CUDA code, so I can't say whether this 'out of memory' is expected given the resources available.
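For scale, a rough back-of-the-envelope calculation (an assumption-laden sketch, not the actual TIGRE allocation pattern) shows why the volume cannot fit on any single card and must be split:

```python
# A single-precision 2000^3 volume is ~29.8 GiB, far larger than one
# 12 GB GPU, so the code must split the volume across devices; whether
# the 'out of memory' is expected then depends on how many extra
# buffers each split allocates (unknown here).
nvoxel = 2000 ** 3
bytes_per_voxel = 4  # single precision
gib = nvoxel * bytes_per_voxel / 2**30
print(round(gib, 1))  # 29.8
```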

However, when I scale down to 1200^3, everything runs as expected, and when I look at system resource usage I see all GPUs being used.

(screenshot: GPU activity during the 1200^3 run, all four GPUs active)

I see you've merged with master, which is great! I'll keep testing with my data and post any issues if anything comes up. Thanks again for your effort in getting this working!

AnderBiguri commented 4 years ago

@bnel1201 Hmm, I'll try to investigate that more; it's not an expected error... unless someone else was using part of the GPU.

bnel1201 commented 4 years ago

It is a shared machine with users remoting in, however according to Task Manager I'm the only one using resources at the moment.

AnderBiguri commented 4 years ago

@bnel1201 I have pushed a couple of extra error checks to the main code. Could you try again? 2000^3 should definitely work on your 4 GPU system

bnel1201 commented 4 years ago

Hi again! Sorry for the delay; I'm working on getting access to another multi-GPU machine so I can test things out more frequently.

Another strange issue I ran into with the same 4-GPU machine described above: when I try forward-projecting with angles=linspace(0,2*pi, N*180 + 1); or angles=linspace(0,2*pi, N*180 + 2); for N = 1, 2, 3, 4... (N = 0 works, and so do N*180 + 3, N*180 + 4, ...), I get a similar memory access violation as described above.

Here's the exact code:
%% Define Geometry
geo=defaultGeometry('nVoxel',[128;128;128]);                     

%% Load data and generate projections 
% see previous demo for explanation
N = 1;
angles=linspace(0,2*pi, N*180 + 1);
head=headPhantom(geo.nVoxel);
projections=Ax(head,geo,angles,'interpolated');
noise_projections=addCTnoise(projections);

This seems like a similar issue of angles not being shared properly between GPUs.

AnderBiguri commented 4 years ago

@bnel1201 Hmm, I may have broken more things than I fixed then! I should not have closed this prematurely.

I need to figure out if I can get access to a machine with multiple GPUs myself; otherwise I can't keep pushing stuff without debugging it. Thanks for letting me know, I'll see if I can fix it ASAP!

bnel1201 commented 4 years ago

Sounds good! Sorry, I should have tested this myself sooner, but I was excited when I was able to run the SART demos to the end. I only bumped into this when trying out my own data, which often contains 361, 721, or 1441 projections.

I also tried the 2000^3 experiment (shown below) and still get the

Error using Ax_mex
out of memory

error.

geo=defaultGeometry('nVoxel',[2000;2000;2000]); 
angles=linspace(0,2*pi, 1);
% head=headPhantom(geo.nVoxel);
head=ones(geo.nVoxel', 'single');
projections=Ax(head,geo,angles,'interpolated');
noise_projections=addCTnoise(projections);

Best of luck! Let me know if there's anything else you want me to try until you get access to another machine.

genusn commented 3 years ago

I tried to reproduce this error on my heterogeneous GPUs (2x 2080 + 1070). In ray_interpolated_projection.cu, I commented out the GPU name check lines and added

*mem_GPU_global = 500*1000*1000;

at the end of the function checkFreeMemory(). Running the MATLAB script

clear;
close all;
sizeVol = 256;  % 512;
geo=defaultGeometry('nVoxel',[sizeVol;sizeVol;sizeVol]);
head=ones(geo.nVoxel', 'single');
for k=1:100  % dies at k=28 with an access violation
    disp(k);
    angles=linspace(0,2*pi, k);
    projections=Ax(head,geo,angles,'interpolated');
end    

MATLAB always crashes at k=28 with

Abnormal termination:
Access violation

Here, I have confirmed that splits = 1 if sizeVol = 256 and splits = 2 if sizeVol = 512. Unlike the case of #186, the problem seems to occur regardless of the value of splits.

Setting deviceCount = 1, 2, 3 in interpolation_projection() in ray_interpolated_projection.cu, I found that the crash occurs only when deviceCount = 3.

I guess that I have reproduced this issue. I am reporting it because I think the fact that it is "k=28" is interesting.

P.S. The trials are based on the current master.

AnderBiguri commented 3 years ago

@genusn Thanks, this helps a lot! Since this issue was opened I did attempt to fix it, but I do not have access to a dual-GPU machine, so it's hard for me to figure out what may be wrong. Any test helps to pinpoint the issue.

I am quite sure that the issue is related to a bad memfree somewhere at certain conditions, but I can't seem to figure it out where...

genusn commented 3 years ago

I ported the MATLAB ray_interpolated_projection.cu to Python to accelerate this investigation. (The MATLAB IDE crashes when the mex crashes, but Visual Studio Code with Python does not crash when TIGRE crashes.) I also wrote a Python version of the script above:

volSize = 128
geo = tigre.geometry(mode='cone', nVoxel=np.array([volSize,volSize,volSize]),default=True)
geo.dDetector = np.array([0.8, 0.8])*1               # size of each pixel            (mm)
geo.sDetector = geo.dDetector * geo.nDetector
head=np.ones(geo.nVoxel, dtype=np.float32)
for nangles in range(1, 100):
    print("nangles = {}".format(nangles))
    angles = np.linspace(0, 2 * np.pi, nangles, dtype=np.float32)
    proj = tigre.Ax(head,geo,angles, krylov="interpolated")

After checking that the Python version crashes under the same conditions as the MATLAB version (I could not check that it is definitely due to the access violation, though), I tried changing some parameters and found that

  1. it crashes when nangles = N*PROJ_PER_BLOCK*3 + 1 (N=1,2,3...), and parameters such as the assigned memory (splits), PROJ_PER_BLOCK, and volSize do not affect the result.
  2. it crashes at https://github.com/CERN/TIGRE/blob/dcca52c56c020f8f5ffc2e6e3ea30dbce3d21423/MATLAB/Source/ray_interpolated_projection.cu#L470

Reading the code, I came to think that the following comment and lines are suspicious: https://github.com/CERN/TIGRE/blob/dcca52c56c020f8f5ffc2e6e3ea30dbce3d21423/MATLAB/Source/ray_interpolated_projection.cu#L468-L470

The number of projections assigned to the last device is >= 1 and < nangles_device in general. When nangles=28 and deviceCount=3, the number of angles per device is nangles_device = 10, and 10, 10, and 8 projections are assigned to devices 0, 1, and 2 respectively. Because PROJ_PER_BLOCK = 9, the number of angle sets noOfKernelCalls = 2. After the first kernel call, i.e. i=1, the number of projections in the previous set is 10 for devices 0 and 1, but 8 for device 2. Therefore, copying 10 projections on device 2 causes an access violation at L470.
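The arithmetic above can be simulated with a short Python sketch (hypothetical helper names; PROJ_PER_BLOCK = 9 as in the CUDA source; the unclamped branch mimics the described buggy behaviour, not the literal CUDA code):

```python
import math

PROJ_PER_BLOCK = 9  # block size used by the CUDA kernels

def block_sizes(nangles, device_count, clamped):
    """Projections handled per (device, kernel call).
    clamped=False mimics the described bug: the non-last-device formula
    is applied everywhere, so the last device 'copies' angles it was
    never assigned. clamped=True applies the proposed min() fix."""
    nangles_device = math.ceil(nangles / device_count)
    n_calls = math.ceil(nangles_device / PROJ_PER_BLOCK)
    sizes = {}
    for dev in range(device_count):
        for i in range(n_calls):
            proj_global = dev * nangles_device + i * PROJ_PER_BLOCK
            if clamped and dev + 1 == device_count:
                # proposed fix: never exceed the angles that exist
                n = max(0, min(PROJ_PER_BLOCK, nangles - proj_global))
            else:
                n = min(PROJ_PER_BLOCK, nangles_device - i * PROJ_PER_BLOCK)
            sizes[(dev, i)] = n
    return sizes

buggy = block_sizes(28, 3, clamped=False)
fixed = block_sizes(28, 3, clamped=True)
# Device 2 only owns angles 20..27 (8 of them), yet the unclamped
# formula tells it to handle 9 + 1 = 10 projections -> out-of-bounds.
print(buggy[(2, 0)], buggy[(2, 1)])  # 9 1
print(fixed[(2, 0)], fixed[(2, 1)])  # 8 0
```

With the clamp, the per-device totals sum to exactly nangles again.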

I replaced L470 with

if(dev+1==deviceCount) {    //is it the last device?
    // projections assigned to this device is >=1 and < nangles_device
    projection_this_block=min(PROJ_PER_BLOCK, nangles-proj_global);
} else {
    projection_this_block=PROJ_PER_BLOCK;
}

then the above program no longer crashes. I have not yet checked the resulting image. I also think this is not enough for the case where deviceCount > PROJ_PER_BLOCK, but I have no access to such a luxurious, gorgeous computer. :smile:

AnderBiguri commented 3 years ago

@genusn Ah, this is fantastic news! And I am not sure we need to worry about deviceCount > PROJ_PER_BLOCK; I am not sure it's even possible with today's computers!

Are you going to make a PR for this, or should I?

genusn commented 3 years ago

@AnderBiguri Thank you for your comments. Let me make the PR. I want to think a little bit more about the arguments of the min function.