LLNL / LEAP

comprehensive library of 3D transmission Computed Tomography (CT) algorithms with Python and C++ APIs and fully integrated with PyTorch
https://leapct.readthedocs.io
MIT License

Reconstruction Speed Issue of LEAP #99

Open winguphehe opened 2 weeks ago

winguphehe commented 2 weeks ago

I am a new user of LLNL/LEAP and I have encountered a speed issue. I use it for cone-beam CT reconstruction of 2048 projections with image size 3072 x 3072, and the output volume size is 3072 x 3072 x 3072. I use an NVIDIA 3090 with 24 GB of GPU memory, and the computer has 128 GB of RAM. However, the speed is relatively slow: the computation takes almost 8000 seconds. My question is, can I accelerate the calculation, and how? Thanks a lot!

1) How many chunks should I use, given the GPU and CPU memory configuration?
2) Can I write a C++ program that calls the functions directly to accelerate execution, since Python is relatively slow?
3) Which algorithm is fastest: FBP, SIRT, or some other algorithm?
4) Sometimes the iterative algorithms freeze the computer. How do I deal with that?
5) Is there any iterative algorithm that can run online while the scan is in progress, to accelerate the overall speed?
6) The chunked FBP_gpu runs out of GPU memory very easily, even when I set the number of chunks to 128.

    # initialization code is not listed
    import torch

    device = torch.device("cuda:" + str(leapct.get_gpu()))

    # copy the projection chunk to the GPU and allocate the output volume chunk there
    g_chunk_gpu = torch.from_numpy(g_chunk).to(device)
    f_chunk_gpu = torch.empty(f_chunk_cpu.shape, dtype=torch.float32, device=device)

    f_chunk_gpu = leapct_chunk.FBP_gpu(g_chunk_gpu, f_chunk_gpu)

    # copy the result back to the CPU and release the GPU memory
    f_chunk_cpu = f_chunk_gpu.cpu().detach().numpy()
    del g_chunk_gpu
    del f_chunk_gpu
    torch.cuda.empty_cache()

kylechampley commented 2 weeks ago

I just got two comments that seem like some kind of spam attack, so I deleted them. @winguphehe I'll respond to your post tomorrow. Thanks for posting it; I think I can help resolve your issue, but it is late, so I'll respond later.

kylechampley commented 2 weeks ago

It depends on a few factors, but FBP is 10-1000 times faster than iterative reconstruction. Also, iterative reconstruction requires somewhere between 2 and 5 times more memory. Your data is very large and since you have so many projections, it seems like using FBP is the best place to start. Your iterative reconstruction is likely hanging because it is trying to use too much memory.

Your projection data takes 72 GB of memory and your volume takes 108 GB, so in total that's 180 GB which is more memory than you have on your system. Your computer may allow you to perform an FBP reconstruction by paging some of the memory to your disk drive, but this will cause an enormous slow down. Your computer cannot predict when it needs certain pieces of the data so it is constantly fetching data off the disk and ditching it when it realizes it needs different data. This is likely the reason why the reconstruction is taking so long.
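
For reference, here is a quick back-of-the-envelope check of those numbers (a minimal sketch, assuming 32-bit floating point samples):

    # memory estimate for the data sizes in this issue, assuming 4-byte (float32) samples
    bytes_per_sample = 4

    numAngles, numRows, numCols = 2048, 3072, 3072   # projections
    numX, numY, numZ = 3072, 3072, 3072              # reconstruction volume

    projection_GB = numAngles * numRows * numCols * bytes_per_sample / 2**30
    volume_GB = numX * numY * numZ * bytes_per_sample / 2**30

    print(projection_GB)                 # 72.0
    print(volume_GB)                     # 108.0
    print(projection_GB + volume_GB)     # 180.0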

I would not use the FBP_gpu command. I think it confuses people and if it weren't for backward compatibility, I would remove this function. Just use the normal FBP command; by default it will use all GPUs on your system. For really big reconstruction problems, don't copy the projection and volume data to the GPU. Let the internals of LEAP handle this as it will do it much more efficiently. Just feed the FBP function numpy arrays or torch tensors where the data is on the CPU.
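
A minimal sketch of that recommended usage might look like the following (the cone-beam geometry and volume setup are elided; the array shapes shown assume projections ordered as (angle, row, col) and the volume ordered as (z, y, x), so adjust them to your own setup):

    import numpy as np
    from leapctype import *

    numAngles, numRows, numCols = 2048, 3072, 3072
    numX, numY, numZ = 3072, 3072, 3072

    leapct = tomographicModels()
    # ... set_conebeam(...) and the volume specification go here, as in your script ...

    # keep both arrays on the CPU; LEAP streams the data to/from the GPU(s) internally
    g = np.zeros((numAngles, numRows, numCols), dtype=np.float32)  # projection data
    f = np.zeros((numZ, numY, numX), dtype=np.float32)             # reconstruction volume
    # ... fill g with the measured projections ...

    leapct.FBP(g, f)  # by default this uses all GPUs on the system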

Now one way to perform FBP reconstruction of a large volume is to follow what I did in this demo script: d13_cropping_subchunking.py. Look at the part of the code for "whichDemo == 2". This will work OK, except that this demo script has one load the entire projection data into memory and then process the reconstruction of the volume in smaller chunks. This is non-ideal because your projection data is so large. Instead, I recommend not loading the entire projection data into memory and replacing this line:

g_chunk = leapct_chunk.cropProjections(rowRange, None, g)

with a line that only reads the detector row range specified by the rowRange variable from file. Doing all of this should provide a very, very large speed improvement.
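
One possible way to do that is sketched below: memory-map each TIFF and slice out only the requested rows. This assumes uncompressed single-page TIFF files (tifffile.memmap cannot map compressed data), an inclusive rowRange as in the demo script, and a hypothetical list of projection file names:

    import numpy as np
    import tifffile

    def read_projection_rows(fileNames, rowRange):
        # read only detector rows rowRange[0]..rowRange[1] from each projection file
        numRows = rowRange[1] - rowRange[0] + 1
        first = tifffile.memmap(fileNames[0], mode='r')
        g_chunk = np.empty((len(fileNames), numRows, first.shape[1]), dtype=np.float32)
        for i, fileName in enumerate(fileNames):
            page = tifffile.memmap(fileName, mode='r')  # only the touched rows are read from disk
            g_chunk[i, :, :] = page[rowRange[0]:rowRange[1]+1, :]
        return g_chunk

    # g_chunk = read_projection_rows(projectionFileNames, rowRange)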

If you do it this way, things will be nearly ideal and most of the Python code will run just as fast as the C++/CUDA because it is barely doing anything. There is only one remaining speed improvement possible and that has to do with how the data files are loaded. I am assuming that your data is saved as a sequence of tiff files. Well, the Python tiff file readers don't provide (at least I don't know how to do it in Python) a method to read just a subset of a file. If one replaced this with some library that could do this, then that would be pretty much the ideal situation with respect to speed.

I hope this helps. Good luck!

hws203 commented 1 week ago

About the speed of reading those projection files: I tested it with one single raw file that includes all projections, which is almost 10 times faster than reading each projection from a separate file. I guess that the Windows OS also loses some time opening each subsequent file, even when it is in the same folder, when the file cache is not active.
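
For example, if all projections are stored in one raw binary file (a hypothetical layout of numAngles x numRows x numCols float32 samples), a NumPy memory map lets one pull out just a detector row range without reading the whole file:

    import numpy as np

    numAngles, numRows, numCols = 2048, 3072, 3072
    rowRange = [0, 255]   # hypothetical inclusive detector row range for one chunk

    # hypothetical single raw file holding all projections as float32, shaped (numAngles, numRows, numCols)
    g_all = np.memmap('projections.raw', dtype=np.float32, mode='r',
                      shape=(numAngles, numRows, numCols))

    # only the slices that are actually accessed get read from disk
    g_chunk = np.array(g_all[:, rowRange[0]:rowRange[1]+1, :], dtype=np.float32)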

winguphehe commented 1 week ago

Yes, I have tested other iterative algorithms such as SIRT. The conclusion is that FBP is much faster and does not need any chunked processing. But SIRT needs chunked processing or else it will crash, and even then the speed is rather slow and the result is not as good as FBP.

I also found that FBP_gpu is inconvenient to use. Even a very small chunk of data needs a lot of GPU memory, and the memory soon runs out before the calculation finishes.

I also agree with hws203's comment on image reading. Reading a series of tif files takes a long time, almost 16 minutes to load 2048 images. I will try saving the data in a single file to save time.

I have tried the subchunking approach to run FBP with only part of the data, and I found that this approach does not save time when the overall volume size is set to 1000 x 1000 x 1000 (not 3072 x 3072 x 3072).

But I don't know why VG is so fast in processing: only half an hour is needed to perform the reconstruction with the same size of input images (image reading time NOT included). It also runs on a computer with 128 GB of memory and a 3090 card.

I will try to avoid reading the complete projections into g to test the speed, and I will report the result soon.

winguphehe commented 1 week ago

Well, I have tried chunked reconstruction. The reconstruction time is reduced by about 200 seconds when the volume is 1000 x 1000 x 1000: numChunks = 64 gives 857 s; 32 gives 831 s; 16 gives 821 s; 8 gives 1044 s. The image load time is about 260~280 s. I have implemented a multi-threaded approach to perform the reading and the GPU processing independently: 2-3 threads read data from file and 1 thread performs the GPU reconstruction. The overall time, reading included, is reduced to 680 s. The multi-threaded approach hits a bottleneck in reading data from file when the chunk size is small. Maybe storing the tiff files along rows would be a better choice for fast reading; for example, the first row of every projection is stored as one tiff file, and so on.
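
A minimal sketch of that kind of pipeline is shown below: a few reader threads feed a single reconstruction thread through a bounded queue. Here read_chunk and reconstruct_chunk are hypothetical placeholders for the file I/O and the per-chunk LEAP FBP call:

    import queue
    import threading

    chunk_queue = queue.Queue(maxsize=4)   # bounded so the readers don't exhaust RAM

    def reader(chunk_ids):
        # each reader thread loads the projection rows for its assigned chunks
        for chunk_id in chunk_ids:
            g_chunk = read_chunk(chunk_id)          # hypothetical file-reading helper
            chunk_queue.put((chunk_id, g_chunk))

    def reconstructor(numChunks):
        # a single thread owns the GPU and reconstructs chunks as they arrive
        for _ in range(numChunks):
            chunk_id, g_chunk = chunk_queue.get()
            reconstruct_chunk(chunk_id, g_chunk)    # hypothetical wrapper around LEAP's FBP
            chunk_queue.task_done()

    numChunks = 32
    chunk_ids = list(range(numChunks))
    readers = [threading.Thread(target=reader, args=(chunk_ids[i::2],)) for i in range(2)]
    worker = threading.Thread(target=reconstructor, args=(numChunks,))
    for t in readers + [worker]:
        t.start()
    for t in readers + [worker]:
        t.join()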

kylechampley commented 1 week ago

The numbers you post do not make sense; they are very, very slow. Are you sure LEAP is using the GPUs? These reconstruction times look more like how long it would take using CPU processing. A 1000^3 reconstruction should only take a few seconds. Could you run d99_speedTest.py and send me the reconstruction times?

If you send me your reconstruction script, I can take a look. You can also see what GPUs LEAP can see by running the leapct.print_parameters() function.

My computer is very old; it has two 1080 GPUs and the 1024^3 FBP reconstruction takes about 7-8 seconds.

winguphehe commented 1 week ago

The input is relatively large, 2048 projections of 3072 x 3072, and the output is 1000 x 1000 x 1000, on 128 GB of memory and a 3090 with 24 GB of GPU memory. The code is simple: just initialize the cone-beam geometry, set the volume size, and call the FBP function.

winguphehe commented 1 week ago

    ======== CT Cone-Beam Geometry ========
    number of angles: 2048
    number of detector elements (rows, cols): 3072 x 3072
    angular range: -360.000000 degrees
    detector pixel size: 0.139000 mm x 0.139000 mm
    center detector pixel: 1535.500000, 1535.500000
    sod = 200.000000 mm
    sdd = 1200.000000 mm
    tau = -0.137000 mm

    ======== CT Volume ========
    number of voxels (x, y, z): 1024 x 1024 x 1024
    voxel size: 0.069500 mm x 0.069500 mm x 0.069500 mm
    FOV: [-35.583998, 35.583998] x [-35.583998, 35.583998] x [-35.583998, 35.583998]

    ======== Processing Settings ========
    GPU processing on device 0
    GPU with least amount of memory: 22.768555 GB

winguphehe commented 1 week ago

I have run d99_speedTest.py on my computer and it takes only 7 seconds for the reconstruction, and the GPU is clearly working.

kylechampley commented 1 week ago

I see in the script you sent me that you have this line:

leapct.set_conebeam(numAngles, numRows, numCols, pixelSize, pixelSize, centerRow, centerRow, leapct.setAngleArray(numAngles, -360.0 if clockwise else 360 ), sod, sdd, tau)

I see here that you have centerRow listed twice. The second of these should be centerCol. Could this typo be the reason for your issue?

hws203 commented 1 week ago

I think that storing the data by row might not save time, because the total quantity of data saved is the same and the drive access time is the same too. How about using ReadFromFileAsync() with the overlapped option?

winguphehe commented 5 days ago

@kylechampley The centerRow and centerCol are both set at the center, so centerCol == centerRow; this is not the issue. However, I have already found the problem: it is all because of the memory shortage. Even with chunked reconstruction, a lot of memory is still needed. I have made a multithreaded approach to solve the problem and I am able to reach a speed similar to the commercial software: the reconstruction takes only 25 minutes, which is similar to VG.

kylechampley commented 3 days ago

That's great news! I only mentioned the centerRow issue because you said that centerRow did not work; I was trying to find areas where a typo was made.

Recently, I automated the chunk size in the LEAP-UI-GUI. The user specifies how much memory LEAP is allowed to use and then it sets the chunk size, so that it won't use too much memory. This can be done through the leapctserver class or through the GUI. Currently this repo lacks documentation, but I'll be adding that soon.
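
The underlying idea is roughly this kind of arithmetic (an illustrative sketch only, not the actual leapctserver logic): from the total data size and the memory the user allows LEAP to use, pick the smallest chunk count that keeps each chunk, plus some working overhead, under the budget.

    import math

    def estimate_num_chunks(total_GB, allowed_GB, overhead=1.5):
        # illustrative only: smallest number of chunks so that one chunk,
        # with some working overhead, fits in the allowed memory budget
        return max(1, math.ceil(overhead * total_GB / allowed_GB))

    # e.g. 72 GB of projections + 108 GB of volume with a 64 GB budget
    print(estimate_num_chunks(72 + 108, 64))  # -> 5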