Problem
VoxelNet acts on tensors of shape (volumes, voxels, muons, features) and, as part of its graph construction, expands these into (volumes, voxels, muons, muons, new features) before collapsing back to the original shape. Although the forward method loops over the volumes (so the shape actually being processed is just (voxels, muons, features)), the memory consumption is still very high.
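To make the blow-up concrete, here is a minimal sketch of the pairwise expansion with made-up sizes (the actual VoxelNet dimensions and feature construction will differ):

```python
import torch

# Hypothetical sizes, purely illustrative.
n_voxels, n_muons, n_feats = 1000, 128, 16

x = torch.randn(n_voxels, n_muons, n_feats)  # (voxels, muons, features) for one volume

# Broadcasting the muon axis against itself gives the pairwise tensor,
# i.e. a factor-n_muons increase in memory.
pairwise = x.unsqueeze(2) - x.unsqueeze(1)   # (voxels, muons, muons, features)

print(f"x:        {x.nelement() * x.element_size() / 2**20:.0f} MiB")         # ~8 MiB
print(f"pairwise: {pairwise.nelement() * pairwise.element_size() / 2**20:.0f} MiB")  # ~1000 MiB
```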
Potential solutions
Loop over voxels in the first part of the network
The first part of the network computes a muon representation per voxel, i.e. (voxels, muon representation), and this computation is independent of the other voxels, meaning the muon representations could be computed serially rather than in parallel. This reduces memory consumption at the cost of processing time.
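A minimal sketch of the serial version, assuming a hypothetical per_voxel_net module standing in for the first part of VoxelNet:

```python
import torch
from torch import nn

def muon_reps_serial(x: torch.Tensor, per_voxel_net: nn.Module) -> torch.Tensor:
    """Compute the muon representation voxel by voxel instead of all at once.

    x: (voxels, muons, features); `per_voxel_net` maps (muons, features) -> (rep_size,).
    Temporaries not needed for the backward pass are freed after each iteration,
    so the largest intermediates no longer have to exist for every voxel at once.
    """
    reps = [per_voxel_net(x[i]) for i in range(x.shape[0])]
    return torch.stack(reps)  # (voxels, rep_size)
```

A middle ground would be to loop over chunks of voxels rather than single voxels, which recovers some of the parallelism while keeping peak memory bounded.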
Compile parts of the network
PyTorch makes it "easy" to compile parts of the network in C++ and CUDA. According to Jan Kieseler, this heavily reduces memory consumption and processing time, at the cost of development time and model flexibility. He has sent me some examples, and I have also gone through the official PyTorch tutorial for writing and compiling kernels. The main difficulty is that the backward pass to compute the gradients must also be written manually, and how well it is written can have a heavy impact on performance: in my testing of PyTorch's examples, the backward pass was actually slower when compiled, but the forward pass was slightly quicker.
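To illustrate what writing the backward manually involves, here is a pure-Python analogue using torch.autograd.Function; a compiled C++/CUDA extension (e.g. built with torch.utils.cpp_extension) follows the same forward/backward pattern, just with the bodies written as kernels. The operation chosen here (squared pairwise distances) is illustrative, not VoxelNet's actual op:

```python
import torch

class PairwiseDist2(torch.autograd.Function):
    """Squared pairwise distances between muons, with a hand-written backward.

    Illustrative only: a compiled kernel would implement the same two methods
    in C++/CUDA, and getting the backward right (and fast) is the hard part.
    """

    @staticmethod
    def forward(ctx, x):                          # x: (muons, features)
        diff = x.unsqueeze(0) - x.unsqueeze(1)    # (muons, muons, features)
        ctx.save_for_backward(diff)
        return diff.pow(2).sum(-1)                # (muons, muons)

    @staticmethod
    def backward(ctx, grad_out):                  # grad_out: (muons, muons)
        (diff,) = ctx.saved_tensors
        g = grad_out.unsqueeze(-1) * diff         # (muons, muons, features)
        return 2.0 * (g.sum(0) - g.sum(1))        # d(loss)/d(x): (muons, features)

# Usage: PairwiseDist2.apply(x)
```

torch.autograd.gradcheck is handy for checking a hand-written backward against finite differences before porting it to a kernel.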
There are several parts of the GNN that are candidates for compilation:
When expanding to the (voxels, muons, muons, new features) tensor, what we actually want is (voxels, muons, k-nearest muons, new features), but we still have to compute the distances between all pairs of muons. This kNN indexing could be compiled to reduce memory consumption (a reference version is sketched in the first snippet after this list).
(voxels, muons, k-nearest muons, new features) is then later collapsed by aggregating across the k-nearest muons into (voxels, muons, aggregate features). The whole kNN + aggregation could be compiled to save even more memory, at the cost of some model flexibility (the same snippet shows the aggregation step).
In the graph-collapse stage, where we convert (voxels, muons, aggregate features) to (voxels, muon representation), the muon features go through a self-attention step which internally computes a temporary (voxels, muons, muons) tensor. This could also be compiled to save memory (see the second snippet after this list).
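For reference, a plain PyTorch sketch of the kNN indexing and aggregation that such a kernel would fuse (hypothetical function, not the actual VoxelNet feature construction):

```python
import torch

def knn_aggregate(x: torch.Tensor, k: int) -> torch.Tensor:
    """Un-fused reference for the kNN + aggregation stage.

    x: (voxels, muons, features). The dense (voxels, muons, muons) distance
    matrix is materialised only to pick the k nearest muons; a compiled kernel
    could emit the (voxels, muons, k) indices, or even the final aggregate,
    without ever storing that intermediate.
    """
    dist = torch.cdist(x, x)                           # (voxels, muons, muons)
    idx = dist.topk(k, dim=-1, largest=False).indices  # (voxels, muons, k), includes self

    # Gather each muon's k nearest neighbours' features: (voxels, muons, k, features).
    neigh = torch.gather(
        x.unsqueeze(1).expand(-1, x.shape[1], -1, -1),
        dim=2,
        index=idx.unsqueeze(-1).expand(-1, -1, -1, x.shape[-1]),
    )

    # Aggregate across the k neighbours (mean here; max or sum are equally valid choices).
    return neigh.mean(dim=2)                           # (voxels, muons, aggregate features)
```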
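And a sketch of where the self-attention temporary appears, assuming a standard scaled dot-product formulation (the real VoxelNet attention may differ in its details):

```python
import torch
import torch.nn.functional as F

def muon_self_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Plain self-attention over the muons of each voxel.

    q, k, v: (voxels, muons, d). `scores` is the (voxels, muons, muons) temporary;
    a fused/compiled kernel (or F.scaled_dot_product_attention in recent PyTorch
    versions) avoids materialising it.
    """
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # (voxels, muons, muons)
    return F.softmax(scores, dim=-1) @ v                   # (voxels, muons, d)
```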