Currently, the main work is split between a tracing kernel and a shading kernel. The shading kernel can suffer from a good amount of divergence, so it could be beneficial to split it into many small kernels, one for each kind of hit. This requires a clever way of storing the samples between kernels, since we cannot afford to use more memory than we already do.
Right now I am thinking of storing the samples in per-thread lists, where each thread keeps track of how many samples of each category it has. Additionally, I am thinking one could balance the lists with a small kernel in which each thread balances the workload in one warp. If this kernel can be made fast, the balancing could pay off.
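To make the idea a bit more concrete, here is a minimal host-side C++ model of the bookkeeping, not an actual kernel. Everything in it is an assumption for illustration: the category names, `count_categories`, `balance_warp` and the even-split policy are hypothetical and not taken from the existing code. On the GPU the per-warp total would come from a warp-level reduction rather than a loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <array>
#include <numeric>
#include <vector>

// Hypothetical hit categories; the real set depends on the shading cases.
enum HitCategory : std::uint8_t {
    kGeometry = 0,
    kSky = 1,
    kLight = 2,
    kNumCategories = 3
};

constexpr std::size_t kWarpSize = 32;

// Per-thread bookkeeping: number of queued samples in each category.
using CategoryCounts = std::array<std::uint32_t, kNumCategories>;

// Count the samples of one thread's list by category.
CategoryCounts count_categories(const std::vector<HitCategory>& thread_list) {
    CategoryCounts counts{};  // zero-initialised
    for (HitCategory c : thread_list) {
        assert(c < kNumCategories);
        ++counts[c];
    }
    return counts;
}

// Compute the balanced per-lane sample counts for one category in one warp.
// Input: how many samples each lane currently holds.
// Output: each lane gets total / lanes samples, and the first
// (total % lanes) lanes get one extra, so no lane differs by more than one.
std::vector<std::uint32_t> balance_warp(const std::vector<std::uint32_t>& per_lane) {
    const std::size_t lanes = per_lane.size();
    std::vector<std::uint32_t> balanced(lanes, 0);
    if (lanes == 0) {
        return balanced;
    }
    const std::uint32_t total =
        std::accumulate(per_lane.begin(), per_lane.end(), std::uint32_t{0});
    const std::uint32_t base = total / static_cast<std::uint32_t>(lanes);
    const std::uint32_t extra = total % static_cast<std::uint32_t>(lanes);
    for (std::size_t lane = 0; lane < lanes; ++lane) {
        balanced[lane] = base + (lane < extra ? 1u : 0u);
    }
    return balanced;
}
```

This only computes the target counts. Actually moving the samples between lanes, and doing so without extra memory, is the part that would decide whether the balancing kernel is fast enough.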
Note that with this, I think support for multiple samples per pixel should be dropped: it gives 5% performance at best, which is negligible in the context of rendering multiple samples, and it makes everything more complicated. This also means that we can leave out the memory for the results, records and albedo buffers in the samples. This keeps samples small and helps with the current memory pressure.
This is kind of a different approach to #19.