creiser / kilonerf

Code for KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs

The reason why distillation is necessary #15

Open kwea123 opened 2 years ago

kwea123 commented 2 years ago

From the results I can see that the model trained without distillation has artifacts, but I don't understand why. Did you dig into the w/o distillation model to see where the artifacts come from? Is it a problem occurring at grid boundaries? Since the MLPs are all independent, adjacent MLPs could produce very different outputs at the boundaries, which might cause artifacts.
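To illustrate what I mean by independent MLPs at grid boundaries, here is a minimal, made-up sketch (not the actual KiloNeRF code; the grid resolution and layer sizes are placeholders): two points that straddle a cell boundary are routed to different tiny networks, so nothing ties their predicted density/color together.

```python
import torch
import torch.nn as nn

# Toy setup: a 2x2x2 grid of independent tiny MLPs (the real model uses
# thousands of them and a fused CUDA kernel; everything here is illustrative).
RES = 2
SCENE_MIN, SCENE_MAX = -1.0, 1.0

def make_tiny_mlp():
    return nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 4))  # sigma + rgb

tiny_mlps = nn.ModuleList([make_tiny_mlp() for _ in range(RES ** 3)])

def cell_index(points):
    """Flat index of the cell (and therefore the MLP) that owns each point."""
    t = (points - SCENE_MIN) / (SCENE_MAX - SCENE_MIN)  # normalize to [0, 1]
    ijk = (t * RES).long().clamp(0, RES - 1)            # quantize to the grid
    return (ijk[:, 0] * RES + ijk[:, 1]) * RES + ijk[:, 2]

# Two points a hair apart, but on opposite sides of the x = 0 cell boundary,
# are handled by different, independently trained networks.
pts = torch.tensor([[-1e-4, 0.3, 0.3], [1e-4, 0.3, 0.3]])
idx = cell_index(pts)
outputs = torch.stack([tiny_mlps[i](p) for i, p in zip(idx.tolist(), pts)])
print(idx.tolist(), outputs[:, 0].tolist())  # different cells, unrelated densities
```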

creiser commented 2 years ago

That's an interesting question. My hypothesis is that the smoothness priors of an MLP help it avoid bad local minima: an 8-layer MLP with 256 hidden units might simply have better priors than a small 4-layer MLP with 32 hidden units. The second thing that might be at play is that in KiloNeRF no smoothness prior acts across regions belonging to different networks. To verify that the smaller MLP size is not the root cause of these issues, one could see what happens when training with a high number of big MLPs. If the artifacts persist, it is more likely that the independence of the networks causes these problems (and not the different network architecture). It would also be interesting to check whether the artifacts are indeed more likely to appear at boundaries. In that case, it might help to overlap the individual networks slightly and interpolate their outputs when querying the representation close to a boundary (a popular technique for avoiding discretization artifacts), as sketched below.
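As a rough sketch of that overlap-and-interpolate idea (this is not something the released code does; the overlap width and all names are made up, and a full version would blend up to eight neighboring networks trilinearly), one could linearly blend the two networks that share a boundary inside a small band around it:

```python
import torch
import torch.nn as nn

# Sketch of overlap-and-blend along a single axis at the plane x = 0.
OVERLAP = 0.05  # half-width of the overlap band around the boundary (arbitrary)

def make_tiny_mlp():
    return nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 4))

left_mlp, right_mlp = make_tiny_mlp(), make_tiny_mlp()  # the two adjacent cells

def blended_query(p):
    """Query a point; inside the overlap band, interpolate both MLPs' outputs."""
    x = p[0]
    if x < -OVERLAP:                 # well inside the left cell
        return left_mlp(p)
    if x > OVERLAP:                  # well inside the right cell
        return right_mlp(p)
    w = (x + OVERLAP) / (2 * OVERLAP)                 # 0 -> 1 across the band
    return (1 - w) * left_mlp(p) + w * right_mlp(p)   # continuous at the boundary

print(blended_query(torch.tensor([0.0, 0.3, 0.3])))
```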

Anyhow, I think there are much cheaper ways to bootstrap. Recently it was shown that a voxel grid can be optimized directly (it takes < 5 minutes) and used to bootstrap a feature-conditioned MLP. Instead of the feature-conditioned MLP, one could use KiloNeRF.
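A hedged sketch of what that bootstrapping could look like, assuming a directly optimized density/color grid is already available (here `voxel_grid` is just random data, and every name is a placeholder, not an API of this repo): each tiny MLP is regressed onto trilinearly interpolated grid values inside its region before the usual photometric training.

```python
import torch
import torch.nn as nn

# Stand-in for a grid produced by fast direct optimization (density + RGB per voxel).
GRID_RES = 128
voxel_grid = torch.rand(1, 4, GRID_RES, GRID_RES, GRID_RES)

tiny_mlp = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 4))  # one cell's network
optimizer = torch.optim.Adam(tiny_mlp.parameters(), lr=1e-3)

def grid_lookup(points):
    """Trilinearly interpolate the grid at points given in [-1, 1]^3."""
    # grid_sample expects sampling coordinates of shape (N, D_out, H_out, W_out, 3).
    coords = points.view(1, -1, 1, 1, 3)
    vals = torch.nn.functional.grid_sample(voxel_grid, coords, align_corners=True)
    return vals.view(4, -1).t()  # (num_points, 4)

for step in range(1000):
    # Sample points inside the region this tiny MLP is responsible for
    # (here simply a fixed random sub-cube, purely for illustration).
    pts = torch.rand(1024, 3) * 0.25 - 1.0
    target = grid_lookup(pts)
    loss = torch.nn.functional.mse_loss(tiny_mlp(pts), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```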