HomebrewNLP / Olmax

HomebrewNLP in JAX flavour for maintainable TPU training
BSD 2-Clause "Simplified" License

Hierarchical network #52

Closed. ClashLuke closed this issue 2 years ago.

ClashLuke commented 2 years ago

Dilated convolution improves per-step convergence a bit (plot attached). However, it is also massively slower, so we might want to re-evaluate its context size (plot attached). At equal wall-clock time, the dilated convolution underperforms the dense one (plot attached).
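For reference, a dilated depthwise convolution of the kind discussed here can be written in a few lines of JAX. This is a minimal, self-contained sketch, not the code from this repository; the shapes, kernel size, and dilation rate are illustrative assumptions.

```python
# Minimal sketch of a causal dilated depthwise 1-D convolution in JAX.
# Not this repository's implementation; shapes and hyperparameters are
# illustrative assumptions.
import jax
import jax.numpy as jnp


def dilated_depthwise_conv(x, kernel, dilation):
    """x: (batch, seq, features); kernel: (kernel_size, 1, features)."""
    kernel_size = kernel.shape[0]
    pad_left = (kernel_size - 1) * dilation  # causal: only look at the past
    return jax.lax.conv_general_dilated(
        x, kernel,
        window_strides=(1,),
        padding=[(pad_left, 0)],
        rhs_dilation=(dilation,),            # dilation widens the receptive field
        dimension_numbers=("NWC", "WIO", "NWC"),
        feature_group_count=x.shape[-1],     # depthwise: one filter per channel
    )


x = jnp.ones((2, 128, 64))                   # (batch, seq, features)
kernel = jnp.ones((3, 1, 64)) / 3            # kernel_size=3, depthwise
y = dilated_depthwise_conv(x, kernel, dilation=4)
assert y.shape == x.shape
```

The effective receptive field here is (kernel_size - 1) * dilation + 1 tokens, so dilation trades density of taps for reach over the sequence.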

ClashLuke commented 2 years ago

Currently, the performance improvements are marginal (plot attached). One possible explanation is that the model doesn't use the context provided by the depthwise block. To test this, I'll start another run without the bottleneck block and QRNN.
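For context, the QRNN block referenced here follows the quasi-recurrent pattern of Bradbury et al.: convolutions produce candidate values and forget gates, and a cheap element-wise recurrence mixes them over time. Below is a generic sketch of that f-pooling recurrence, assuming (batch, seq, features) tensors; it is not necessarily the exact variant used in Olmax.

```python
# Generic QRNN f-pooling sketch: h_t = f_t * h_{t-1} + (1 - f_t) * z_t.
# Assumes z (candidates) and f (forget gates in (0, 1)) were already produced
# by convolutions over the sequence; not Olmax's exact implementation.
import jax
import jax.numpy as jnp


def qrnn_f_pool(z, f):
    """z, f: (batch, seq, features) -> hidden states of the same shape."""
    def step(h, zf):
        z_t, f_t = zf
        h = f_t * h + (1.0 - f_t) * z_t  # element-wise recurrence, no matmul
        return h, h

    h0 = jnp.zeros_like(z[:, 0])
    # lax.scan consumes the leading axis, so move time to axis 0.
    _, hs = jax.lax.scan(step, h0,
                         (jnp.swapaxes(z, 0, 1), jnp.swapaxes(f, 0, 1)))
    return jnp.swapaxes(hs, 0, 1)
```

Because the recurrence is element-wise, removing it in the ablation changes how context flows without touching the heavy matmuls, which makes it a clean test of whether the depthwise block's context is actually being used.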

ClashLuke commented 2 years ago

#48 somewhat addresses the convolution issue by adding an explicit locality bias to the convolutions. Note that it doesn't reduce the computation time; however, neither do dilated convolutions, so it seems fair.
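As an illustration of the idea only (an assumption about what an "explicit locality bias" might look like, not the actual change in #48), one cheap way to bias a convolution toward local context is to scale each kernel tap by a decay in its distance from the current token:

```python
# Hypothetical locality bias: exponentially down-weight kernel taps by their
# distance from the current token. The decay mechanism and rate are
# assumptions for illustration, not the implementation in #48.
import jax.numpy as jnp


def locality_biased_kernel(kernel, dilation, decay=0.5):
    """kernel: (kernel_size, 1, features), laid out for causal padding,
    so tap k sits (kernel_size - 1 - k) * dilation steps in the past."""
    kernel_size = kernel.shape[0]
    distance = (kernel_size - 1 - jnp.arange(kernel_size)) * dilation
    bias = jnp.exp(-decay * distance)      # nearer taps get larger weight
    return kernel * bias[:, None, None]
```

Since a bias like this only rescales existing weights, it adds essentially no compute, consistent with the note above that the approach doesn't reduce computation time.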