Closed loreloc closed 1 year ago
NOTE: this also involves refactoring of layers to accept a unified input shape
let's first benchmark a full forward/backward pass to see how we stand now
At the current stage (968102937a9b26fec6255fe2d46cf7d8095b8c8c) I got the following results for forward+backward w/ CP using the command
python run_cirkit.py --batch_size 128 --num_latents 512 --region_graph ../quad_tree_28x28.json
t/m: 590.2905883789062 4227.39599609375
t/m: 194.1531524658203 5020.56884765625
t/m: 193.1691131591797 5020.56884765625
t/m: 195.10546875 5020.56884765625
t/m: 194.8521270751953 5020.56884765625
t/m: 194.46080017089844 5020.56884765625
t/m: 194.84063720703125 5020.56884765625
...
train LL: 3834.2465698242186 (I guess this is NLL)
Apart from the first warm-up batch, it seems we stand at ~0.2 s per batch and 5.0 GiB of memory.
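For reference, here is a minimal sketch of how "t/m"-style numbers (per-step time and peak memory) can be taken with PyTorch. This is a hypothetical helper for illustration, not the actual measurement code in run_cirkit.py:

```python
import time
import torch

def benchmark_step(step_fn, device: str = "cuda"):
    """Run one forward+backward step and return (seconds, peak MiB).

    `step_fn` is any callable performing the forward+backward pass.
    On CUDA we synchronize around the step so wall-clock time is
    meaningful, and reset the peak-memory counter beforehand.
    """
    use_cuda = device == "cuda" and torch.cuda.is_available()
    if use_cuda:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    step_fn()
    if use_cuda:
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # Peak allocated memory in MiB; not tracked this way on CPU.
    peak_mib = torch.cuda.max_memory_allocated() / 2**20 if use_cuda else float("nan")
    return elapsed, peak_mib
```

The first measured step typically includes warm-up costs (kernel compilation, allocator growth), which matches the outlier in the first line of the log above.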
nice, this is before changing the order of tensors, right?
(I guess this is NLL)
You're right: pyjuice uses LL for EM, while cirkit currently uses NLL for SGD. I didn't change the print message.
PS: I directly edited to turn the commit hash into a hyperlink.
nice, this is before changing the order of tensors, right?
Yes, that commit is from 3 Jul, when we had yet to decide which order to use. The result seems consistent with the one I got immediately after the performance fix.
My understanding was that we shall go with FBK for now (and the element-wise product for mixing layers).
TL;DR: it should be the responsibility of the network, not the layers, to deal with issues related to folding.
Regarding that element-wise product with a mask in mixing layers: initially I thought it would be better than padding. However, I later realised that the same mask would be required in any layer, as we are interested in folding layers with possibly different "fan_in" (or "arity"). I think it would make the integration of new layers, and hence the extension of the library itself, more complicated.
Therefore, I opted instead for keeping the padding information in the book-keeping data structure and removing the existing mask in the mixing layer. It is the responsibility of the network to pad inputs (if necessary) before passing them to the "folded" layers. If the pad values are chosen according to the space we are performing computations in, then this also takes care of masking the gradients appropriately.
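To illustrate the idea, here is a hypothetical sketch (not the actual cirkit book-keeping code, and the function name is made up): when folding layers of different arity, the network pads each group of inputs to the maximum arity with the neutral element of the space the layer computes in, e.g. -inf in log-space for a sum/mixing layer (so logsumexp ignores the padded slots), or 0.0 (= log 1) for a product layer. Gradients w.r.t. the padded entries then vanish without any explicit mask:

```python
import torch

def pad_fold_inputs(inputs, max_arity, pad_value):
    """Stack per-fold inputs of different arity into one tensor.

    inputs: list of tensors, each of shape (arity_f, batch, num_units);
    returns a tensor of shape (num_folds, max_arity, batch, num_units),
    padded along the arity dimension with `pad_value`.
    """
    folds = []
    for x in inputs:
        arity = x.shape[0]
        if arity < max_arity:
            pad = x.new_full((max_arity - arity, *x.shape[1:]), pad_value)
            x = torch.cat([x, pad], dim=0)
        folds.append(x)
    return torch.stack(folds, dim=0)

# Log-space mixing: pad with -inf so logsumexp ignores the extra slots.
a = torch.zeros(3, 2, 4)   # a region with arity 3
b = torch.zeros(2, 2, 4)   # a region with arity 2
folded = pad_fold_inputs([a, b], max_arity=3, pad_value=float("-inf"))
out = torch.logsumexp(folded, dim=1)  # shape (2 folds, 2 batch, 4 units)
```

The second fold correctly computes logsumexp over only its two real inputs, with no mask tensor in the layer itself.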
With this in mind, having an "un-folded" but tensorized circuit implementation is quite straightforward: it boils down to not having any padding and instantiating a layer for each region, while leaving the actual implementation of layers and the forward loop untouched.
I think this is a reasonable design, although we need to benchmark things. Feel free to add other ideas about this.