april-tools / cirkit

a python framework to build, learn and reason about probabilistic circuits and tensor networks
https://cirkit-docs.readthedocs.io/en/latest/
GNU General Public License v3.0

Refactor book-keeping #96

Closed loreloc closed 1 year ago

loreloc commented 1 year ago
lkct commented 1 year ago

NOTE: this also involves refactoring of layers to accept a unified input shape
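
For illustration, a minimal sketch of what such a unified input shape could look like; the (fold, arity, batch, units) ordering, the class name and all parameters below are assumptions made for the example, not the convention actually adopted:

import torch
from torch import nn

class FoldedDenseLayer(nn.Module):
    """Hypothetical layer whose forward takes a single folded tensor.

    Assumed shape convention: (fold, arity, batch, units), so every layer
    receives one tensor regardless of how many inputs feed each fold.
    """

    def __init__(self, num_folds: int, arity: int, num_input_units: int, num_output_units: int):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(num_folds, num_output_units, arity * num_input_units)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f, h, b, k = x.shape
        # (fold, arity, batch, units) -> (fold, batch, arity * units)
        x = x.permute(0, 2, 1, 3).reshape(f, b, h * k)
        # per-fold linear map over the flattened (arity * units) dimension
        return torch.einsum("foi,fbi->fbo", self.weight, x)  # (fold, batch, out_units)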

arranger1044 commented 1 year ago

let's first benchmark a full forward/backward pass to see how we stand now

loreloc commented 1 year ago

At the current stage (968102937a9b26fec6255fe2d46cf7d8095b8c8c) I got the following results for forward+backward w/ CP using the command

python run_cirkit.py --batch_size 128 --num_latents 512 --region_graph ../quad_tree_28x28.json
t/m: 590.2905883789062 4227.39599609375
t/m: 194.1531524658203 5020.56884765625
t/m: 193.1691131591797 5020.56884765625
t/m: 195.10546875 5020.56884765625
t/m: 194.8521270751953 5020.56884765625
t/m: 194.46080017089844 5020.56884765625
t/m: 194.84063720703125 5020.56884765625
...
train LL: 3834.2465698242186  (I guess this is NLL)

Apart from the first warm-up batch, it seems we stand at roughly 0.2 s per batch and 5.0 GiB of memory.
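
To make the numbers above easier to interpret, here is a rough sketch of how such per-batch time/memory figures can be collected; it assumes the "t/m" columns are milliseconds and MiB (consistent with the ~0.2 s / ~5.0 GiB summary) and is not the actual run_cirkit.py script:

import time
import torch

def benchmark_epoch(model, data_loader, optimizer, device="cuda"):
    """Print per-batch wall-clock time (ms) and peak GPU memory (MiB)."""
    model.to(device)
    for x in data_loader:
        x = x.to(device)
        torch.cuda.reset_peak_memory_stats(device)
        torch.cuda.synchronize(device)
        start = time.perf_counter()

        optimizer.zero_grad()
        nll = -model(x).mean()  # average negative log-likelihood, minimized with SGD
        nll.backward()
        optimizer.step()

        torch.cuda.synchronize(device)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        peak_mib = torch.cuda.max_memory_allocated(device) / 2**20
        print("t/m:", elapsed_ms, peak_mib)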

arranger1044 commented 1 year ago

nice, this is before changing the order of tensors, right?

lkct commented 1 year ago

(I guess this is NLL)

You're right: pyjuice uses the LL for EM, while cirkit currently uses the NLL for SGD. I just didn't change the printed label.

PS: I directly edited the comment to turn the commit hash into a hyperlink.

nice, this is before changing the order of tensors, right?

Yes, that commit is from 3 Jul, when we had yet to decide which order to use. The result seems consistent with the one I got immediately after the performance fix.

arranger1044 commented 1 year ago

my understanding was that we would go with FBK for now (and the element-wise product for mixing layers)
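
For reference, a sketch of the masked element-wise product idea for mixing layers mentioned above; the shapes, names and the linear-space formulation are illustrative assumptions:

import torch
from torch import nn

class MaskedMixingLayer(nn.Module):
    """Hypothetical mixing layer: weighted sum over inputs, folded with a mask.

    x has shape (fold, max_arity, batch, units); folds whose region has fewer
    than max_arity inputs get their extra slots zeroed out by the mask.
    """

    def __init__(self, num_folds: int, max_arity: int, arities: list):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(num_folds, max_arity))
        mask = torch.zeros(num_folds, max_arity)
        for f, a in enumerate(arities):
            mask[f, :a] = 1.0  # 1 where fold f actually has an input
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight * self.mask                # element-wise product with the mask
        w = w / w.sum(dim=1, keepdim=True)         # renormalize per fold
        return torch.einsum("fh,fhbk->fbk", w, x)  # (fold, batch, units)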

loreloc commented 1 year ago

TL;DR: it should be the responsibility of the network, not the layers, to deal with issues related to folding.

Regarding that element-wise product with a mask in mixing layers: initially I thought it would be better than padding. However, I later realised that the same would be required for any layer, as we are interested in folding layers with possibly different "fan_in" (or "arity"). I think it would make the integration of new layers more complicated, and hence the extension of the library itself.

Therefore, I opted instead for keeping the padding information in the book-keeping data structure and removing the existing mask in the mixing layer. It is the responsibility of the network to pad inputs (if necessary) before passing them to the "folded" layers. If the pad values are chosen according to the space in which we perform computations, then padding also takes care of masking the gradients in an appropriate way.
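
A rough sketch of this padding-based alternative, with fully made-up names; the key point is that the neutral pad value depends on the space the computations run in (e.g. 0 in linear space, -inf in log-space), so that padded slots contribute nothing to the folded layer's output:

import torch

def fold_inputs(region_outputs, arities, pad_value):
    """Stack per-region inputs with different arities into one folded tensor.

    region_outputs: one tensor per fold, each of shape (arity_f, batch, units).
    pad_value: neutral element of the space we compute in, e.g. 0.0 in linear
               space or float("-inf") in log-space.
    Returns a tensor of shape (num_folds, max_arity, batch, units).
    """
    max_arity = max(arities)
    padded = []
    for x, a in zip(region_outputs, arities):
        if a < max_arity:
            pad = x.new_full((max_arity - a, *x.shape[1:]), pad_value)
            x = torch.cat([x, pad], dim=0)
        padded.append(x)
    return torch.stack(padded, dim=0)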

With this in mind, having an "un-folded" but tensorized circuit implementation is quite straightforward, as it boils down to not having any padding and instantiating a layer for each region, while leaving the actual implementation of layers and the forward loop untouched.
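
Continuing the sketch above (and reusing the hypothetical fold_inputs helper), the forward loop could then look roughly like this, where each book-keeping entry lists, per fold, the regions feeding that fold and the pad value to use; the un-folded case is simply the one where every entry contains a single fold:

import torch

def forward(layers, bookkeeping, leaf_outputs):
    """Illustrative layer-by-layer evaluation of a tensorized circuit.

    leaf_outputs: list of per-region tensors of shape (batch, units).
    bookkeeping: per-layer entries with `input_ids` (one list of region
    indices per fold) and `pad_value`; the network, not the layer, does
    the padding before calling the folded layer.
    """
    region_outputs = list(leaf_outputs)
    for layer, entry in zip(layers, bookkeeping):
        per_fold = [
            torch.stack([region_outputs[i] for i in ids], dim=0)  # (arity_f, batch, units)
            for ids in entry.input_ids
        ]
        arities = [len(ids) for ids in entry.input_ids]
        x = fold_inputs(per_fold, arities, entry.pad_value)  # (folds, max_arity, batch, units)
        y = layer(x)                                         # (folds, batch, units)
        region_outputs.extend(y.unbind(dim=0))               # one output per fold/region
    return region_outputs[-1]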

I think this is a reasonable design, although we need to benchmark things. Feel free to add other ideas about this.