aesara-devs / aesara

Aesara is a Python library for defining, optimizing, and efficiently evaluating mathematical expressions involving multi-dimensional arrays.
https://aesara.readthedocs.io
Other
1.18k stars 155 forks source link

Limit the use of `equal_computations` in `Scan` rewrites #778

Open brandonwillard opened 2 years ago

brandonwillard commented 2 years ago

The Scan rewrites make liberal use of equal_computations, which is fairly expensive. Let's see if we can limit or—better yet—remove the need for equal_computations.

rlouf commented 2 years ago

Running the test with aesara.config.optimizer_verbose==True for a couple of minutes, I find in a log that contains about 21k lines:

Are you sure most of the time is spent in the scan rewrites?

brandonwillard commented 2 years ago

I ran the optimizer profiler for the aesara.function step in tests.test_hmc.test_warmup on https://github.com/brandonwillard/aehmc/commit/4248379 (i.e. config.change_flags(profile=True, profile_optimizer=True)).

Profiler output ```python Function profiling ================== Message: /home/bwillard/projects/code/python/aehmc/tests/test_hmc.py:67 Time in 0 calls to Function.__call__: 0.000000e+00s Total compile time: 2.779323e+02s Number of Apply nodes: 75 Aesara Optimizer time: 3.835249e+01s Aesara validate time: 2.275324e-02s Aesara Linker time (includes C, CUDA code generation/compiling): 239.55823922157288s Import time 7.904582e-01s Node make_thunk time 2.395537e+02s Node forall_inplace,cpu,scan_fn}(TensorConstant{2000}, IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{Set;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0, RandomGeneratorSharedVariable(), RandomGeneratorSharedVariable(), RandomGeneratorSharedVariable(), RandomGeneratorSharedVariable(), RandomGeneratorSharedVariable(), RandomGeneratorSharedVariable(), RandomGeneratorSharedVariable(), RandomGeneratorSharedVariable(), RandomGeneratorSharedVariable(), RandomGeneratorSharedVariable(), RandomGeneratorSharedVariable(), RandomGeneratorSharedVariable(), , IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{Set;:int64:}.0, IncSubtensor{Set;:int64:}.0, IncSubtensor{Set;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{Set;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{Set;:int64:}.0, IncSubtensor{Set;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{Set;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0) time 2.181189e+02s Node forall_inplace,cpu,scan_fn}(TensorConstant{1000}, TensorConstant{[ True Tr..rue True]}, TensorConstant{[False Fal..lse False]}, IncSubtensor{Set;:int64:}.0, IncSubtensor{Set;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{Set;:int64:}.0, IncSubtensor{Set;:int64:}.0, IncSubtensor{Set;:int64:}.0, DeepCopyOp.0, IncSubtensor{Set;:int64:}.0, IncSubtensor{Set;:int64:}.0, DeepCopyOp.0, DeepCopyOp.0, IncSubtensor{InplaceSet;:int64:}.0, RandomGeneratorSharedVariable(), RandomGeneratorSharedVariable(), RandomGeneratorSharedVariable(), RandomGeneratorSharedVariable(), IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{Set;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0) time 1.069709e+01s Node do_while{inplace{1,2,},cpu,scan_fn}(TensorConstant{100}, IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0, DeepCopyOp.0, RandomGeneratorSharedVariable(), RandomGeneratorSharedVariable(), RandomGeneratorSharedVariable(), RandomGeneratorSharedVariable(), , IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0, Elemwise{Composite{(-(i0 + (i1 * sqr(i2))))}}[(0, 2)].0, IncSubtensor{Set;:int64:}.0, IncSubtensor{Set;:int64:}.0, IncSubtensor{Set;:int64:}.0, IncSubtensor{Set;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0) time 1.026223e+01s Node DeepCopyOp(IncSubtensor{Set;:int64:}.0) time 4.013307e-01s Node Elemwise{Composite{(i0 + (i1 * i2))}}(TensorConstant{-0.5}, TensorConstant{0.5}, ) time 9.969950e-03s Time in all call to aesara.grad() 3.874266e-01s Time since aesara import 380.904s Optimizer Profile ----------------- SeqOptimizer OPT_FAST_RUN time 38.352s for 198/75 nodes before/after optimization 23.301s for callback 0.023s for fgraph.validate() callbacks_time , 23.106433391571045 , 0.11796212196350098 , 0.012229681015014648 , 0.009456157684326172 , 0.006871700286865234 , 0.006870269775390625 , 0.003031492233276367 Updater{canonicalize} , 0.0023703575134277344 , 0.0007693767547607422 , 0.0004673004150390625 Updater{local_inplace_setsubtensor} , 0.0001442432403564453 Updater{canonicalize} , 4.220008850097656e-05 Updater{useless} , 2.6226043701171875e-05 , 2.193450927734375e-05 Updater{canonicalize} , 2.6226043701171875e-06 time - (name, class, index, nodes before, nodes after) - validate time 8.991053s - ('save_mem_new_scan', 'TopoOptimizer', 13, 78, 278) - 0.000s TopoOptimizer scan_save_mem nb_node (start, end, changed) (78, 278, 3) init io_toposort 0.00022840499877929688 loop time 8.990811109542847 callback_time 7.609731912612915 8.082378s - ('scan_make_inplace', 'ScanInplaceOptimizer', 48, 71, 75) - 0.003s 7.599082s - ('scan_eqopt2', 'EquilibriumOptimizer', 12, 78, 78) - 0.000s EquilibriumOptimizer scan_eqopt2 time 7.599s for 2 passes nb nodes (start, end, max) 78 78 78 time io_toposort 0.000s time in local optimizers 0.000s time in global optimizers 7.597s time in final optimizers 0.000s time in cleanup optimizers 0.000s 0 - 7.536s 1 (7.534s in global opts, 0.000s io_toposort) - 78 nodes - ('remove_constants_and_unused_inputs_scan', 1) 1 - 0.063s 0 (0.062s in global opts, 0.000s io_toposort) - 78 nodes - times - times applied - nb node created - name: 7.478s - 1 - 3 - remove_constants_and_unused_inputs_scan 0.119s - in 5 optimization that were not used (display only those with a runtime > 0) 0.099s - scan_merge_inouts 0.009s - remove_constants_and_unused_inputs_scan 0.009s - remove_constants_and_unused_inputs_scan 0.001s - constant_folding 0.001s - Global, final and clean up optimizers Iter 0 TopoOptimizer constant_folding_for_scan2 nb_node (start, end, changed) (78, 78, 0) init io_toposort 0.00022077560424804688 loop time 0.0001704692840576172 callback_time 0.0 TopoOptimizer scan_remove_constants_and_unused_inputs1 nb_node (start, end, changed) (78, 78, 3) init io_toposort 0.00021529197692871094 loop time 7.472891807556152 callback_time 7.383151531219482 TopoOptimizer scan_remove_constants_and_unused_inputs2 nb_node (start, end, changed) (78, 78, 0) init io_toposort 0.00022673606872558594 loop time 0.004494905471801758 callback_time 0.0 TopoOptimizer scan_merge_inouts nb_node (start, end, changed) (78, 78, 0) init io_toposort 0.00023412704467773438 loop time 0.0507969856262207 callback_time 0.0 TopoOptimizer scan_remove_constants_and_unused_inputs3 nb_node (start, end, changed) (78, 78, 0) init io_toposort 0.00028705596923828125 loop time 0.0043675899505615234 callback_time 0.0 Iter 1 TopoOptimizer constant_folding_for_scan2 nb_node (start, end, changed) (78, 78, 0) init io_toposort 0.00023055076599121094 loop time 0.00018644332885742188 callback_time 0.0 TopoOptimizer scan_remove_constants_and_unused_inputs1 nb_node (start, end, changed) (78, 78, 0) init io_toposort 0.00024199485778808594 loop time 0.004249095916748047 callback_time 0.0 TopoOptimizer scan_remove_constants_and_unused_inputs2 nb_node (start, end, changed) (78, 78, 0) init io_toposort 0.00022482872009277344 loop time 0.004189968109130859 callback_time 0.0 TopoOptimizer scan_merge_inouts nb_node (start, end, changed) (78, 78, 0) init io_toposort 0.00023412704467773438 loop time 0.047831058502197266 callback_time 0.0 TopoOptimizer scan_remove_constants_and_unused_inputs3 nb_node (start, end, changed) (78, 78, 0) init io_toposort 0.00026535987854003906 loop time 0.004282474517822266 callback_time 0.0 6.947166s - ('ShapeOpt', 'ShapeOptimizer', 3, 463, 463) - 0.000s 3.126614s - ('canonicalize', 'EquilibriumOptimizer', 7, 230, 78) - 0.008s EquilibriumOptimizer canonicalize time 3.126s for 4 passes nb nodes (start, end, max) 230 78 230 time io_toposort 0.002s time in local optimizers 0.982s time in global optimizers 0.000s time in final optimizers 1.627s time in cleanup optimizers 0.853s 0 - 3.088s 291 (1.622s in global opts, 0.001s io_toposort) - 230 nodes - ('MergeOptimizer', 102) ('local_upcast_elemwise_constant_inputs', 51) ('local_dimshuffle_lift', 49) ('local_add_canonizer', 28) ('local_shape_to_shape_i', 23) ... 1 - 0.025s 18 (0.005s in global opts, 0.000s io_toposort) - 88 nodes - ('local_useless_alloc', 3) ('local_useless_switch', 3) ('MergeOptimizer', 3) ('local_mul_canonizer', 2) ('local_add_canonizer', 2) ... 2 - 0.007s 1 (0.000s in global opts, 0.000s io_toposort) - 78 nodes - ('local_add_canonizer', 1) 3 - 0.006s 0 (0.000s in global opts, 0.000s io_toposort) - 78 nodes - times - times applied - nb node created - name: 1.627s - 2 - 0 - topo_constant_folding 0.853s - 105 - 0 - MergeOptimizer 0.168s - 7 - 329 - local_subtensor_merge 0.167s - 31 - 33 - local_add_canonizer 0.102s - 52 - 156 - local_upcast_elemwise_constant_inputs 0.049s - 50 - 42 - local_dimshuffle_lift 0.022s - 22 - 22 - local_subtensor_remove_broadcastable_index 0.018s - 23 - 1 - local_shape_to_shape_i 0.009s - 3 - 3 - local_useless_switch 0.008s - 4 - 3 - local_mul_canonizer 0.006s - 2 - 5 - local_greedy_distributor 0.003s - 2 - 2 - local_neg_to_mul 0.001s - 3 - 0 - local_tensor_scalar_tensor 0.000s - 3 - 0 - local_useless_alloc 0.000s - 1 - 0 - local_pow_canonicalize 0.063s - in 85 optimization that were not used (display only those with a runtime > 0) 0.009s - local_fill_sink 0.009s - local_func_inv 0.008s - local_useless_elemwise 0.005s - local_useless_elemwise_comparison 0.005s - local_mul_zero 0.004s - local_merge_switch_same_cond 0.003s - local_cast_cast 0.002s - local_exp_log 0.002s - local_view_op 0.002s - local_useless_composite 0.002s - local_expm1 0.002s - local_useless_rebroadcast 0.002s - local_setsubtensor_of_constants 0.001s - local_IncSubtensor_serialize 0.001s - local_useless_inc_subtensor 0.001s - local_useless_slice 0.001s - local_mul_switch_sink 0.001s - local_useless_subtensor 0.001s - local_subtensor_of_dot 0.001s - local_subtensor_lift 0.001s - local_useless_inc_subtensor_alloc 0.000s - local_rebroadcast_lift 0.000s - local_useless_dimshuffle_makevector 0.000s - local_incsubtensor_of_zeros 0.000s - local_lift_transpose_through_dot 0.000s - local_scalar_tensor_scalar 0.000s - local_incsubtensor_of_zeros_to_setsubtensor 0.000s - local_subtensor_shape_constant 0.000s - local_subtensor_make_vector 0.000s - local_subtensor_of_alloc 0.000s - local_subtensor_inc_subtensor 0.000s - local_subtensor_SpecifyShape_lift 0.000s - local_div_switch_sink 0.000s - local_zero_div 0.000s - local_Shape_i_of_broadcastable 0.000s - local_alloc_sink_dimshuffle 0.000s - local_useless_SpecifyShape 0.000s - local_track_shape_i 0.000s - local_merge_alloc Global, final and clean up optimizers Iter 0 TopoOptimizer topo_constant_folding nb_node (start, end, changed) (323, 92, 231) init io_toposort 0.0008187294006347656 loop time 1.621021032333374 callback_time 0.0439150333404541 MergeOptimizer nb fail= 0 merged= 588 constant= 353 time replace=0.85 validate=0.00 callback=0.04 Iter 1 TopoOptimizer topo_constant_folding nb_node (start, end, changed) (80, 78, 2) init io_toposort 0.0002655982971191406 loop time 0.0042231082916259766 callback_time 0.0005240440368652344 MergeOptimizer nb fail= 0 merged= 4 constant= 4 time replace=0.00 validate=0.00 callback=0.00 Iter 2 TopoOptimizer topo_constant_folding nb_node (start, end, changed) (78, 78, 0) init io_toposort 0.00022554397583007812 loop time 0.0001728534698486328 callback_time 0.0 MergeOptimizer nb fail= 0 merged= 0 constant= 0 time replace=0.00 validate=0.00 callback=0.00 Iter 3 TopoOptimizer topo_constant_folding nb_node (start, end, changed) (78, 78, 0) init io_toposort 0.00022101402282714844 loop time 0.00016808509826660156 callback_time 0.0 MergeOptimizer nb fail= 0 merged= 0 constant= 0 time replace=0.00 validate=0.00 callback=0.00 2.686624s - ('scan_eqopt1', 'EquilibriumOptimizer', 2, 165, 463) - 0.000s EquilibriumOptimizer scan_eqopt1 time 2.687s for 3 passes nb nodes (start, end, max) 165 463 1035 time io_toposort 0.006s time in local optimizers 0.000s time in global optimizers 2.670s time in final optimizers 0.000s time in cleanup optimizers 0.000s 0 - 2.063s 1 (2.054s in global opts, 0.003s io_toposort) - 1035 nodes - ('SeqOpt([, , , , ])', 1) 1 - 0.288s 1 (0.285s in global opts, 0.001s io_toposort) - 463 nodes - ('SeqOpt([, , , , ])', 1) 2 - 0.335s 0 (0.331s in global opts, 0.002s io_toposort) - 463 nodes - times - times applied - nb node created - name: 2.670s - 2 - 881 - SeqOpt([, , , , ]) 0.000s - in 0 optimization that were not used (display only those with a runtime > 0) Global, final and clean up optimizers Iter 0 SeqOptimizer all_pushout_opt time 2.054s for 165/1035 nodes before/after optimization 0.014s for callback 0.000s for fgraph.validate() 1.685730s - ('push_out_non_seq_scan', 'TopoOptimizer', 1, 162, 1033) - 0.000s TopoOptimizer scan_pushout_nonseqs_ops nb_node (start, end, changed) (162, 1033, 3) init io_toposort 0.0004112720489501953 loop time 1.6853070259094238 callback_time 0.013624906539916992 0.203319s - ('push_out_add_scan', 'TopoOptimizer', 4, 1035, 1035) - 0.000s TopoOptimizer scan_pushout_add nb_node (start, end, changed) (1035, 1035, 0) init io_toposort 0.002458810806274414 loop time 0.2008364200592041 callback_time 0.0 0.089356s - ('push_out_seq_scan', 'TopoOptimizer', 2, 1033, 1035) - 0.000s TopoOptimizer scan_pushout_seqs_ops nb_node (start, end, changed) (1033, 1035, 1) init io_toposort 0.0024344921112060547 loop time 0.08690786361694336 callback_time 0.00019097328186035156 0.070903s - ('remove_constants_and_unused_inputs_scan', 'TopoOptimizer', 0, 165, 162) - 0.000s TopoOptimizer scan_remove_constants_and_unused_inputs0 nb_node (start, end, changed) (165, 162, 1) init io_toposort 0.0004410743713378906 loop time 0.0704503059387207 callback_time 0.00017118453979492188 0.004821s - ('push_out_dot1_scan', 'TopoOptimizer', 3, 1035, 1035) - 0.000s TopoOptimizer scan_pushout_dot1 nb_node (start, end, changed) (1035, 1035, 0) init io_toposort 0.00283050537109375 loop time 0.0019791126251220703 callback_time 0.0 Iter 1 SeqOptimizer all_pushout_opt time 0.285s for 1035/463 nodes before/after optimization 0.003s for callback 0.000s for fgraph.validate() 0.131621s - ('remove_constants_and_unused_inputs_scan', 'TopoOptimizer', 0, 1035, 463) - 0.000s TopoOptimizer scan_remove_constants_and_unused_inputs0 nb_node (start, end, changed) (1035, 463, 3) init io_toposort 0.0026018619537353516 loop time 0.12900519371032715 callback_time 0.003416299819946289 0.056020s - ('push_out_non_seq_scan', 'TopoOptimizer', 1, 463, 463) - 0.000s TopoOptimizer scan_pushout_nonseqs_ops nb_node (start, end, changed) (463, 463, 0) init io_toposort 0.001226663589477539 loop time 0.05478024482727051 callback_time 0.0 0.050227s - ('push_out_seq_scan', 'TopoOptimizer', 2, 463, 463) - 0.000s TopoOptimizer scan_pushout_seqs_ops nb_node (start, end, changed) (463, 463, 0) init io_toposort 0.0013093948364257812 loop time 0.0489039421081543 callback_time 0.0 0.045106s - ('push_out_add_scan', 'TopoOptimizer', 4, 463, 463) - 0.000s TopoOptimizer scan_pushout_add nb_node (start, end, changed) (463, 463, 0) init io_toposort 0.0010845661163330078 loop time 0.043999433517456055 callback_time 0.0 0.002162s - ('push_out_dot1_scan', 'TopoOptimizer', 3, 463, 463) - 0.000s TopoOptimizer scan_pushout_dot1 nb_node (start, end, changed) (463, 463, 0) init io_toposort 0.0012464523315429688 loop time 0.0009045600891113281 callback_time 0.0 Iter 2 SeqOptimizer all_pushout_opt time 0.331s for 463/463 nodes before/after optimization 0.000s for callback 0.000s for fgraph.validate() 0.220640s - ('push_out_add_scan', 'TopoOptimizer', 4, 463, 463) - 0.000s TopoOptimizer scan_pushout_add nb_node (start, end, changed) (463, 463, 0) init io_toposort 0.0011157989501953125 loop time 0.2194676399230957 callback_time 0.0 0.051416s - ('push_out_seq_scan', 'TopoOptimizer', 2, 463, 463) - 0.000s TopoOptimizer scan_pushout_seqs_ops nb_node (start, end, changed) (463, 463, 0) init io_toposort 0.0012049674987792969 loop time 0.05019831657409668 callback_time 0.0 0.049624s - ('push_out_non_seq_scan', 'TopoOptimizer', 1, 463, 463) - 0.000s TopoOptimizer scan_pushout_nonseqs_ops nb_node (start, end, changed) (463, 463, 0) init io_toposort 0.001069784164428711 loop time 0.04854178428649902 callback_time 0.0 0.006926s - ('remove_constants_and_unused_inputs_scan', 'TopoOptimizer', 0, 463, 463) - 0.000s TopoOptimizer scan_remove_constants_and_unused_inputs0 nb_node (start, end, changed) (463, 463, 0) init io_toposort 0.0011239051818847656 loop time 0.005790233612060547 callback_time 0.0 0.002280s - ('push_out_dot1_scan', 'TopoOptimizer', 3, 463, 463) - 0.000s TopoOptimizer scan_pushout_dot1 nb_node (start, end, changed) (463, 463, 0) init io_toposort 0.001316070556640625 loop time 0.0009510517120361328 callback_time 0.0 0.422066s - ('merge1.1', 'MergeOptimizer', 5, 445, 230) - 0.002s MergeOptimizer nb fail= 0 merged= 313 constant= 98 time replace=0.42 validate=0.00 callback=0.02 0.253114s - ('specialize', 'EquilibriumOptimizer', 15, 278, 103) - 0.002s EquilibriumOptimizer specialize time 0.253s for 4 passes nb nodes (start, end, max) 278 103 278 time io_toposort 0.002s time in local optimizers 0.059s time in global optimizers 0.002s time in final optimizers 0.185s time in cleanup optimizers 0.000s 0 - 0.236s 45 (0.184s in global opts, 0.001s io_toposort) - 278 nodes - ('local_add_specialize', 17) ('local_subtensor_remove_broadcastable_index', 17) ('local_shape_to_shape_i', 7) ('topo_constant_folding', 1) ('local_pow_specialize', 1) ... 1 - 0.007s 1 (0.001s in global opts, 0.000s io_toposort) - 105 nodes - ('local_subtensor_merge', 1) 2 - 0.005s 1 (0.001s in global opts, 0.000s io_toposort) - 104 nodes - ('local_useless_subtensor', 1) 3 - 0.005s 0 (0.001s in global opts, 0.000s io_toposort) - 103 nodes - times - times applied - nb node created - name: 0.185s - 1 - 0 - topo_constant_folding 0.020s - 17 - 37 - local_add_specialize 0.010s - 17 - 17 - local_subtensor_remove_broadcastable_index 0.003s - 7 - 0 - local_shape_to_shape_i 0.001s - 1 - 1 - local_useless_slice 0.001s - 1 - 0 - local_useless_subtensor 0.001s - 1 - 1 - local_subtensor_merge 0.001s - 1 - 1 - local_mul_specialize 0.000s - 1 - 1 - local_pow_specialize 0.021s - in 76 optimization that were not used (display only those with a runtime > 0) 0.003s - local_elemwise_alloc 0.002s - local_func_inv 0.002s - local_useless_elemwise 0.002s - crossentropy_to_crossentropy_with_softmax_with_bias 0.001s - local_useless_elemwise_comparison 0.001s - local_useless_inc_subtensor 0.001s - local_remove_useless_assert 0.001s - local_useless_switch 0.001s - local_exp_log_nan_switch 0.001s - local_exp_log 0.001s - local_logsoftmax 0.001s - local_elemwise_sub_zeros 0.001s - local_cast_cast 0.001s - local_expm1 0.001s - local_useless_inc_subtensor_alloc 0.001s - local_useless_rebroadcast 0.001s - local_alloc_unary 0.000s - local_incsubtensor_of_zeros_to_setsubtensor 0.000s - local_dimshuffle_lift 0.000s - local_subtensor_of_dot 0.000s - local_abs_merge 0.000s - local_rebroadcast_lift 0.000s - local_scalar_tensor_scalar 0.000s - local_mul_switch_sink 0.000s - local_useless_dimshuffle_makevector 0.000s - local_subtensor_make_vector 0.000s - local_subtensor_shape_constant 0.000s - local_subtensor_of_alloc 0.000s - local_subtensor_inc_subtensor 0.000s - local_mul_to_sqr 0.000s - local_neg_div_neg 0.000s - local_track_shape_i Global, final and clean up optimizers Iter 0 TopoOptimizer topo_constant_folding nb_node (start, end, changed) (288, 105, 183) init io_toposort 0.0007970333099365234 loop time 0.18248367309570312 callback_time 0.021799802780151367 Iter 1 TopoOptimizer topo_constant_folding nb_node (start, end, changed) (104, 104, 0) init io_toposort 0.0002810955047607422 loop time 0.0002205371856689453 callback_time 0.0 Iter 2 TopoOptimizer topo_constant_folding nb_node (start, end, changed) (103, 103, 0) init io_toposort 0.0002765655517578125 loop time 0.0002162456512451172 callback_time 0.0 Iter 3 TopoOptimizer topo_constant_folding nb_node (start, end, changed) (103, 103, 0) init io_toposort 0.00029397010803222656 loop time 0.00021505355834960938 callback_time 0.0 0.140982s - ('local_inplace_setsubtensor', 'TopoOptimizer', 36, 71, 71) - 0.005s TopoOptimizer local_inplace_setsubtensor nb_node (start, end, changed) (71, 71, 12) init io_toposort 0.0001938343048095703 loop time 0.14076972007751465 callback_time 0.013012886047363281 0.021685s - ('useless', 'TopoOptimizer', 4, 463, 445) - 0.000s TopoOptimizer useless nb_node (start, end, changed) (463, 445, 15) init io_toposort 0.0012607574462890625 loop time 0.020386934280395508 callback_time 0.0009860992431640625 LocalOptGroup --------------------- time taken - times applied - times tried - name - node_created: -0.000s - 0 - 3 - local_merge_alloc - 0 -0.000s - 0 - 3 - local_useless_SpecifyShape - 0 -0.000s - 0 - 3 - local_useless_alloc - 0 -0.000s - 6 - 6 - local_useless_fill - 0 -0.000s - 6 - 6 - local_useless_reduce - 0 -0.000s - 0 - 49 - local_Shape_of_SpecifyShape - 0 -0.000s - 0 - 39 - local_useless_inc_subtensor_alloc - 0 -0.000s - 0 - 72 - local_subtensor_of_alloc - 0 -0.000s - 0 - 72 - local_subtensor_make_vector - 0 -0.000s - 0 - 122 - local_useless_switch - 0 -0.000s - 0 - 72 - local_useless_slice - 0 -0.001s - 0 - 39 - local_useless_inc_subtensor - 0 -0.001s - 0 - 460 - local_view_op - 0 -0.001s - 0 - 122 - local_useless_elemwise_comparison - 0 -0.001s - 0 - 39 - local_useless_rebroadcast - 0 -0.001s - 3 - 463 - local_remove_useless_assert - 0 -0.001s - 0 - 122 - local_useless_elemwise - 0 0.000s - in 14 optimization that were not used (display those with runtime greater than 0) 0.017847s - ('elemwise_fusion', 'SeqOptimizer', 20, 103, 87) - 0.000s SeqOptimizer elemwise_fusion time 0.018s for 103/87 nodes before/after optimization 0.001s for callback 0.000s for fgraph.validate() 0.017357s - ('composite_elemwise_fusion', 'FusionOptimizer', 1, 103, 87) - 0.000s FusionOptimizer nb_iter 4 nb_replacement 6 nb_inconsistency_replace 0 validate_time 4.482269287109375e-05 callback_time 0.00067901611328125 time_toposort 0.0011150836944580078 0.000467s - ('local_add_mul_fusion', 'FusionOptimizer', 0, 103, 103) - 0.000s FusionOptimizer nb_iter 1 nb_replacement 0 nb_inconsistency_replace 0 validate_time 0.0 callback_time 0.0 time_toposort 0.00029778480529785156 0.013326s - ('BlasOpt', 'SeqOptimizer', 14, 278, 278) - 0.000s SeqOptimizer BlasOpt time 0.013s for 278/278 nodes before/after optimization 0.000s for callback 0.000s for fgraph.validate() 0.003695s - ('use_c_blas', 'TopoOptimizer', 4, 278, 278) - 0.000s TopoOptimizer use_c_blas nb_node (start, end, changed) (278, 278, 0) init io_toposort 0.0007138252258300781 loop time 0.0029571056365966797 callback_time 0.0 0.003420s - ('local_gemm_to_gemv', 'EquilibriumOptimizer', 3, 278, 278) - 0.000s EquilibriumOptimizer local_gemm_to_gemv time 0.003s for 1 passes nb nodes (start, end, max) 278 278 278 time io_toposort 0.001s time in local optimizers 0.000s time in global optimizers 0.000s time in final optimizers 0.000s time in cleanup optimizers 0.000s 0 - 0.003s 0 (0.000s in global opts, 0.001s io_toposort) - 278 nodes - 0.001848s - ('gemm_optimizer', 'GemmOptimizer', 1, 278, 278) - 0.000s GemmOptimizer nb_iter 1 nb_replacement 0 nb_replacement_didn_t_remove 0 nb_inconsistency_make 0 nb_inconsistency_replace 0 time_canonicalize 0.00021600723266601562 time_factor_can 0 time_factor_list 0 time_toposort 0.0006978511810302734 validate_time 0.0 callback_time 0.0 0.001461s - ('local_dot22_to_dot22scalar', 'TopoOptimizer', 2, 278, 278) - 0.000s TopoOptimizer local_dot22_to_dot22scalar nb_node (start, end, changed) (278, 278, 0) init io_toposort 0.0007007122039794922 loop time 0.0007398128509521484 callback_time 0.0 0.001458s - ('use_scipy_ger', 'TopoOptimizer', 5, 278, 278) - 0.000s TopoOptimizer scipy_blas nb_node (start, end, changed) (278, 278, 0) init io_toposort 0.0008904933929443359 loop time 0.0005426406860351562 callback_time 0.0 0.001408s - ('local_dot_to_dot22', 'TopoOptimizer', 0, 278, 278) - 0.000s TopoOptimizer local_dot_to_dot22 nb_node (start, end, changed) (278, 278, 0) init io_toposort 0.0008172988891601562 loop time 0.0005664825439453125 callback_time 0.0 0.006271s - ('merge1', 'MergeOptimizer', 1, 198, 165) - 0.000s MergeOptimizer nb fail= 0 merged= 70 constant= 37 time replace=0.00 validate=0.00 callback=0.00 0.005435s - ('inplace_elemwise_opt', 'InplaceElemwiseOptimizer', 47, 71, 71) - 0.001s InplaceElemwiseOptimizer node_before 71 nb_call_replace 3 nb_call_validate 3 nb_inconsistent 0 ndim nb 0 2 1 1 0.004169s - ('merge2', 'MergeOptimizer', 22, 87, 71) - 0.000s MergeOptimizer nb fail= 0 merged= 51 constant= 35 time replace=0.00 validate=0.00 callback=0.00 0.003990s - ('stabilize', 'EquilibriumOptimizer', 9, 78, 78) - 0.000s EquilibriumOptimizer stabilize time 0.004s for 1 passes nb nodes (start, end, max) 78 78 78 time io_toposort 0.000s time in local optimizers 0.002s time in global optimizers 0.000s time in final optimizers 0.000s time in cleanup optimizers 0.000s 0 - 0.004s 0 (0.001s in global opts, 0.000s io_toposort) - 78 nodes - Global, final and clean up optimizers Iter 0 TopoOptimizer topo_constant_folding nb_node (start, end, changed) (78, 78, 0) init io_toposort 0.0002281665802001953 loop time 0.0001735687255859375 callback_time 0.0 0.002956s - ('merge3', 'MergeOptimizer', 52, 75, 75) - 0.000s MergeOptimizer nb fail= 7 merged= 0 constant= 0 time replace=0.00 validate=0.00 callback=0.00 0.002933s - ('uncanonicalize', 'EquilibriumOptimizer', 17, 103, 103) - 0.000s EquilibriumOptimizer uncanonicalize time 0.003s for 1 passes nb nodes (start, end, max) 103 103 103 time io_toposort 0.000s time in local optimizers 0.000s time in global optimizers 0.000s time in final optimizers 0.001s time in cleanup optimizers 0.000s 0 - 0.003s 0 (0.001s in global opts, 0.000s io_toposort) - 103 nodes - Global, final and clean up optimizers Iter 0 TopoOptimizer topo_constant_folding nb_node (start, end, changed) (103, 103, 0) init io_toposort 0.0002803802490234375 loop time 0.0002300739288330078 callback_time 0.0 0.002910s - ('local_IncSubtensor_serialize', 'TopoOptimizer', 6, 230, 230) - 0.000s TopoOptimizer pre_local_IncSubtensor_serialize nb_node (start, end, changed) (230, 230, 0) init io_toposort 0.0014812946319580078 loop time 0.0013837814331054688 callback_time 0.0 0.002344s - ('specialize_device', 'EquilibriumOptimizer', 18, 103, 103) - 0.000s EquilibriumOptimizer specialize_device time 0.002s for 1 passes nb nodes (start, end, max) 103 103 103 time io_toposort 0.000s time in local optimizers 0.000s time in global optimizers 0.000s time in final optimizers 0.000s time in cleanup optimizers 0.000s 0 - 0.002s 0 (0.000s in global opts, 0.000s io_toposort) - 103 nodes - 0.002172s - ('blas_opt_inplace', 'TopoOptimizer', 41, 71, 71) - 0.000s TopoOptimizer InplaceBlasOpt nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.0001881122589111328 loop time 0.0019562244415283203 callback_time 0.0 0.001830s - ('local_dnna_conv_inplace', 'TopoOptimizer', 44, 71, 71) - 0.000s TopoOptimizer local_dnna_conv_inplace nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.0002052783966064453 loop time 0.001589059829711914 callback_time 0.0 0.001811s - ('gpuablas_opt_inplace', 'TopoOptimizer', 42, 71, 71) - 0.000s TopoOptimizer InplaceGpuaBlasOpt nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.00020265579223632812 loop time 0.0015709400177001953 callback_time 0.0 0.001670s - ('c_blas_destructive', 'TopoOptimizer', 43, 71, 71) - 0.000s TopoOptimizer c_blas_destructive nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.00022673606872558594 loop time 0.0014247894287109375 callback_time 0.0 0.000896s - ('AbstractConvCheck', 'TopoOptimizer', 19, 103, 103) - 0.000s TopoOptimizer AbstractConvCheck nb_node (start, end, changed) (103, 103, 0) init io_toposort 0.0002989768981933594 loop time 0.0005784034729003906 callback_time 0.0 0.000863s - ('inline_ofg_expansion', 'TopoOptimizer', 0, 198, 198) - 0.000s TopoOptimizer inline_ofg_expansion nb_node (start, end, changed) (198, 198, 0) init io_toposort 0.0004627704620361328 loop time 0.0003819465637207031 callback_time 0.0 0.000628s - ('add_destroy_handler', 'AddDestroyHandler', 23, 71, 71) - 0.000s 0.000555s - ('local_elemwise_alloc', 'TopoOptimizer', 11, 78, 78) - 0.000s TopoOptimizer local_elemwise_alloc nb_node (start, end, changed) (78, 78, 0) init io_toposort 0.0002257823944091797 loop time 0.00031828880310058594 callback_time 0.0 0.000489s - ('random_make_inplace', 'TopoOptimizer', 50, 75, 75) - 0.000s TopoOptimizer random_make_inplace nb_node (start, end, changed) (75, 75, 0) init io_toposort 0.0002105236053466797 loop time 0.00025773048400878906 callback_time 0.0 0.000453s - ('local_advincsub1_gpua_inplace', 'TopoOptimizer', 24, 71, 71) - 0.000s TopoOptimizer local_advincsub1_gpua_inplace nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.00020360946655273438 loop time 0.00022721290588378906 callback_time 0.0 0.000429s - ('cond_make_inplace', 'TopoOptimizer', 49, 75, 75) - 0.000s TopoOptimizer cond_make_inplace nb_node (start, end, changed) (75, 75, 0) init io_toposort 0.00023126602172851562 loop time 0.00018358230590820312 callback_time 0.0 0.000412s - ('local_inplace_DiagonalSubtensor', 'TopoOptimizer', 31, 71, 71) - 0.000s TopoOptimizer local_inplace_DiagonalSubtensor nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.000194549560546875 loop time 0.0002071857452392578 callback_time 0.0 0.000406s - ('local_fill_to_alloc', 'TopoOptimizer', 10, 78, 78) - 0.000s TopoOptimizer local_fill_to_alloc nb_node (start, end, changed) (78, 78, 0) init io_toposort 0.00022602081298828125 loop time 0.00016570091247558594 callback_time 0.0 0.000405s - ('mrg_random_make_inplace', 'TopoOptimizer', 51, 75, 75) - 0.000s TopoOptimizer random_make_inplace_mrg nb_node (start, end, changed) (75, 75, 0) init io_toposort 0.00020837783813476562 loop time 0.00018930435180664062 callback_time 0.0 0.000391s - ('local_inplace_sparse_block_gemv', 'TopoOptimizer', 37, 71, 71) - 0.000s TopoOptimizer local_inplace_sparse_block_gemv nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.00020885467529296875 loop time 0.0001556873321533203 callback_time 0.0 0.000387s - ('local_batch_norm_inference_inplace', 'TopoOptimizer', 25, 71, 71) - 0.000s TopoOptimizer local_batch_norm_inference_inplace nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.00020265579223632812 loop time 0.0001609325408935547 callback_time 0.0 0.000376s - ('local_batch_norm_inplace_output', 'TopoOptimizer', 26, 71, 71) - 0.000s TopoOptimizer local_batch_norm_inplace_output nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.0001990795135498047 loop time 0.000156402587890625 callback_time 0.0 0.000367s - ('local_inplace_gpu_magma_cholesky', 'TopoOptimizer', 33, 71, 71) - 0.000s TopoOptimizer local_inplace_gpu_magma_cholesky nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.0001938343048095703 loop time 0.00015664100646972656 callback_time 0.0 0.000364s - ('local_inplace_sparseblockgemv', 'TopoOptimizer', 39, 71, 71) - 0.000s TopoOptimizer local_inplace_sparseblockgemv nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.0001900196075439453 loop time 0.00015616416931152344 callback_time 0.0 0.000363s - ('local_inplace_sparse_block_outer', 'TopoOptimizer', 38, 71, 71) - 0.000s TopoOptimizer local_inplace_sparse_block_outer nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.00019311904907226562 loop time 0.0001494884490966797 callback_time 0.0 0.000362s - ('local_inplace_gpu_cholesky', 'TopoOptimizer', 32, 71, 71) - 0.000s TopoOptimizer local_inplace_gpu_cholesky nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.00019431114196777344 loop time 0.00015163421630859375 callback_time 0.0 0.000360s - ('local_inplace_gpu_solve', 'TopoOptimizer', 35, 71, 71) - 0.000s TopoOptimizer local_inplace_gpu_solve nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.0001933574676513672 loop time 0.0001513957977294922 callback_time 0.0 0.000359s - ('local_batch_norm_inplace_running_var', 'TopoOptimizer', 28, 71, 71) - 0.000s TopoOptimizer local_batch_norm_inplace_running_var nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.00020265579223632812 loop time 0.0001399517059326172 callback_time 0.0 0.000358s - ('local_batch_norm_inplace_running_mean', 'TopoOptimizer', 27, 71, 71) - 0.000s TopoOptimizer local_batch_norm_inplace_running_mean nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.00019693374633789062 loop time 0.00014328956604003906 callback_time 0.0 0.000357s - ('local_inplace_sparseblockouter', 'TopoOptimizer', 40, 71, 71) - 0.000s TopoOptimizer local_inplace_sparseblockouter nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.00019049644470214844 loop time 0.0001499652862548828 callback_time 0.0 0.000356s - ('local_inplace_gpu_magma_matrix_inverse', 'TopoOptimizer', 34, 71, 71) - 0.000s TopoOptimizer local_inplace_gpu_magma_matrix_inverse nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.0001933574676513672 loop time 0.0001533031463623047 callback_time 0.0 0.000354s - ('local_inplace_AdvancedIncSubtensor', 'TopoOptimizer', 29, 71, 71) - 0.000s TopoOptimizer local_inplace_AdvancedIncSubtensor nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.00019431114196777344 loop time 0.00014400482177734375 callback_time 0.0 0.000349s - ('local_inplace_AdvancedIncSubtensor1', 'TopoOptimizer', 30, 71, 71) - 0.000s TopoOptimizer local_inplace_AdvancedIncSubtensor1 nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.00019407272338867188 loop time 0.0001404285430908203 callback_time 0.0 0.000336s - ('make_ger_destructive', 'TopoOptimizer', 45, 71, 71) - 0.000s TopoOptimizer make_scipy_blas_destructive nb_node (start, end, changed) (71, 71, 0) init io_toposort 0.0002048015594482422 loop time 0.00011682510375976562 callback_time 0.0 0.000324s - ('crossentropy_to_crossentropy_with_softmax', 'FromFunctionOptimizer', 16, 103, 103) - 0.000s 0.000303s - ('gpua_elemwise_fusion', 'FusionOptimizer', 21, 87, 87) - 0.000s FusionOptimizer nb_iter 1 nb_replacement 0 nb_inconsistency_replace 0 validate_time 0.0 callback_time 0.0 time_toposort 0.0002446174621582031 0.000291s - ('gpua_inplace_opt', 'InplaceElemwiseOptimizer', 46, 71, 71) - 0.000s InplaceElemwiseOptimizer node_before 71 nb_call_replace 0 nb_call_validate 0 nb_inconsistent 0 0.000010s - ('merge1.2', 'MergeOptimizer', 8, 78, 78) - 0.000s MergeOptimizer nb fail= 0 merged= 0 constant= 0 time replace=0.00 validate=0.00 callback=0.00 ```

An important thing to notice about this output is that the total optimizer time is rather small compared to the actual time it takes to compile the function (not shown, but it's in the minutes).

This appears to be due to Scan.make_thunk being called multiple times during compilation, which is—in part—due to the presence of nested Scans, but also apparently due to multiple compilations of the same Scans (e.g. I'm seeing duplicated verbose optimizer output).

It looks like we might want to address the reason why we're doing multiple Op.make_thunk passes on the same graphs; otherwise, this repeat-compilation problem would be directly solved by https://github.com/aesara-devs/aesara/issues/777. https://github.com/aesara-devs/aesara/issues/777 would also address our inability to gather complete optimization profile statistics for nested Scans.