Speed up preprocessing?

spth commented 4 years ago

For one of the largest function in the standard library that comes with SDCC, printf, treedec::comb::PP_FI_TM takes much longer to get a tree decomposition than Thorup.

According to valgrind, at -O1, 71 % of the time spent in treedec::comb::PP_FI_TM is just for the boost::clear_vertex call in line 293 of preprocessing.hpp (the call that has the // expensive? comment directly above it).

Would it be possible to improve speed here somehow?

felix-salfelder commented 4 years ago

On Sun, Nov 15, 2020 at 08:29:48AM -0800, Philipp Klaus Krause wrote:

Would it be possible to improve speed here somehow?

I can think of several ways. The plan was to create an interface to the preprocessor, and then try other implementations. It never happened, but there are some traces in the code hinting at how to do it.

It is also well possible that the bottleneck moves to another place when using a different adjacency structure. linked lists tend to be relatively slow for smaller degree.

llarisch commented 4 years ago

Can you please report on runtime for PP, PP_FI and PP_FI_TM separately such that we know how much time is spent in the subalgorithms? The complexity of clear_vertex depends on the edge set type chosen. What graph type are you using?

spth commented 4 years ago

Can you please report on runtime for PP, PP_FI and PP_FI_TM separately such that we know how much time is spent in the subalgorithms? The complexity of clear_vertex depends on the edge set type chosen. What graph type are you using?

I can only reproduce this at -O1. So I do not know how much of a problem there is at -O2 vs. me just not being able to read the information at -O2 due to stuff not being visible in valgrind.

Relative (to treedec::comb::PP_FI_TM) runtime: 100% in treedec::comb::PP_FI_TM, distributed as follows: 22 % in treedec::glue_bag 77% in treedec::preprocessing, distributed as follows: 71% in boost::clear_vertex, distributed as follows: 30% in delete 17% in new

treedec::comb::PP_FI_TM) is called from three places, relative runtime: 21% redundancy elimination 40% address space switching 39% register allocation

In each case, a tree-decomposition of a cfg is computed. The cfg is first copied into a cfg2_t, which is then passed to treedec::comb::PP_FI_TM.

typedef boost::adjacency_list<boost::setS, boost::vecS, boost::undirectedS> cfg2_t;
cfg2_t cfg2;
copy_undir(cfg2, cfg);
treedec::comb::PP_FI_TM<cfg2_t> a2(cfg2);
a2.do_it();
a2.get_tree_decomposition(tree_dec2);

felix-salfelder commented 3 years ago

71% in boost::clear_vertex, distributed as follows: 30% in delete 17% in new

This should improve with vector based adjacency list, as these are manipulated in-place.

typedef boost::adjacency_list<boost::setS, boost::vecS, boost::undirectedS> cfg2_t;
cfg2_t cfg2;
typedef boost::adjacency_list<boost::vecS, boost::vecS, boost::undirectedS> cfg2_t;
cfg2v_t cfg2;
treedec::comb::PP_FI_TM<cfg2v_t> a2(cfg2);

Should do and make use of boost::erase_if. I will make sure this works and add a test as soon as I can.

Also, please check if you specified NDEBUG during compilation. Omitting it will result in a debug mode build which may be an order of magnitude slower.

freetdi / tdlib

Speed up preprocessing? #35