cfmrp / mtool

Software to Manipulate Different Flavors of Semantic Graphs
http://mrp.nlpl.eu
GNU Lesser General Public License v3.0

Allow hill-climbing initialization using SMATCH causes major slowdown #52

Closed: foxik closed this issue 5 years ago

foxik commented 5 years ago

Consider the 5th sentence from eds sample:

sed -n '5p' data/sample/eds/wsj.mrp >eds5

Running in current head (07b10499f28a at the time of writing) the following command

main.py --score mces --gold eds5 --read mrp eds5

takes 9 seconds, while turning off the SMATCH hill-climbing initialization by adding `if False` on line https://github.com/cfmrp/mtool/blob/07b10499f28ae9d67da6a811019bfcc5cd3ba978/score/mces.py#L326 allows the above command to finish in ~0.8s.
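The change described above amounts to guarding the initialization call with an always-false condition. A minimal sketch of that pattern, with placeholder names (the real call site is in score/mces.py; nothing below is copied from mtool):

```python
# Placeholder sketch; the names here are illustrative, not mtool's.
USE_SMATCH_INITIALIZATION = False  # the experiment: skip RRHC-based seeding


def smatch_initialization(gold, system):
    """Stub standing in for the RRHC-based node correspondence."""
    return {}


def initial_correspondence(gold, system):
    # With the guard disabled, MCES falls back on its own scheduling
    # heuristics instead of an RRHC-seeded node mapping.
    if USE_SMATCH_INITIALIZATION:
        return smatch_initialization(gold, system)
    return None
```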

Running

main.py --score smatch --gold eds5 --read mrp eds5

finishes in ~1s.

The problem is that the smatch call at https://github.com/cfmrp/mtool/blob/07b10499f28ae9d67da6a811019bfcc5cd3ba978/score/mces.py#L328 takes a long time -- compared to the smatch metric, an explicit limit of 50 is passed there, while smatch uses a default limit of 0, which translates to 5; see https://github.com/cfmrp/mtool/blob/07b10499f28ae9d67da6a811019bfcc5cd3ba978/score/smatch.py#L79
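The limit handling described here can be summarized in a small sketch (illustrative only; the constant and names follow the description above, not mtool's code):

```python
# Illustrative: a caller-supplied limit of 0 means "use the built-in default".
DEFAULT_LIMIT = 5  # what smatch reportedly falls back to for limit == 0


def effective_limit(limit=0):
    """Map a limit of 0 to the built-in default, as described for score/smatch.py."""
    return DEFAULT_LIMIT if limit == 0 else limit
```

Under this reading, the mces call site that passes an explicit limit of 50 performs ten times as many restarts as a plain smatch run with the default.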

Currently, running mces evaluation on the eds sample (89 sentences) takes ~5 minutes, while not using SMATCH initialization takes ~20s; using SMATCH initialization with limit 5 takes ~45s, and with limit 2 it takes ~28s. So, without any understanding of the internals of the metrics, the SMATCH initialization does not seem to help with speed (it may of course help obtain better results; but I think mces should give deterministic results).
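Timings like the ones above can be reproduced with a small generic harness (plain Python, not part of mtool; which commands to benchmark is up to the caller):

```python
import subprocess
import sys
import time


def time_command(argv, repeats=3):
    """Run argv several times; return the best wall-clock time in seconds.

    Taking the minimum over repeats reduces noise from a loaded machine."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(argv, check=True, capture_output=True)
        best = min(best, time.perf_counter() - start)
    return best


# Example: time a trivial interpreter start-up.
if __name__ == "__main__":
    print(time_command([sys.executable, "-c", "pass"], repeats=1))
```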

oepen commented 5 years ago

this turned out to be a more convoluted space than we had first anticipated! in general, MCES tends to be at least an order of magnitude slower than SMATCH (or, more precisely, its random-restart hill-climbing search: RRHC), at least when using the rather generous default limit of half a million search steps. MCES by itself should be deterministic, yes, but it does not always find an exact solution within a given (reasonable) limit.
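for readers not familiar with the acronym: random-restart hill-climbing runs several independent greedy searches from random starting points and keeps the best result; the restart count is exactly the kind of limit discussed in this issue. a generic, self-contained sketch (unrelated to mtool's actual implementation):

```python
import random


def hill_climb(score, neighbors, start, max_steps):
    """Greedy ascent: move to a better neighbor until none exists or the budget runs out."""
    current = start
    for _ in range(max_steps):
        best = max(neighbors(current), key=score, default=current)
        if score(best) <= score(current):
            break  # local optimum reached
        current = best
    return current


def rrhc(score, neighbors, random_start, restarts, max_steps, seed=0):
    """Random-restart hill climbing: best result over several independent climbs.

    More restarts raise the chance of finding a good optimum but cost time."""
    rng = random.Random(seed)  # a fixed seed keeps the search deterministic
    best = None
    for _ in range(restarts):
        candidate = hill_climb(score, neighbors, random_start(rng), max_steps)
        if best is None or score(candidate) > score(best):
            best = candidate
    return best
```

with a fixed seed the search is deterministic; without one, different runs can land in different local optima, which is the source of the non-determinism at issue here.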

in general, there is substantial variation in how different parameterizations of the scorers affect results, both in terms of scores and running time.

we recently activated MCES initialization with RRHC-based node correspondences, to make sure that MRP scores will always be at least the same as or higher than SMATCH. this adds potential for non-determinism, which we hope to counteract with the generous limit of 50 for the RRHC part. this makes MCES substantially slower for two reasons: (a) the added time for the RRHC search and (b) reduced compatibility between the initial correspondence and the scheduling heuristics in the MCES search.

we already feel a tad bad about tweaking the MRP metric this close to the evaluation period, hence will now be increasingly cautious about making further revisions (until after the official scores have been computed, that is :-). our reasoning in recent decision making has been that reliability (accuracy, determinism, and such) is more important than efficiency. in system development, i can of course see that waiting many minutes for an updated score can be inconvenient! the framework-specific metrics all tend to run a lot faster than the MRP scorer, and (as you are starting to explore) one can of course reduce the --limit for the MCES search and disable the RRHC-based initialization, to reduce running time. with the possible exception of the UCCA graphs (where we are still gathering experimental data ourselves), i currently have no reason to believe that such optimizations make a substantial difference to the resulting scores.

foxik commented 5 years ago

I definitely understand the concern about changing the scorer. Just a few remarks (without any call to action :-)

oepen commented 5 years ago

thanks again, milan! i am closing this issue now, as we have consolidated the default values for the two metrics and provided more control from the command line. while some of the code optimizations merged into the ‘dev’ branch earlier today can yield speed-ups of up to a factor of two, at this point i suspect our best short-term shot at dramatically improving experimental turn-around will be through #16 (trivial parallelization of the scorer). i hope to still look into that before next week!