gap-system / gap

Main development repository for GAP - Groups, Algorithms, Programming, a System for Computational Discrete Algebra
https://www.gap-system.org
GNU General Public License v2.0
813 stars 161 forks source link

Understanding a HPC-GAP speed penalty #499

Open fingolfin opened 8 years ago

fingolfin commented 8 years ago

Consider this command: SylowSubgroup(SymmetricGroup(350),2);;

In HPC-GAP this takes 7.3 seconds for me, in classic GAP only 5.3 seconds -- so HPC-GAP is about 30% slower.

It would be nice to know why this is so.... And perhaps we can improve this a little bit?

UPDATE: The original example now just takes a few milliseconds on either system due to algorithmic improvements. But other examples exist. E.g. from PR #1326 (which includes more details): g:=WreathProduct(MathieuGroup(9),Group((1,2)));; ConjugacyClassesSubgroups(g);; For me that takes about 3.2 seconds in GAP, and 5.7 seconds in HPC-GAP.

hulpke commented 8 years ago

This is really weird, because the Sylow subgroup is just written down -- there is no calculation, but just construction of permutations and of a group.

ChrisJefferson commented 8 years ago

I have tried profiling this (you can see the outputs at https://caj.host.cs.st-andrews.ac.uk/covs/gap-out/ - gap and https://caj.host.cs.st-andrews.ac.uk/covs/hpc-out/ - HPCgap). These were generated by:

ProfileLineByLine("~/profout.gz"); SylowSubgroup(SymmetricGroup(350),2);; UnprofileLineByLine(); (remember to call the two prof outputs different names!) then running LoadPackage("profiling"); OutputAnnotatedCodeCoverageFiles("~/profout.gz", "~/temp/hpc-out"); (this uses profiling 0.4.0, released today).

In both cases the time is being spent in StabChainRandomPermGroup, which spends most of it's time in SCRSift. However, it is ImageInWord which seems much slower. Not sure why (and of course, these profiles could be misleading!)

rbehrends commented 8 years ago

About half the overhead is due to read- and writeguards. You can see this by compiling with make WARD= to disable Ward. It may well be that this involves code where Ward doesn't do a good job of hoisting checks out of loops (another reason why I think a rewrite of Ward matters, incidentally).

I'm really not sure where the other 15% come from right now, though. If I had to guess, I'd say it's a performance regression either in the lib or in the kernel, but even with Instruments, I can't find a hotspot in the kernel that looks particularly crowded.

stevelinton commented 8 years ago

It looks like the main culprit here is the number of permutation objects that are being created and immediately discarded in line 671 of stbcrand.gi. This throws up a few things:

  1. This whole routine SCRSift is hot spot and should probably be recoded in C.
  2. Adding a kernel routine to multiply a whole list of permutations without creating intermediates might help this and similar code by reducing waste bag creation, at the cost of making code less elegant
  3. Quite often SCRSift is called in the for IsOne(SCRSift(...)) This can be speeded up (a) because there is no need to allocate a bag at all and (b) because as soon as we find one point not fixed by the sifted word we know the answer.

One problem is that the optimisations which are right for small base (avoid multiplying permutations at any cost) are not right in other settings.

Anyway I'll try a kernel version of SCRSift that multiplies a permutation in-place.

rbehrends commented 8 years ago

It looks like the main culprit here is the number of permutation objects that are being created and immediately discarded in line 671 of stbcrand.gi.

While that may contribute performance loss, I did not observe a whole lot of time being spent in the GC for this particular example. In fact, disabling the GC entirely did only improve performance by 3%-4% on its own (though this does not account for allocations being more expensive on balance). Most likely, it's a lot of little things coming together.

stevelinton commented 8 years ago

Kernel version of SCRSift in #525. That's set up as a patch to master but I see no reason why it shouldn't work in HPCGAP.