Open fingolfin opened 8 years ago
This is really weird, because the Sylow subgroup is just written down -- there is no calculation, but just construction of permutations and of a group.
I have tried profiling this (you can see the outputs at https://caj.host.cs.st-andrews.ac.uk/covs/gap-out/ - gap and https://caj.host.cs.st-andrews.ac.uk/covs/hpc-out/ - HPCgap). These were generated by:
ProfileLineByLine("~/profout.gz"); SylowSubgroup(SymmetricGroup(350),2);; UnprofileLineByLine();
(remember to call the two prof outputs different names!) then running
LoadPackage("profiling"); OutputAnnotatedCodeCoverageFiles("~/profout.gz", "~/temp/hpc-out");
(this uses profiling 0.4.0, released today).
In both cases the time is being spent in StabChainRandomPermGroup, which spends most of it's time in SCRSift. However, it is ImageInWord
which seems much slower. Not sure why (and of course, these profiles could be misleading!)
About half the overhead is due to read- and writeguards. You can see this by compiling with make WARD=
to disable Ward. It may well be that this involves code where Ward doesn't do a good job of hoisting checks out of loops (another reason why I think a rewrite of Ward matters, incidentally).
I'm really not sure where the other 15% come from right now, though. If I had to guess, I'd say it's a performance regression either in the lib or in the kernel, but even with Instruments, I can't find a hotspot in the kernel that looks particularly crowded.
It looks like the main culprit here is the number of permutation objects that are being created and immediately discarded in line 671 of stbcrand.gi. This throws up a few things:
One problem is that the optimisations which are right for small base (avoid multiplying permutations at any cost) are not right in other settings.
Anyway I'll try a kernel version of SCRSift that multiplies a permutation in-place.
It looks like the main culprit here is the number of permutation objects that are being created and immediately discarded in line 671 of stbcrand.gi.
While that may contribute performance loss, I did not observe a whole lot of time being spent in the GC for this particular example. In fact, disabling the GC entirely did only improve performance by 3%-4% on its own (though this does not account for allocations being more expensive on balance). Most likely, it's a lot of little things coming together.
Kernel version of SCRSift in #525. That's set up as a patch to master but I see no reason why it shouldn't work in HPCGAP.
Consider this command:
SylowSubgroup(SymmetricGroup(350),2);;
In HPC-GAP this takes 7.3 seconds for me, in classic GAP only 5.3 seconds -- so HPC-GAP is about 30% slower.
It would be nice to know why this is so.... And perhaps we can improve this a little bit?
UPDATE: The original example now just takes a few milliseconds on either system due to algorithmic improvements. But other examples exist. E.g. from PR #1326 (which includes more details):
g:=WreathProduct(MathieuGroup(9),Group((1,2)));; ConjugacyClassesSubgroups(g);;
For me that takes about 3.2 seconds in GAP, and 5.7 seconds in HPC-GAP.