gap-system / gap

Main development repository for GAP - Groups, Algorithms, Programming, a System for Computational Discrete Algebra
https://www.gap-system.org
GNU General Public License v2.0

grpauto.tst spends a lot of time in CmpVec8BitVec8Bit #1868

Open stevelinton opened 6 years ago

stevelinton commented 6 years ago

Just adding this as an issue to raise and share the results of some poking around I've been doing. No immediate problem as such.

Observed behaviour

Profiling teststandard with gprof reveals that a lot of time is being spent in the kernel function CmpVec8BitVec8Bit. About 3 billion calls, taking a total of 138s on mandel. A few weeks ago there were some other outliers as well, like QuoIntPerm2 and ElmsBlist, but we've dealt with those. Anyway, the comparisons turn out to come from testextra/grpauto.tst where part of the automorphism group code (CompatiblePairs) ends up doing a lot of calls like m in sorted_list_of_matrices where m is typically a small compressed matrix and sorted_list_of_matrices is also often quite short. The kernel delegates from \in to Position which ends up doing a binary search, which in turn does lots of comparisons of compressed matrices, which then compare the rows...
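
For reference, here is a minimal GAP sketch of the hot pattern; the specific matrix and list are made up for illustration:

    m := ImmutableMatrix( GF(7), [ [ Z(7), 0*Z(7) ], [ 0*Z(7), Z(7)^0 ] ] );;
    mats := Set( [ m ] );;  # sorted list of compressed matrices, often very short
    m in mats;              # delegates to Position: binary search, row-by-row comparisons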

Each individual call is not very expensive, but it involves accessing quite a lot of bits of data that may be in different cache lines -- master pointer, object, master pointer of the type and type id for the matrices, then master pointer, object, master pointer of the field info and the field info itself for each vector. Once it's done that, it compares one or two bytes of data in each row -- if they're not equal, it also has to convert the entries in the final byte into FFEs and compare those... All in all, a lot of work and cache pressure to compare two 2x2 matrices over GF(7). That said, experiments suggest this is still about 30% faster than comparing uncompressed matrices.

By the way, if you're trying to replicate this, you want to remove tst/testextra/switch_obj.tst, which does a LOT of forced garbage collections and distorts the profile. In the rest of the tests, GC time is generally under 10%, a little more for permutation-heavy code.

Expected behaviour

I guess the expected behaviour is to run faster, but it's far from clear what the best way to achieve that is. It may be that CompatiblePairs can use a hash table instead, so that most matrices never need to be accessed at all because they have the wrong hash value (a sketch follows below), or use a permutation representation for smaller matrix groups, or avoid the work some other way. A nicer solution would be an alternative compressed matrix representation designed for small matrices: if you don't mind the rows of the matrix not being objects, then you can design much nicer representations, but this might break quite a lot of code of the form for row in mat do for elm in row do ... od; od;. Finally, \in could be a bit cleverer about not delegating to Position for very short lists, which might save one or two comparisons for lists of length 1 and maybe 2.
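
As a hedged sketch of the hash-table idea, here it is in terms of the library's generic dictionaries (NewDictionary may fall back to a sorted-list representation unless it knows a hash function for the objects, and whether CompatiblePairs can be restructured this way is exactly the open question):

    m := ImmutableMatrix( GF(7), [ [ Z(7), 0*Z(7) ], [ 0*Z(7), Z(7)^0 ] ] );;
    d := NewDictionary( m, false );;  # membership-only dictionary keyed on such matrices
    AddDictionary( d, m );;
    KnowsDictionary( d, m );          # true; with hashing, a miss is usually rejected by the hash value alone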

fingolfin commented 6 years ago

As to the last part: allowing matrices where the rows are not separate objects is of course a central aspect of the matrixobj project, so I consider this another strong motivation to work on that.

hulpke commented 6 years ago

I'm a bit surprised that this ends up as a timing issue, since the code (I think I'm talking about the same code you observed) is using element lists precisely to save on element tests.

Roughly, a generating set for a subgroup is built up in a spinning algorithm, so the basic operation is to test membership of an element in the group built so far. If the group has at most 50000 elements, an element list is used instead of an algorithmic element test (which would translate to a permutation group and so would have to obtain the permutation image of a matrix first).
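
In simplified, hypothetical form (this is a sketch, not the actual library code), the loop looks like this:

    # Close a generating set of a finite group under multiplication,
    # using an element list for the membership tests.
    SpinElements := function( gens )
      local orb, set, i, g, x;
      orb := ShallowCopy( gens );  # elements in discovery order
      set := Set( gens );          # sorted copy used only for the 'in' tests
      i := 1;
      while i <= Length( orb ) do
        for g in gens do
          x := orb[i] * g;
          if not x in set then     # the hot membership test
            Add( orb, x );
            AddSet( set, x );
          fi;
        od;
        i := i + 1;
      od;
      return set;
    end;;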

Now a lot of tests are done (and indeed one can reduce this number; see 16de963efe26c9653cb15369ad54de22991da947, which might help a bit).

What I'm confused about is what would be better:

a) Do an algorithmic element test: I cannot see how this would be an improvement, as it involves a lot of arithmetic.

b) Hashing: The current hash routine for matrices builds on finite-field vectors, doing arithmetic with the vector entries and calling Int to move into characteristic zero. I cannot imagine this being faster. Is there a matrix hash key function in the kernel that really just looks at how things are stored? (Caveat: the hash function must return the same value for compressed and uncompressed objects!) This would be useful in other places.

Alternatively, one could look at a few positions that make all matrices differ and extract these into a vector for hashing.

c) Convert the matrices to vectors by concatenation (see the sketch below)?

d) The planned matrix objects?

If someone has a better idea on how to speed up these redundancy tests, I'd be interested to hear it.
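
For (c), a minimal sketch; FlattenMat is a hypothetical helper, not a library function:

    # Flatten a compressed matrix over GF(7) into one compressed vector by
    # concatenating its rows, so comparisons touch a single object.
    FlattenMat := m -> ImmutableVector( GF(7), Concatenation( List( m, ShallowCopy ) ) );;
    m := ImmutableMatrix( GF(7), [ [ Z(7), 0*Z(7) ], [ 0*Z(7), Z(7)^0 ] ] );;
    v := FlattenMat( m );;  # one 8-bit vector of length 4 instead of two row objects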

fingolfin commented 6 years ago

Just some rambling and somewhat incoherent thoughts and remarks: I don't think the options @hulpke lists are necessarily mutually exclusive. For top performance, I believe one must force all matrices to use the same representation. This makes it possible to cache the comparison function effectively and to use a single hash function (which then also does not need to bother staying compatible with other representations, and thus can be tuned aggressively). Also, using "true" compressed matrices will help in certain cases, e.g. the 2x2 GF(7) matrices Steve mentioned, which would fit into 12 bits. (If the matrix rep is tight like that, it might of course be better to use, say, a balanced tree or a sorted list instead of a hash table. Heck, if a matrix only takes up 12 bits, we can have a bit set with 2^12 = 4096 entries -- just 512 bytes -- recording for each matrix whether it's in there or not; but of course having such tiny vector spaces is a very special case.)

Not needing to store separate row objects would save indirections, cache misses and lots of overhead: on a 64-bit system, every positional object has a 16-byte header, plus another 8 bytes for a pointer to the type, and the bag size is effectively rounded up to a multiple of 8. So even if the actual data is just 2 bytes, we store a whopping 32 bytes (nicely confirmed by MemoryUsage([]) also giving 32). This does not yet include the master pointer, and access to these objects typically involves dereferencing that, and also the type object pointer. That all really adds up compared to the work of comparing those few bits in the vector.
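
For instance:

    MemoryUsage( [] );  # 32 on a 64-bit system: header + type pointer + rounded-up data
    v := [ Z(7), 0*Z(7) ];;  ConvertToVectorRep( v, 7 );;
    MemoryUsage( v );   # still a full bag, although the payload is only a few bytes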

By squashing the rows together and thus having them effectively share a single type, we avoid lots of indirections and can also squeeze more data into the cache.

Of course one can then still easily beat this with custom code, which e.g. implements a "list of compressed matrices of identical type", where each matrix is not a full object/bag, but rather just a few reserved bytes in a single larger object; i.e. something vaguely like a "tensor", but optimized for this use case, i.e. looking up whether a given matrix is contained (accessing "subobjects" would be relatively slow, as a new "real" matrix object would have to be created on the fly). Anyway, this would save a lot of additional overhead and ensure everything is stored together, hopefully in the same cache line(s).

Of course for matrices of larger dimension or over larger fields, different approaches might be better; there is probably no single "perfect" data structure for this, and (as so often) various heuristics would have to be applied to choose a (hopefully) good one.

fingolfin commented 6 years ago

So "Convert the matrices to vectors by concatenation" would be helpful right now (it avoids the row objects), and with MatrixObj you can do the same internally, while still being able to access the data as a matrix. So this is mostly good for convenience.

If you then ensure that all those concatenated vectors / matrix objects use the same representation, you can use a hash table (or whatever) with a fixed and efficient hash function, and you can also look up the equality method once and reuse it, thus avoiding repeated method selection (see the sketch below).
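
A hedged sketch of caching the comparison, using the standard ApplicableMethod tool (whether this actually pays off for kernel-handled comparisons is untested):

    m1 := ImmutableMatrix( GF(7), [ [ Z(7) ] ] );;
    m2 := ImmutableMatrix( GF(7), [ [ Z(7)^0 ] ] );;
    eq := ApplicableMethod( \=, [ m1, m2 ] );;  # select the method once
    eq( m1, m2 );                               # subsequent direct calls skip method dispatch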

stevelinton commented 6 years ago

Just to report a few more experiments: Alexander's patch does reduce the time in CmpVec8Bit... a bit, but adds much more time in permutation multiplication. I tried some changes to InList. At the moment a successful '\in' test for a length-1 list does two comparisons -- one to discover that the object belongs in position 1 and another to check that it is equal to what is currently in position 1. This can obviously be improved (see the sketch below). I also adjusted the vector code so that if all you want is an equality test, it doesn't mess around trying to sort out which operand is less (which is fairly complicated). This dramatically reduced the number of calls to the comparison, but only saved about 10% of the time. It seems that doing the same comparison twice is NOT much more expensive than doing it once, which supports my conjecture that this is all about cache misses.
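
At the GAP level the shortcut amounts to something like this; InShortList is a hypothetical stand-in for the kernel's InList:

    InShortList := function( x, list )
      if Length( list ) = 1 then
        return x = list[1];  # one comparison instead of two
      fi;
      return x in list;      # fall back to the generic membership test
    end;;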

This does support the thrust of the matrix objects work to allow a whole matrix (or even a whole set of matrices) to reside in one object. Hashing would also help, because a successful '\in' test would normally only have to do one hash computation and one matrix comparison, no matter how large the set (although in this case the set often seems to have length 1).

hulpke commented 6 years ago

I think at this point we're in a situation where a significant improvement would come from hashing. The current hashing of matrices over a finite field happens through (probably badly written) ad-hoc library code. To improve on this, I think we need kernel functionality that hashes compressed vectors and matrices directly from their stored data, returning the same value for compressed and uncompressed objects.

If such kernel functionality becomes available, I'd be happy to rewrite this (and other) matrix group code to improve performance along the lines outlined above.

fingolfin commented 6 years ago

Yes, this is definitely something I want, too, and part of our (well, mine at least) motivation for creating the datastructures package was to get to this eventually. But not for 4.9, obviously, maybe for 4.10 :-)