Reduce size of the data files

fingolfin commented 6 months ago

Currently there are 5 compressed data files (QUIMP[1-5].tar.bz2) in the repository which take up 9-69 MB each for a total of 190 MB. The user has to extract them for a total of 770 MB.

This should be reduced. Here are several ideas, which can be combined.

First off, GAP can transparently access .gz files. This suggests storing not, e.g., lib/QUIMP_336.g but rather lib/QUIMP_336.g.gz in the archives, so that disk space usage is reduced for the end user. The result is "only" 270 MB.

This would in fact allow shipping the files "directly" to the user, without a need for .tar.bz2 files. These could then also be removed from the repository, which would be better anyway: we could instead keep the lib/QUIMP_*.g files in the repository directly (and compress them on the fly for releases, which we already do for multiple other packages).
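To illustrate the transparent .gz access, here is a minimal sketch. It assumes a `gzip` executable is on the PATH; the file name `demo.g` and the variable `demo_value` are made up for the illustration and have nothing to do with the actual package files.

```gap
# Write a tiny GAP file into a temporary directory (hypothetical names).
dir := DirectoryTemporary();;
fname := Filename(dir, "demo.g");;
PrintTo(fname, "demo_value := 42;\n");

# Compress it; gzip replaces demo.g by demo.g.gz (assumes gzip is installed).
Exec("gzip", fname);

# Read the file under its ORIGINAL name; GAP should fall back to demo.g.gz.
Read(fname);
Print(demo_value, "\n");
```

If this works as expected, the package code that reads lib/QUIMP_*.g would not need to know whether the files are compressed.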

Next, the content of the lib/QUIMP_*.g files could be optimized further.

@aniemeyer suggests that for many groups a good way to compress them is to store them via generators in a different, minimal degree representation, and then store generators of a subgroup such that the action on the cosets of that subgroup gives the actual QUIMP permutations. Indeed, take for example QuimpGroup(4080,1). In the file lib/QUIMP_4080.g it takes up more than 0.5 MB of space. But it is $A_{17}$ in disguise. So one could replace the generators by the information "this is A17" plus generators for the point stabilizer:

gap> G := QuimpGroup(4080,1);
<permutation group with 2 generators>
gap> IsAlternatingGroup(G);
true
gap> Size(G);
177843714048000
gap> Size(AlternatingGroup(17));
177843714048000
gap> A := AlternatingGroup(17);;
gap> iso:=IsomorphismGroups(A,G);;
gap> S:=PreImages(iso,Stabilizer(G,1));
Group([ (2,12,15,3,9), (1,16,8,6,9,3,2,11,17)(12,13,14), (1,16,17,13,11), (1,16,13,11)(2,4,14,17) ])
gap> SmallGeneratingSet(S);
[ (1,12,15,14,11,2,8,6,3,16,13,9)(4,17), (1,13,12)(2,3,11,16,6)(4,15,14)(8,17,9) ]
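Going the other way, the degree 4080 action can be rebuilt from this compact description. The following is only a sketch: it takes the two generators printed by SmallGeneratingSet above and lets GAP construct the action on the right cosets. The resulting group is permutation isomorphic to the stored one, but the numbering of the points depends on how GAP enumerates the cosets, so reproducing the exact stored permutations would need an additional convention.

```gap
# Compact description: "this is A17" plus two generators of a point stabilizer.
A := AlternatingGroup(17);;
S := Subgroup(A, [ (1,12,15,14,11,2,8,6,3,16,13,9)(4,17),
                   (1,13,12)(2,3,11,16,6)(4,15,14)(8,17,9) ]);;

# Sanity check: the index should match the degree of the QUIMP group.
Print("index: ", Index(A, S), "\n");

# Action of A on the right cosets of S, as a permutation group.
hom := FactorCosetAction(A, S);;
H := Image(hom);;
Print("degree: ", NrMovedPoints(H), "\n");
Print("faithful: ", Size(H) = Size(A), "\n");
```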

fingolfin commented 6 months ago

To stay with the QuimpGroup(4080,1) example: in each entry, three groups are stored:

QUIMP_4080[1][1] is the group itself;
QUIMP_4080[1][3] is the socle;
QUIMP_4080[1][4] is... perhaps the group T if the socle is T^k? But I didn't see any references to this in the code.

@DominikBernhardt is the format of the data files documented somewhere?

Anyway, in this specific example all three groups are the same. I think the socle should always be expressed in terms of the generators of the full group, perhaps via words in the generators. Doing so, I think this entry of more than 500 kB could be shrunk by a factor of 500.

It won't be as dramatic everywhere, but I am hopeful we can reduce by at least an order of magnitude.
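As a rough sketch of the "words in the generators" idea, the snippet below uses a small stand-in group rather than an actual QUIMP entry. An element is turned into a word in the group generators, the word is reduced to a plain list of small integers (which is what one would store), and then the element is rebuilt from that list. The choice of group and element is purely illustrative.

```gap
# Small stand-in for a QUIMP group (illustrative only).
G := Group((1,2), (1,2,3,4,5));;
hom := EpimorphismFromFreeGroup(G);;
F := Source(hom);;

# Express an element as a word in the generators of G.
w := PreImagesRepresentative(hom, (1,2,3));;

# Store the word as a list of integers (i for generator i, -i for its inverse).
letters := LetterRepAssocWord(w);;

# Later: rebuild the word and evaluate it on the generators of G.
w2 := AssocWordByLetterRep(FamilyObj(w), letters);;
elm := MappedWord(w2, GeneratorsOfGroup(F), GeneratorsOfGroup(G));;
Print(elm = (1,2,3), "\n");
```

For large groups the words may become long, in which case straight line programs could be a more compact alternative.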

fingolfin commented 6 months ago

For the socle, we can in fact just store (information about) a normal generating set, to be fed into NormalClosure. If the socle is $T^k$ then often it will suffice to store generators for $T$.
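A minimal sketch of this, again on small stand-in groups rather than real QUIMP data: for $S_5$ a single 3-cycle is a normal generating set of the socle, and for $A_5 \wr S_2$ (socle $T^2$ with $T = A_5$) one copy of $T$ is enough.

```gap
# Almost simple case: the socle of S5 is A5, normally generated by a 3-cycle.
G := SymmetricGroup(5);;
N := NormalClosure(G, Subgroup(G, [ (1,2,3) ]));;
Print(N = Socle(G), "\n");

# Socle T^k with k = 2: A5 wr S2 acting on 10 points.
W := WreathProduct(AlternatingGroup(5), SymmetricGroup(2));;
T := Image(Embedding(W, 1));;       # first copy of A5 in the base group
M := NormalClosure(W, T);;
Print(Size(M), " ", M = Socle(W), "\n");
```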

fingolfin commented 6 months ago

Also, for the name field, at least for the AS cases, it seems the content is just what IsomorphismTypeInfoFiniteSimpleGroup gives us. In that case I don't see a point in storing it; I'd just compute it on the fly.
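Computing the name on the fly could look roughly like this. The sketch uses $A_{17}$ and $S_7$ as stand-ins; whether the stored name field matches the `name` component exactly for every entry is an assumption that would need to be checked against the data.

```gap
# Simple group: query the isomorphism type directly.
info := IsomorphismTypeInfoFiniteSimpleGroup(AlternatingGroup(17));;
Print(info.series, " ", info.parameter, " ", info.name, "\n");

# Almost simple group: the socle is simple, so query the socle instead.
info2 := IsomorphismTypeInfoFiniteSimpleGroup(Socle(SymmetricGroup(7)));;
Print(info2.series, " ", info2.parameter, " ", info2.name, "\n");
```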
