Closed jesselansdown closed 9 months ago
The archive on Zenodo is unnecessarily large. It contains many .gz files which are not actually compressed. At the very least, all PrimitiveGroups_6561_*.g.gz files are uncompressed, but there are more, e.g. PrimitiveGroups_5329_181.g.gz.
There is also a 5MB .DS_Store file.
Just fixing the above reduces the data size to 1.7GB (I used options -9 and -n for gzip; the latter omits timestamps and filenames, which saves a few more bytes but also helps with reproducibility of the data).
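For reference, the effect of gzip -9 -n can be approximated from Python. This is a minimal sketch, not the script that was actually used: the function names are made up for illustration, and it assumes Python 3.8 or later (where gzip.compress accepts mtime). Passing mtime=0 zeroes the timestamp field, and gzip.compress never writes a filename into the header, so the same input yields the same bytes on a given platform and Python version.

```python
import gzip
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"


def gzip_reproducible(data: bytes) -> bytes:
    """Compress at level 9 with a zeroed timestamp and no stored filename."""
    # mtime=0 clears the header timestamp; gzip.compress stores no filename.
    return gzip.compress(data, compresslevel=9, mtime=0)


def recompress_file(path) -> None:
    """Rewrite `path` in place as reproducible gzip data (hypothetical helper).

    Handles both real gzip files and plain files that merely carry a .gz name.
    """
    raw = Path(path).read_bytes()
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    Path(path).write_bytes(gzip_reproducible(raw))
```

The output remains ordinary gzip data, so anything that reads .gz files today should read it unchanged.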
It is a bit annoying to clean this up with a script because there is a single directory containing ~29777 files. Various filesystems have trouble with such large directories. A natural way to avoid this would be to add subdirectories for each degree.
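A rough sketch of what the per-degree layout could look like follows. It assumes the first number in a name such as PrimitiveGroups_5329_181.g.gz is the degree, which is a guess based on the filenames mentioned above; shard_by_degree is a hypothetical helper, and whatever code loads the files would have to be adapted to look inside the subdirectories.

```python
import re
import shutil
from pathlib import Path

# Assumed naming scheme: PrimitiveGroups_<degree>_<rest>.g.gz
NAME_RE = re.compile(r"^PrimitiveGroups_(\d+)_.*\.g\.gz$")


def shard_by_degree(directory) -> int:
    """Move each matching file into a subdirectory named after its degree.

    Returns the number of files moved; non-matching entries are left alone.
    """
    root = Path(directory)
    moved = 0
    # Snapshot the listing first, since the directory is modified in the loop.
    for path in list(root.iterdir()):
        match = NAME_RE.match(path.name)
        if not (path.is_file() and match):
            continue
        target = root / match.group(1)
        target.mkdir(exist_ok=True)
        shutil.move(str(path), str(target / path.name))
        moved += 1
    return moved
```

With this layout no single directory holds all ~29777 files, and stray files such as .DS_Store stay visible at the top level where they are easy to spot.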
Turns out 20040 out of the whole 29776 are not actually compressed.
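One way to arrive at such a count is to check each file for the two-byte gzip magic number (0x1f 0x8b). The sketch below assumes the "uncompressed" files are plain data that merely carry a .gz name; it would not flag gzip containers that use stored (level 0) blocks. The helper names are made up for illustration.

```python
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"


def is_really_gzip(path) -> bool:
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def find_fake_gz(directory):
    """Return the sorted *.gz files in `directory` that are not gzip data."""
    return sorted(
        p for p in Path(directory).glob("*.gz") if not is_really_gzip(p)
    )
```

Calling len(find_fake_gz(some_dir)) on the unpacked archive gives the number of affected files.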
After uncompressing them all, the result actually only takes 3.0G on my disk according to du -h.
Recompressing with gzip -n produces a directory of size 1.3G (1,315,672 k -- though this will depend on the filesystem).
I'll try recompressing all files using zopfli (in gzip mode, so that it can still be read by GAP) which should save even more space, but is slow, so it'll take some time.
With zopfli it goes down to ~1.0G
Hi Max, thanks for the suggested changes. I have accepted most of them already. There is one more that I need to address but it will require me to make some modifications. I'll do so shortly. Also, the files are supposed to be compressed... So I will make sure they are compressed and update the Zenodo repository once I have done so.
I have made sure the files are properly compressed (using zopfli) and updated the Zenodo archive. It is now just over 1GB. I have addressed each of the other suggested edits. Is everything ok now?
Merging #52 (699951e) into master (746f67a) will decrease coverage by 0.01%. The diff coverage is 90.18%.
Thanks @jesselansdown I was busy elsewhere and forgot about this PR. It looks fine now!
In the long run I'd like to transition those data files to a new file format that does not depend on parsing GAP code (i.e. so that other systems could import it). Perhaps it would even be possible to find a common format for old and new data. There are more things I'd like to do (making the data available with a finer granularity so that only the parts which are needed can be fetched online on demand; perhaps even integrating this into a "real" DB, a website, etc.)
All of that is not meant to block this PR, just saying what I have in mind for future work.
No worries, glad everything is ok now! My main concern was to make the data available and compatible with the current library so that people can begin to use it already, but I agree that the data format could be improved in the future.
The primitive groups have been made available on Zenodo, and additional properties have been computed to make them compatible with PrimGrp. The ability to load these groups has been added to PrimGrp, and the other functions have been modified as needed to accommodate the new groups.