Index files: use FITS-standard byte order and table field formats

cgobat commented 9 months ago

The FITS standard mandates the use of big-endian (i.e., MSB-first) byte ordering across the board (see §5). The data in Astrometry.net index files are currently stored in little-endian format, with the FITS data type simply left as naïve byte strings. This is not only unintuitive in light of the FITS standard; it also adds an unnecessary abstraction layer that requires extra machine-dependent instructions/documentation, as well as an additional pre-processing step (regardless of one's computer architecture) to convert the bytes into numeric data prior to use. Nor can this conversion be done entirely programmatically: the only place the actual data types are described is in the header COMMENTs, so FITS reader software cannot know a priori how to interpret the data without a human setting each type manually.

All of the aforementioned issues can be resolved simply by using the existing FITS binary table data type/structure definition keywords to present the data in a FITS-native way. For instance, rather than leaving the TFORM1 parameter for the quads HDU as 16A (i.e., 16-byte strings/blobs), setting it to 4J (i.e., sets of four 32-bit integers) and swapping the byte order to be FITS-compliant allows FITS I/O programs to read the numeric array directly, and also represents the intent/significance of the data more faithfully. The same principle can be applied to all of the other HDUs in each file.
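As a rough illustration of the conversion described above, here is a minimal NumPy-only sketch (no FITS library involved) that reinterprets packed 16-byte little-endian records as rows of four 32-bit integers and re-encodes them big-endian, the way a 4J field would store them. The helper name `blobs_to_4j` and the sample values are made up for illustration, and it assumes the quad entries are signed 32-bit integers, as the proposed 4J format implies.

```python
import numpy as np


def blobs_to_4j(raw: bytes) -> np.ndarray:
    """Reinterpret packed 16-byte little-endian records as big-endian int32 rows."""
    # '<i4' = little-endian 32-bit signed integers (the assumed current layout).
    little = np.frombuffer(raw, dtype="<i4").reshape(-1, 4)
    # '>i4' = big-endian, the byte order FITS requires for a 'J' field.
    return little.astype(">i4")


# Two hypothetical quads, packed the way a 16A column would hold them.
quads = np.array([[1, 2, 3, 4], [10, 11, 12, 13]], dtype="<i4")
raw = quads.tobytes()

converted = blobs_to_4j(raw)
print(converted.tolist())               # values are unchanged
print(raw[:4].hex())                    # first integer, little-endian bytes
print(converted.tobytes()[:4].hex())    # same integer, big-endian bytes
```

The values are identical before and after; only the on-disk byte layout and the declared type differ, which is what lets a generic FITS reader interpret the column without consulting the header COMMENTs.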

See the attached index-4210-modified.fits.gz for an example of this reformatting. Below is a table summary of the updated HDUs contained therein. I've also added EXTNAMEs for easier identification.

#	EXTNAME	Type	Cards	NAXIS2	TFORM1
0	PRIMARY	PrimaryHDU	105
1	QUADS	BinTableHDU	16	580800	4J
2	KD_HEADER_CODES	BinTableHDU	114	0	0A
3	KD_LR_CODES	BinTableHDU	20	32768	J
4	KD_SPLIT_CODES	BinTableHDU	25	32767	I
5	KD_RANGE_CODES	BinTableHDU	32	9	D
6	KD_DATA_CODES	BinTableHDU	19	580800	4I
7	KD_HEADER_STARS	BinTableHDU	91	0	0A
8	KD_LR_STARS	BinTableHDU	20	16384	J
9	KD_SPLIT_STARS	BinTableHDU	25	16383	J
10	KD_RANGE_STARS	BinTableHDU	31	7	D
11	KD_DATA_STARS	BinTableHDU	19	363000	3J
12	SWEEP	BinTableHDU	14	363000	1B
13	J_MAG	BinTableHDU	15	363000	1E

Is there any reason to keep them as-is, as opposed to using FITS-compliant byte order and making them "self-aware" of their own data types?

dstndstn commented 9 months ago

Yes, there is a good reason for this. These files are loaded using the mmap() system call, which maps the contents of the file directly into memory, and together the HDUs hold multiple live KD-tree data structures (codes and stars). Byte-swapping the contents as FITS demands would mean that all the contents have to be byte-swapped upon reading -- pretty painful to implement.
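The trade-off can be sketched in Python. This is only an analogy, not Astrometry.net's actual C code: `map_file` is a hypothetical helper and the eight integers are a stand-in for a KD-tree array. It maps a file with `mmap` and checks whether the mapped bytes can be used in place or must be copied into a byte-swapped array first.

```python
import mmap
import os
import sys
import tempfile

import numpy as np


def map_file(payload: bytes, dtype: str):
    """Write payload to a temp file, mmap it, and report (values, used_in_place)."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        with open(path, "rb") as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            view = np.frombuffer(mm, dtype=dtype)       # no copy: backed by the mapping
            native = view.astype(np.int32, copy=False)  # copies only if a swap is needed
            result = (native.tolist(), bool(np.shares_memory(view, native)))
            del view, native                            # release the buffer before closing
            return result
    finally:
        os.remove(path)


values = np.arange(1, 9, dtype=np.int32)  # hypothetical stand-in for tree data

native_order = "<" if sys.byteorder == "little" else ">"
swapped_order = ">" if native_order == "<" else "<"

same_vals, same_shared = map_file(
    values.astype(native_order + "i4").tobytes(), native_order + "i4")
swap_vals, swap_shared = map_file(
    values.astype(swapped_order + "i4").tobytes(), swapped_order + "i4")

print(same_shared)  # native-order file: the mapping is usable as-is
print(swap_shared)  # opposite-order file: a byte-swapped copy was required
```

On a little-endian machine the second case corresponds to a FITS-compliant big-endian file: the data can still be read, but every array has to be copied and swapped before use, which gives up the benefit of mapping the file directly.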

cgobat commented 9 months ago

Okay, fair enough! I'm happy to be wrong as long as there's a good reason for it. 😄

Thanks for the response.

dstndstn / astrometry.net

Index files: use FITS-standard byte order and table field formats #284