BioStatMatt / sas7bdat

A reverse engineering of the sas7bdat database file format

Notes on the Row Size subheader #14

Open evanmiller opened 3 years ago

evanmiller commented 3 years ago

I'll open a PR on the RST file if I have time, but I'd like to quickly share a discovery about the Row Size subheader that should make everyone's life easier when detecting compressed files and pulling out the Creator strings.

Bytes 344|672 through 380|708 (offsets for 32-bit|64-bit files, end offset exclusive) consist of six 6-byte text references into Column Text! They have the same structure as the Column Name pointers, but are unpadded: 2 bytes for the index, 2 bytes for the offset, and 2 bytes for the length.

Specifically:

Bytes 350|678 through 356|684: Text reference (index, offset, length) to the Creator Software string

Bytes 362|690 through 368|696: Text reference (index, offset, length) to the Compression string ("SASYZCRL" or "SASYZCR2")

Bytes 374|702 through 380|708: Text reference (index, offset, length) to the Creator PROC step name

This should help get rid of the awkward heuristics around detecting data before the column names begin, since now we have exact offsets for these strings. This also helps explain why SASYZCRL appears where it does. (If the Compression string has an offset/length of 0, it means that the file is uncompressed.)
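For anyone who wants to try this before the write-up lands, here is a minimal Python sketch of reading the three references listed above. It is not the ReadStat code. The names `TextRef`, `REF_OFFSETS`, `read_text_ref`, `resolve` and `is_compressed` are made up for illustration. It assumes you already have the raw bytes of the Row Size subheader, know the file's byte order and whether it is a 64-bit file, and have collected the Column Text blobs in a list such that each reference's offset is relative to the start of its blob. Treating a zero length as "uncompressed" is my reading of the offset/length rule above.

```python
import struct
from typing import NamedTuple, Sequence


class TextRef(NamedTuple):
    index: int   # which Column Text blob
    offset: int  # where the string starts inside that blob
    length: int  # string length in bytes


# Start offsets (32-bit file, 64-bit file), taken from the notes above.
REF_OFFSETS = {
    "creator_software": (350, 678),
    "compression": (362, 690),
    "creator_proc": (374, 702),
}


def read_text_ref(subheader: bytes, name: str, u64: bool,
                  little_endian: bool = True) -> TextRef:
    """Unpack one unpadded 6-byte reference: three 2-byte unsigned integers."""
    start = REF_OFFSETS[name][1 if u64 else 0]
    fmt = ("<" if little_endian else ">") + "HHH"
    return TextRef(*struct.unpack_from(fmt, subheader, start))


def resolve(ref: TextRef, blobs: Sequence[bytes]) -> bytes:
    """Return the referenced bytes, or b"" for an empty reference.

    Assumption: ref.offset is measured from the start of blobs[ref.index].
    """
    if ref.length == 0:
        return b""
    return blobs[ref.index][ref.offset:ref.offset + ref.length]


def is_compressed(subheader: bytes, u64: bool,
                  little_endian: bool = True) -> bool:
    """Assumption: a zero-length Compression reference means uncompressed."""
    ref = read_text_ref(subheader, "compression", u64, little_endian)
    return ref.length != 0
```

With something like this, the compression check becomes `resolve(read_text_ref(sub, "compression", u64), blobs)` compared against `b"SASYZCRL"` or `b"SASYZCR2"`, with no scanning of the text before the column names.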

I've implemented this logic in ReadStat, and it allowed me to rip out several lines of code. So far it seems to work well with test files.

As I said, I will try to get around to writing this up more formally, but in the meantime I wanted others to benefit from this small bit of knowledge.

BioStatMatt commented 3 years ago

Excellent. Thank you.