JanMarvin / readsas

Read the SAS file formats
https://janmarvin.github.io/readsas
GNU General Public License v2.0

identify deleted rows #11

Closed JanMarvin closed 2 years ago

JanMarvin commented 5 years ago

Certain sas7bdat files contain rows that the SAS user deleted before the file was written. These rows are usually in the middle or at the end of the file. Presumably SAS avoids removing and repositioning data in the output file and instead simply flags the rows to be ignored.

Right now readsas imports these rows, so the resulting dataset might differ from what SAS shows. The information about which row(s) to ignore is assumed to be at the end of case 1.

JanMarvin commented 4 years ago

Research indicates that the information is stored on a page of type 640, after the data information. In a synthetic test file (similar to data.frame(x = 1:3)) the following values were found. The comment after the file name is the value found; the comment after the function call is the SAS operation. The test files were created with an x64 SAS. If the hex value is changed, the SAS output changes. E.g., change 0x40 to 0x80 in test2 and the result will match test4 (0x80 = 128). Not sure how the values are constructed.

fl <- "../sas7bdat/test2.sas7bdat" # 64
dd <- read.sas(fl, FALSE) # delete x = 2

fl <- "../sas7bdat/test3.sas7bdat" # 96
dd <- read.sas(fl, FALSE) # delete x > 1

fl <- "../sas7bdat/test4.sas7bdat" # 128
dd <- read.sas(fl, FALSE) # delete x = 1

fl <- "../sas7bdat/test5.sas7bdat" # 192
dd <- read.sas(fl, FALSE) # delete x < 3
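
One reading that fits all four values above is a bitmask with the most significant bit first: 0x80 flags row 1, 0x40 row 2, 0x20 row 3. This is only an observation from these four three-row test files, not a confirmed description of the format. The helper name deleted_rows and the bit order are assumptions made for illustration.

```r
# Hypothetical helper (not part of readsas): decode one byte as an
# MSB-first bitmask of deleted rows.
# Assumption: bit 7 (0x80) flags row 1, bit 6 (0x40) row 2, bit 5 (0x20) row 3.
deleted_rows <- function(byte, n_rows = 3L) {
  stopifnot(n_rows >= 1L, n_rows <= 8L)            # a single byte covers at most 8 rows
  weights <- as.integer(2^(8L - seq_len(n_rows)))  # 128, 64, 32, ...
  which(bitwAnd(as.integer(byte), weights) != 0L)  # row numbers whose bit is set
}

# Values observed in the synthetic test files
observed <- c(test2 = 64L, test3 = 96L, test4 = 128L, test5 = 192L)
lapply(observed, deleted_rows)
```

Under this assumption test2 gives row 2, test3 rows 2 and 3, test4 row 1, and test5 rows 1 and 2, which agrees with the delete statements listed for each file. Whether this holds for files with more than three rows is untested here.
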
JanMarvin commented 4 years ago

The value appears to be a double. Either it is the index of the row to be deleted (counting from 0; e.g., 2), or it is negative and indicates the number of rows to be deleted from the top (e.g., -0 or -2)?
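
To check the double interpretation against the test files, the sketch below decodes each observed byte as the most significant byte of a little-endian 8-byte double whose other bytes are zero. That byte layout is an assumption, and top_byte_as_double is a hypothetical helper, not a readsas function.

```r
# Hypothetical helper: place `byte` as the most significant byte of a
# little-endian 8-byte double (all other bytes assumed zero) and decode it.
top_byte_as_double <- function(byte) {
  raw_bytes <- as.raw(c(rep(0, 7), byte))
  readBin(raw_bytes, what = "double", n = 1L, size = 8L, endian = "little")
}

# Observed values: 64 (test2), 96 (test3), 128 (test4), 192 (test5)
vals <- vapply(c(64, 96, 128, 192), top_byte_as_double, numeric(1))
vals          # 2, 2^513, negative zero, -2
1 / vals[3]   # -Inf, which shows that the third value is -0 rather than +0
```

With this layout 64, 128 and 192 decode to 2, -0 and -2, but 96 (test3) decodes to 2^513, which does not look like a row number or a row count.
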

JanMarvin commented 4 years ago

Found another page type, PAGE_TYPE 384. Here the last double of the page seems to indicate which of the rows has to be removed. The position of the double on PAGE_TYPE 640 is still unknown.