JanMarvin / readsas

Read the SAS file formats
https://janmarvin.github.io/readsas
GNU General Public License v2.0
3 stars 0 forks source link

Bug in read.sas #39

Closed vpprasanth closed 1 year ago

vpprasanth commented 1 year ago

Trust this email finds you well. By the way, thanks a ton for you and your team's relentless work to empower the R community and for all the contributory packages. No doubt, we all are indebted to you.

Having said so, I would like to draw your attention to fixing a bug (I presume so) in readsas package (I have seen the same issue in the haven package as well) while importing sas7bdat files into the R environment. That is, if the sas7bdat file contains a date variable and is holding a value 07/07/7777 (say). Then if we import the same into R, it reads it as 7777-07-06. I could understand the change in format. However, I am a bit baffled with the change in value over here.

It would be nice if we can have an option to read all the variables (especially the date field) as characters or "as it is", than changing the class by default.

Your thoughts and inputs are highly appreciated.

Much Obliged, Prasanth.

JanMarvin commented 1 year ago

Hi @vpprasanth , thanks for your kind words. Indeed the conversion returns 06Jul7777:

readsas::convert_to_date(2124803)

Maybe SAS or R are not aware of a leap year in the upcoming 5000 years. Hopefully, unless you are a science fiction author, this does not matter. Thanks for raising awareness, but I don't think that there's anything wrong in our code.

JanMarvin commented 1 year ago

Had a moment, to dig into this. In SAS the year 4000 is no leap year, in R it is.

readsas::convert_to_date(745154). According to the internet R is correct. Though to completely replicate SAS, you can add a day after February 4000.

JanMarvin commented 1 year ago

You can read as is with read.sas(..., convert_dates = FALSE). This way the underlying numeric values are returned.

vpprasanth commented 1 year ago

Hello Marvin,

Thanks much for looking into it and pondering your thoughts. I see your point on the leap-year. However, it doesn't matter whether it is a leap year or not, the idea is to read the SAS data that contains the value 07/07/7777 into R precisely.

By the way, in clinical trials, these types of data (07/07/7777) are used to characterize "not applicable". similarly, we make use of 09/09/9999 to refer it as a missing value in the date field.

JanMarvin commented 1 year ago

Hi @vpprasanth , I'm not really following, but that migh be because of my lack of sleep, it's warm here :hot_face: . The issue is that the SAS calendar is missing a day.

data tmp;
  format time date9.;
  time = "29Feb4000"d; * breaks, SAS does not know this day;
run;

So whenever we are picking any day after February 4000, we miss this day. Days are numeric values, starting 1st January of 1960 at 0, 2nd January of 1960 is 1.

> readsas::convert_to_date(0)
[1] "1960-01-01"
> readsas::convert_to_date(1)
[1] "1960-01-02"

Now at the end of February 4000:

> readsas::convert_to_date(745153:745155)
[1] "4000-02-28" "4000-02-29" "4000-03-01"

And in SAS the leap year is ignored and the same results in "28Feb4000"d, "01Mar4000"d, "02Mar4000"d. So from here on all dates are one off. But I guess it wont do any harm, if we do a silent conversion and drop 29Feb4000 too. I'll see if the same is true for datetime and will add a fix. In the meantime you could simply apply something like this

x <- readsas::convert_to_date(745153:745155)
x[x > "4000-02-28"] <- x[x > "4000-02-28"] + 1
vpprasanth commented 1 year ago

Hello Marvin,

Again it’s not the issue with the leap-year. Imagine the following table written in SAS. Now, if we import the same table in R, we are not getting the exact value as 07-07-7777

Sub_ID | Date | BMI 101 | 23-12-2022 | 22 102 | 12-07-2021 | 23 103 | 07-07-7777 | 18 104 | 10-08-2022 | 7777 105 | 07-07-7777 | 7777 106 | 07-07-7777 | 21 107 | 22-04-2022 | 20 108 | 17-02-2021 | 7777 109 | 09-09-9999 | 19 110 | 09-09-9999 | 9999

JanMarvin commented 1 year ago

Cloud you reformat this?

Year 7777 and 9999 are after 4000. In SAS the year 4000 is missing a day. Therefore, if we create the date information from an always increasing integer sequence, every year after is missing a day. 7777 is missing a day and because of this you see in R 7777-07-06. SAS assumes that 2124803 is the correct numerical value of 07Jul7777, but unfortunately SAS is wrong. The correct value would be 2124803 + 1 (+1 is the missing leap year in 4000).

The R code is right, but since it does not know that SAS is missing a day, the result is one off for every day after February 4000. It's a simple fix to let our results match those of SAS, but it is SASs bug, not our. But obviously you are right, what good is a file reader, if it does not handle the quirks of its input :smile:

JanMarvin commented 1 year ago

A fix is in main now. Thanks!

vpprasanth commented 1 year ago

bug.zip

Please find attached the zip file that contains the following: a. txt_data b. sas_data

Here is the SAS code used for generating the sas7bdat file (sas_data) from the txt_data.

data sas_data; informat Sub_ID 5. Date ddmmyy10. BMI 5.; format Date ddmmyy10.; infile "/home/u63128400/txt_data.txt" missover; input Sub_ID Date ddmmyy10. BMI ; run;

libname out "/home/u63128400/"; data out.sas_data; set sas_data; run;

Now, if you open the sas_data (the sas7bdat file) in SAS, you could see the date values as 07/07/7777. However, if you open the same sas_data (the sas7bdat file) in R using haven, you will see 7777-07-06. This is a mismatch.

JanMarvin commented 1 year ago

Thanks for the file, it is fixed in main. Added year 8000 leap day fix:

remotes::install_github("JanMarvin/readsas")
#> Downloading GitHub repo JanMarvin/readsas@HEAD
#> 
#> ── R CMD build ─────────────────────────────────────────────────────────────────
#> * checking for file ‘/tmp/RtmpKHbaff/remotes83911e9e4e0a/JanMarvin-readsas-adf884d/DESCRIPTION’ ... OK
#> * preparing ‘readsas’:
#> * checking DESCRIPTION meta-information ... OK
#> * cleaning src
#> * checking for LF line-endings in source and make files and shell scripts
#> * checking for empty or unneeded directories
#> * building ‘readsas_0.5.tar.gz’
#> Installing package into '/home/jmg/R/x86_64-pc-linux-gnu-library/4.3'
#> (as 'lib' is unspecified)
library(readsas)
readsas::read.sas("/tmp/sas_data.sas7bdat")
#>    Sub_ID       Date  BMI
#> 1     101 2022-12-23   22
#> 2     102 2021-07-12   23
#> 3     103 7777-07-07   18
#> 4     104 2022-08-10 7777
#> 5     105 7777-07-07 7777
#> 6     106 7777-07-07   21
#> 7     107 2022-04-22   20
#> 8     108 2021-02-17 7777
#> 9     109 9999-09-09   19
#> 10    110 9999-09-09 9999
vpprasanth commented 1 year ago

Thanks much for the quick fix...

Much Appreciated, Prasanth.

vpprasanth commented 1 year ago

By the way, SAS goes wrong with the leap year and it seems that it's an existing problem...!!! https://blogs.sas.com/content/sasdummy/2010/04/05/in-the-year-9999/

JanMarvin commented 1 year ago

That's what I was trying to tell you 😉

vpprasanth commented 1 year ago

Oops…! My bad… couldn’t follow you properly… Anyway, thanks much for all the help…👍