cdisc-org / DataExchange-DatasetJson

DatasetJson for ODM V2
MIT License
15 stars 3 forks source link

R Date Conversion in Dataset-JSON #15

Closed cpiraux closed 3 months ago

cpiraux commented 1 year ago

I recently worked on the latest version of the xportr R package, which is designed to convert R datasets to xpt format. During this process, we encountered difficulties converting R dates to SAS date format. (Dates are not correctly derived (SAS vs R) · Issue #142). We might encounter the same issues when converting to Dataset-JSON.

The challenges we faced with the conversion process included:

As a result of the R to SAS conversion, we encountered issues such as date variables appearing 10 years earlier than expected and variables not being recognized as dates.

Currently in the dataset-JSON specification, it is not clear for me that we should follow the SAS timestamp for numeric date variables. I also think if we don’t know that variable ending with --DT, --DTM and --TM are date(time) numeric variables and if “displayFormat” is missing (R does not have date format, in xportr package we add the ones from SAS), it is difficult to know that the variable is a numeric date.

image

I came across an informative webpage on JSON date formats that I found interesting: JSON date format: 3 ways to work with dates in JSON. Considering the issues we faced, I suggest utilizing the ISO 8601 Date string format for handling dates in dataset-JSON with a "type" attribute with values like 'numeric date'. It would be easier to convert dates from any programming language and reduce the risk of errors.

I would appreciate your thoughts on numeric date in Dataset-JSON, and how it would be possible to facilitate the conversion of dates between R and Dataset-JSON?

DmitryMK commented 1 year ago

Hi Celine, these are really good question.

There is a general goal to make Dataset-JSON a universal format and one of the requirements for it is to make it agnostic to a programming languages, so that in can be used for SAS/R/Python or any other programming language. Although I admit that displayFormat can be seen as SAS legacy.

I see your point and logic in using ISO 8601 Date string format for numeric dates and it still follows the "language agnostic" rule, but I'm not sure we want to change it at this stage. And I see it more a metadata task, rather than data-representation task,

One of the goals was to be able to review the data without Define-XML, which is why we have labels/types and other "metadata" attributes in it. But it was never an intention to have absolutely everything in Dataset-JSON. We intentionally did not add the date (which corresponds to character date) type from Define-XML, because Dataset-JSON is focused on data and not metadata and it is not intended to replace Define-XML.

Per ADaM-IG numeric date variables should end with --DT/--DTM/--TM, so I would expect that for the vast majority of cases it should be followed. And if we want to have a numeric date variable with other suffix, I would assume that it should be metadata (define.xml) where this information will be specified. As a matter of fact, there is no way it can be specified in Define-XML now, so it would be a good comment for the future Define-XML version.

Regarding epoch it is actually an interesting fact, which I never though about before. Looking into CDISC standards, I could not find it specified anywhere what should be used as an epoch for numeric dates. De facto the industry is using 1/1/1960, because mostly industry used SAS for the submissions, I believe that this should be specified somewhere in the standards or we can have a metadata field explaining what is used as an epoch for numeric dates in this submission.

I would suggest, that by default the following assumption can be used when reading Dataset-JSON:

I will raise this topic within the ODM team, to see if there are other thought about it.

mstackhouse commented 1 year ago

@cpiraux in the current version of the datasetjson R package we handle this conversion by inferring dates and datetimes from the display format, and then handle the conversion. The code we used for this is here: https://github.com/atorus-research/datasetjson/blob/3e7f4bb17ad0215b72fa4c87bc28017977499dbe/R/date_time_conversions.R#L43

It was discussed at PHUSE CSS that it's likely best to update the standard to something unambiguous, such as character ISO8601 dates/datetimes.

cpiraux commented 1 year ago

Thank you, @mstackhouse , for sharing the link to the dataset JSON package. I'm considering trying out the package for the Dataset-JSON Pilot.

I'm pleased to hear that date conversion has been a hot topic of discussion at the PHUSE CSS and that the standard will be updated to something more unambiguous, such as character ISO8601 dates/datetimes.

With these improvements in mind, it seems we can consider closing the issue.

lexjansen commented 3 months ago

@cpiraux Since we are addressing this issue in Dataset-JSON v1.1, I suggest closing this issue.

cpiraux commented 3 months ago

Yes, sure. Thank you for addressing this issue. I will go ahead and close it now.