dgkf / parttime

Work-in-progress R package for handling partial datetimes
https://dgkf.github.io/parttime
Other
17 stars 1 forks source link

Handling for missing-in-the-middle datetimes #9

Closed dgkf closed 2 years ago

dgkf commented 2 years ago

As mentioned in https://github.com/pharmaverse/admiral/issues/1315, there are situations where dates might have "missing-in-the-middle" missingness - that is, a coarse datetime field is missing, but some of the finer fields are still defined (eg "2022-NA-04").

I'll try my best to articulate what I think would be expected behaviors, but I would appreciate any input if there's another way to look at things (@billdenney - your input would be especially valued here).

Comparison Operators

2022-NA-04 > 2022-05-04 NA (month could be gt or lt)
2022-NA-04 > 2022-01-01 TRUE (even minimum imputed month cant be lt)
2022-NA-04 > 2022-12-31 FALSE (even maximum imputed month cant be gt)

Subtraction (creating partial timespans)

Just doing some rough back-of-the-envelope examples to draft behaviors - exact math might be a bit off and doesn't account for timezones or leap-anything.

The rough process looks like this:

  1. impute minimum and maximum possible datetimes for each component
  2. timespan is from $lhs{min} - rhs{max}$ to $rhs{max} - lhs{min}$ (it's a little more involved when the parttimes overlap, but this is a good starting point)
  3. Display as the average of the upper and lower bounds of the timespan with an uncertainty of half the range ("offset +/- uncert")
min max offset +/- uncert
2022-NA-04 - 2021-01-01 1year 3days 1year 331days 1year 117days +/- 114days
2022-NA-01 - 2022-NA-04 -337days 331days 3days +/- 334days

min/max

I think this one is straightforward - just use largest/smallest imputed possible times.

max(2022-NA-04, 2022-07-01) 2022-12-04
max(2022-NA-05, 2022-12-31) 2022-12-31
billdenney commented 2 years ago

Thanks for thinking through these. You've thought of most of the cases that I've thought of, and I agree with your methods. One other consideration is how do you add to these times?

As an example, what is 2022-NA-04 plus 4 days? That would be pretty clearly 2022-NA-08.

But, what is 2022-NA-28 plus 1 day? That would depend on the month. In February, it would be March. In any other month, it would be 2022-NA-29. But it would also have a clear effect on the minimum and maximum since those would now clearly be 2022-01-29 and 2022-12-29.

My best guess is that addition is not supported or that it would need very specific requirements. I brought it up as something that I'd thought of before and wanted to be sure it was considered, at least.

dgkf commented 2 years ago

But, what is 2022-NA-28 plus 1 day?

Thanks for highlighting this. It's definitely non-trivial and gets even further complicated by leap years/seconds.

At least as it is now, if a value is imputed with an invalid date it gets squished to the valid range.

> impute_time(as.parttime("2022--29"), "2022-02-02")
<partial_time<YMDhms+tz>[1]>
[1] "2022-02-28"

I haven't ended up in this sort of situation myself, and my suspicion is that there could end up being quite a few competing "correct" behaviors in this space.

Without knowing more about all the ways this might affect an analysis, I'd lean toward leaving this as undefined behavior and suggest that a user first separate the datetime components themselves and apply use-case-specific logic. But I could certainly add a utility function to check whether any datetimes in a vector fall into this category where they have missingness in the middle.

billdenney commented 2 years ago

Does parttime support leap seconds accurately? (By "accurately", I mean only allowing for real leap seconds? And also only allowing them in time standards that support them which I think means limiting them to TAI time?) I ask because don't think that CDISC supports leap seconds.

dgkf commented 2 years ago

Does parttime support leap seconds accurately?

@billdenney sorry - no it doesn't! I was just throwing that out as an example, but it's currently totally ignored.

When I wrote this, I was using lubridate to handle leap-anything, which only handles years. Now we have the tzdb package, which gives us access to a tz's leap-seconds database, so it would be possible (but perhaps unnecessary for this use case).

billdenney commented 2 years ago

For leap seconds, I think it would be preferred to ignore them in clinical settings because people would be surprised by 86401 second-long days and "60" in the seconds place would be an error.

dgkf commented 2 years ago

@billdenney - Thanks for the heads-up. I think it would be a nice-to-have, but definitely isn't a priority, and if I ever do go down that path I'll be sure to expose some mechanism of disabling it entirely.

dgkf commented 2 years ago

Just a status update. Missing-in-the-middle will now parse into a parttime, but I haven't rigorously tested how they are handled by any of the functions or operators that can apply to parttimes. For now, parsing a missing-in-the-middle time emits a warning that these types of dates are experimental.

billdenney commented 2 years ago

Great! What would be most helpful for testing?

Things that occur to me are: Add test cases, define rules for operation (which you seem to have above), any other things?

My opinions are that missing-in-the-middle date-times should behave similarly to truncated partial times. Overall, it hangs everything on the min() and max() functions and then moves to normal date and date-time handling with the exception of equality.

Comparison operators

Arithmetic operators

dgkf commented 2 years ago

Finally getting around to taking a crack at this.

Generally I agree with the approach you laid out. There are just a few points that I would implement differently.

Equality (==) would generally be FALSE, unless the values and missingness were identical.

Like NA == NA returns NA, even if the missingness is identical, equality between two parttimes with missigness should be NA. If parttimes represent a range of possible values, then they aren't guaranteed to be equal. This can be coerced to TRUE using possibly().

min() and max() would be fully-defined date-times

I think we're saying the same thing and I'm reading min and max here as pseudocode (I started this, so I think this point of confusion is my doing), but just to put a fine point on it, min and max the R functions do not perform imputation. Those are handled via impute_time_min/_max. The implementation for min/max should mirror the logic of min/max(<numeric>) and retain the missingness.

Otherwise I think everything looks good. I'm writing up tests and will try to identify any gaps that need to be addressed to make sure this is supported.

dgkf commented 2 years ago

Going to mark this as done, since the remaining features for difftimes aren't even supported for parttimes yet. I'll kick off a separate issue that is specifically about these cases.