apjanke / octave-tablicious

Table (relational, tabular data) implementation for GNU Octave
https://apjanke.github.io/octave-tablicious/
GNU General Public License v3.0
28 stars 11 forks source link

calendarDuration colon operator bad results: over hi bound, wrong month-end normalization, etc #135

Open jgpallero opened 3 weeks ago

jgpallero commented 3 weeks ago

See this example:

>> fecha1 = datetime([2024 2 28 21 0 15]);
>> fecha2 = datetime([2024 3 2 21 0 15]);
>> vec2 = fecha1:2:fecha2
vec2 =
28-Feb-2024 21:00:15   01-Mar-2024 21:00:15
>> incd = days(2);
>> vec2 = fecha1:incd:fecha2
vec2 =
28-Feb-2024 21:00:15   01-Mar-2024 21:00:15

When I create a sequential vector of dates using the colon operator all works apparently well as the vector stops in an element lower than the last date.

But when I use years the result is not correct:

>> fecha1 = datetime([2020 2 29 21 0 15]);
>> fecha2 = datetime([2025 1 1 0 0 0]);
>> incy = calyears(2);
>> vec1 = fecha1:incy:fecha2
vec1 =
29-Feb-2020 21:00:15   01-Mar-2022 21:00:15   01-Mar-2024 21:00:15   01-Mar-2026 21:00:15

in Matlab the result is

>> fecha1 = datetime([2020 2 29 21 0 15]);
>> fecha2 = datetime([2025 1 1 0 0 0]);
>> incy = calyears(2);
>> vec1 = fecha1:incy:fecha2
vec1 = 
  1×3 datetime array
   29-Feb-2020 21:00:15   28-Feb-2022 21:00:15   29-Feb-2024 21:00:15

the last element is always <= the greater limit. The bug is also present using calmonths()

I don't know if this bug is related with #134


Andrew's notes

Concise repro code:

datetime([2020 2 29 21 0 15]):calyears(2):datetime([2025 1 1 0 0 0])

TODO

apjanke commented 3 weeks ago

Interesting. This looks like it's somewhat about how Leap Day is handled when adding a calendar-year increment.

>> fecha1 = datetime([2020 2 29 21 0 15]);
>> fecha2 = datetime([2025 1 1 0 0 0]);
>> incy = calyears(2);
>> vec1 = fecha1:incy:fecha2
vec1 =
29-Feb-2020 21:00:15   01-Mar-2022 21:00:15   01-Mar-2024 21:00:15   01-Mar-2026 21:00:15

That last vec1(4) of 2026-03-01, higher than fecha2's 2025-01-1, is just wrong. I bet there's an off-by-one bug in the end condition there.

As for the other differences: 2020 was a leap year, so Feb 29 2020 existed. But there's no Feb 29 in 2022. My code is normalizing it up to March 1, like what happens when you pass "denormalized" out-of-range values to a datetime() constructor call.

>> datetime(2020, 2, 29)
ans = 
  datetime
   29-Feb-2020

But it looks like when adding a calyear (or calmonth), Matlab normalizes the nonexistent Feb 29 leap day down to Feb 28 instead, on non-leap years.

I wonder if there's documentation or reference for this behavior. The Matlab doco page for `calyears says "Calendar years account for leap days when used in calendar calculations", but it doesn't say how it accounts for them. (Same for the calmonths and calendarDuration pages, to my eyes.)

Matlab datetimes and datenums are ISO calendar dates, I believe. I wonder if the ISO date/time standards specify how calendar interval arithmetic works, and how to handle leap years in them.

The other aspect of this seems to be how calendar intervals work when used as the middle operand in a "lo:d:hi" two-colon expression. It is surprising to me that vec1(3) in your Matlab example is Feb 29. I would have thought it would be Feb 28, even though 2024 is a leap year. That's because I thought that the "lo:d:hi" two-colon operator generally worked like repeatedly adding d to a working value initialized with lo, copying successive results into elements of the output, until the next value would be larger than hi. Like this:

% y = lo:d:hi - how I thought it worked (approximately)
y = repmat(lo, [0 0]);
tmp = lo;
while tmp <= hi
  y(end+1) = tmp;
  tmp = tmp + d;
end

In this case, once tmp is normalized to a Feb 28 (or March 1) for a non-leap year, I'd expect it to never turn back in to Feb 29, because adding a d of 1 calyear to a Feb 28 will always produce a Feb 28.

But instead, it looks like it's doing y(s) = lo + (d * s); ("s" for "step" here). I looked at Matlab's colon doco, and it seems that's maybe the right way.

x = j:i:k creates a regularly-spaced vector x using i as the increment between elements. The vector elements are roughly equal to [j,j+i,j+2i,...,j+mi] where m = fix((k-j)/i).

That sounds more like y(i) = lo + (s * d) than repeatedly doing tmp = tmp + d. (It doesn't result in a "regularly-spaced" vector in the case of datetime and a calyear/calmonth i, but close enough.)

(Octave's documentation for colon isn't as detailed and doesn't seem to provide any guidance here.)

Fixing the upper bound bug is straightforward. I'll go ahead and do that.

I want to do a bit more reading to better understand the definition/specification here around the leap day normalization and "stride". (If you have any relevant links or books, I'd be happy to hear them.) But it sounds like Tablicious here needs two changes for Matlab compatibility: have datetime + calyears normalize down instead of up for nonexistent Feb 29s, and change the colon operator to do y(s) = s * d instead of repeatedly adding d to a working value.

apjanke commented 3 weeks ago

BTW...

I don't know if this bug is related with https://github.com/apjanke/octave-tablicious/issues/134

Seems unrelated. #134 is about string parsing, and the "colon arguments" in its error message is just about using a colon with numerics like i = 1:size(x); it's not related to how datetime and calendarDuration implement colon support.

jgpallero commented 3 weeks ago

About the differences with Matlab in the leap day topic, I think the better option is to match the behavior of Matlab. Some developers have said sometimes that Matlab-compatibility is a goal (see, for example, the comment of Philip Nienhuis (comment number 9) in https://savannah.gnu.org/bugs/index.php?60107)

apjanke commented 3 weeks ago

About the differences with Matlab in the leap day topic, I think the better option is to match the behavior of Matlab.

Yeah, Matlab compatibility is a main goal for Tablicious too, and that's almost certainly the right thing to do here. I just wish I knew why Matlab was doing it this way, and the issues around it in the "calendrical computations" area, so I could do a more complete fix. I like to have a decent understanding of the code I'm writing and the behavior it's trying to do. That way, I may be able to catch other related issues in that area (like, maybe leap seconds or daylight saving time have similar effects) and get them right too, maybe in an "elegant" way in terms of the classes' internal data representation, instead of fixing individual problems in a patchwork manner as they're discovered.

In particular, maybe this "leap day / add 1 calyear" behavior isn't limited to how datetime.plus works, but somehow involves the internal behavior of how the semantics of datetime constructors, datenum(), and calendarDuration interact, in some maybe-undocumented manner. Like, should this behavior be accomplished in a more general manner than sticking a special-case "if date 1 is Leap Day, and the resulting date is not a leap year, subtract 1 day to make it Feb 28" bit into datetime.plus and datetime.minus?

Either way, I plan to make this behavior change to try to match Matlab, even if I can't find more thorough documentation on it.

apjanke commented 3 weeks ago

Okay, I think this is actually a bigger issue than just leap days. I think it applies to any calyears or calmonths calendarDuration addition that would end up with a "denormalized" date where the day-of-month value is larger than the number of days in that month. I think Matlab's caldur behavior "caps" or rounds it down to the last day within that month, and then preserves the time-of-day part.

For example, January has 31 days. February has either 28 or 29 days. So, take 2004-01-31 and add 1 calendar month, or 1 cal year and 1 cal month. Tablicious currently does this:

>> datetime(2004, 1, 31, 3, 4, 5) + calmonths(1)
ans =
 02-Mar-2004 03:04:05
>> datetime(2004, 1, 31, 3, 4, 5) + (calmonths(1) + calyears(1))
ans =
 03-Mar-2005 03:04:05
>>

But I think, to match Matlab, it should produce Feb 29 2004 and Feb 28 2005 instead.

This is different from how the datenum function and datetime constructor work, where denormalized days "carry over" or advance the value into the next month. E.g., datetime(2005, 2, 31) produces 2005-03-03.

The logic

So, I think the fix here is to change the calendarDuration addition logic to be:

  1. Convert the date values to datevec or equivalent, in dv.
  2. Add the Years and Months componenents to those parts of dv.
  3. Don't call datenum() or datetime() to convert the result yet!
  4. Check dv for any dernormalized (y, m, d) combinations where day is higher than the last day of that month, and set those days to the actual last day of that month.
  5. Then add the caldays component. This time, let resulting "denormalized" values roll over into the next or previous month(s) instead of capping to the last day of month.
  6. I think here, you need to normalize, maybe by converting back to datetime, to handle DST and leap seconds and the like.
  7. Then add the Time component. Again, let denormalized values roll over.
  8. Convert the result to datetime if it's not already of that type.

And subtraction would do the same thing, except multiplying all the caldur components by -1 first.

I think this agees with the Matlab calendarDuration doco page which says, in the Tips section, "When you add a calendarDuration array that contains more than one unit to a datetime, MATLAB® always adds the larger units first.".

Note that this means that calendarDuration values do not have the normal commutative and associative math behavior that regular numbers do. (Even beyond floating-point rounding issues.) These following operations will not necessarily produce the same result.

t + calyears(1) + calmonths(2) + caldays(1)
t + caldays(3) + calyears(1) + calmonths(2)
t + (caldays(3) + calyears(1) + calmonths(2))

Data structure

While I'm in here, I think Tablicious's calendarDuration structure can be refactored to simplify it. The Sign property can be removed, letting the sign of the Years/Months/Days/Time numeric values handle it. Plus, Years and Months can be combined into a single Months property, since here a calyear is defined to be 12 calendar months. Splitting them in to years, months, and quarters can be done as part of the display formatting instead. I think that'll make for simpler code in the implementation, including in the display formatting. Plus, smaller data structure using less memory, though that probably doesn't matter, because who's going to be making big calendarDuration arrays? Unless we do so as a temporary value in the colon() implementation.

Maybe drop the IsNan property too, and just carry NaN values in the Months/Days/Time properties, normalizing them to all be NaN when any is set to NaN. They're private-access, so we can control that.

Fractional cal year and month values

And I think we need to prohibit fractional (non-integer) values for calyears and calmonths components. Because they're variable-length things, doing fractional operations on them may not be well-defined, or if they are, may not really be sensible, or at least would be pretty difficult to implement. I think Matlab prohibits them. We should too, to match that behavior, and keep our code and behavior simpler.

Like, this here. That shouldn't be a thing.

>> calmonths(0.5)
ans =
0.5mo
>>
jgpallero commented 3 weeks ago

For example, January has 31 days. February has either 28 or 29 days. So, take 2004-01-31 and add 1 calendar month, or 1 cal year and 1 cal month. Tablicious currently does this:

>> datetime(2004, 1, 31, 3, 4, 5) + calmonths(1)
ans =
 02-Mar-2004 03:04:05
>> datetime(2004, 1, 31, 3, 4, 5) + (calmonths(1) + calyears(1))
ans =
 03-Mar-2005 03:04:05
>>

But I think, to match Matlab, it should produce Feb 29 2004 and Feb 28 2005 instead.

In Matlab (R2024a) the results are

>> datetime(2004, 1, 31, 3, 4, 5) + calmonths(1)
ans = 
  datetime
   29-Feb-2004 03:04:05
>> datetime(2004, 1, 31, 3, 4, 5) + (calmonths(1) + calyears(1))
ans = 
  datetime
   28-Feb-2005 03:04:05

Like, this here. That shouldn't be a thing.

>> calmonths(0.5)
ans =
0.5mo
>>

calyears() and calmonths() in Matlab works only with integer values of years and months. There ir also a caldays() function

>> calyears(9)
ans = 
  calendarDuration
   9y
>> calmonths(9)
ans = 
  calendarDuration
   9mo
>> caldays(9)
ans = 
  calendarDuration
   9d
>> calyears(9.1)
Error using calyears (line 38)
Input data must be a numeric array containing real, integer values. Use the CALMONTHS or CALDAYS functions to create calendar durations shorter than one year, or use the YEARS function to
create fractional fixed-length years as durations. 
>> calmonths(9.1)
Error using calmonths (line 30)
Input data must be a numeric array containing real, integer values. Use the CALDAYS function to create calendar durations shorter than one month. 
>> caldays(9.1)
Error using caldays (line 38)
Input data must be a numeric array containing real, integer values. Use the HOURS, MINUTES or SECONDS functions to create durations shorter than one day, or use the DAYS function to create
fractional fixed-length days as durations.
apjanke commented 3 weeks ago

Okay, I think I've got a fix for the bogus upper-bound behavior over on branch caldur-math-fixes, in https://github.com/apjanke/octave-tablicious/commit/b26975f1c59b7c77e1dcce28cfee5f81eb353a1a. This is a partial fix for this bug report.

This fixes basically an off-by-one error in the upper-bound test and loop logic in the calendarDuration part of datetime.colon.

This also changes the datetime.colon logic for calendarDuration to multiply the inc (caldur) and add it to lo for each successive element, to get rid of "jigger" from varying-length periods. This will make it respect the month-end "capping" normalization, once that gets added to datetime.plus.

Also adds the missing mtimes method.

Here's the new behavior. Look better to you?

>> datetime([2020 2 29 21 0 15]):calyears(2):datetime([2025 1 1 0 0 0])
ans =
29-Feb-2020 21:00:15   01-Mar-2022 21:00:15   29-Feb-2024 21:00:15
>> dt = datetime([2020 2 29 21 0 15])
dt =
 29-Feb-2020 21:00:15
>> (dt:calmonths(1):(dt+calmonths(14)))'
ans =
29-Feb-2020 21:00:15
29-Mar-2020 21:00:15
29-Apr-2020 21:00:15
29-May-2020 21:00:15
29-Jun-2020 21:00:15
29-Jul-2020 21:00:15
29-Aug-2020 21:00:15
29-Sep-2020 21:00:15
29-Oct-2020 21:00:15
29-Nov-2020 21:00:15
29-Dec-2020 21:00:15
29-Jan-2021 21:00:15
01-Mar-2021 21:00:15
29-Mar-2021 21:00:15
29-Apr-2021 21:00:15
>>

Notice how the 01-Mar normalization toward the end didn't "stick" for subsequent elements.

Should add some regression tests for this, but I'm short on free time.

jgpallero commented 3 weeks ago

Thank you for the fix!

The results of your examples in Matlab are:

>> datetime([2020 2 29 21 0 15]):calyears(2):datetime([2025 1 1 0 0 0])
ans = 
  1×3 datetime array
   29-Feb-2020 21:00:15   28-Feb-2022 21:00:15   29-Feb-2024 21:00:15
>> dt = datetime([2020 2 29 21 0 15])
dt = 
  datetime
   29-Feb-2020 21:00:15
>> (dt:calmonths(1):(dt+calmonths(14)))'
ans = 
  15×1 datetime array
   29-Feb-2020 21:00:15
   29-Mar-2020 21:00:15
   29-Apr-2020 21:00:15
   29-May-2020 21:00:15
   29-Jun-2020 21:00:15
   29-Jul-2020 21:00:15
   29-Aug-2020 21:00:15
   29-Sep-2020 21:00:15
   29-Oct-2020 21:00:15
   29-Nov-2020 21:00:15
   29-Dec-2020 21:00:15
   29-Jan-2021 21:00:15
   28-Feb-2021 21:00:15
   29-Mar-2021 21:00:15
   29-Apr-2021 21:00:15

Apparently there is still present the problem of 28-Feb or 01-March when dealing with leap years and durations of a whole year

apjanke commented 2 weeks ago

Apparently there is still present the problem of 28-Feb or 01-March when dealing with leap years and durations of a whole year

That's expected! I haven't gotten to that part yet. Need to make a modification to datetime.plus to deal with it by doing "month-end capping". But this change to colon means that once I've done that, colon will respect it and should produce the correct result.

apjanke commented 2 weeks ago

Okay, I think I've got a fix for the last-day-of-month capping, plus a couple other things, in https://github.com/apjanke/octave-tablicious/commit/87f6afd5974845fb6c7a4d0fd4dcfdab42be9ba6.

What do you think of this behavior?

>> dt = datetime(2020, 2, 29)
dt =
 29-Feb-2020
>> [dt:calmonths(1):datetime(2021,4,1)]'
ans =
29-Feb-2020
29-Mar-2020
29-Apr-2020
29-May-2020
29-Jun-2020
29-Jul-2020
29-Aug-2020
29-Sep-2020
29-Oct-2020
29-Nov-2020
29-Dec-2020
29-Jan-2021
28-Feb-2021
29-Mar-2021
>>
>> dt = datetime(2020, 1, 31)
dt =
 31-Jan-2020
>> [dt:calmonths(1):datetime(2021,4,1)]'
ans =
31-Jan-2020
29-Feb-2020
31-Mar-2020
30-Apr-2020
31-May-2020
30-Jun-2020
31-Jul-2020
31-Aug-2020
30-Sep-2020
31-Oct-2020
30-Nov-2020
31-Dec-2020
31-Jan-2021
28-Feb-2021
31-Mar-2021
>>

Found a couple more problems while I was in there. Notably, the lo:inc:hi behavior, when hi is less than lo and inc is effectively negative, still just produces an empty array. Need to add a special case to handle that.

jgpallero commented 2 weeks ago

Looks great! Thank you very much!