BurntSushi / jiff

A date-time library for Rust that encourages you to jump into the pit of success.
The Unlicense
1.73k stars 29 forks source link

add performance focused APIs #17

Open BurntSushi opened 3 months ago

BurntSushi commented 3 months ago

In building out Jiff, my focus has generally not been on performance. There are some Criterion benchmarks, but they are by no means exhaustive. I would love to have more added.

While I think there's always room to optimize the implementation (and if you have ideas that require major refactoring, please file an issue), this issue is more about API changes or additions that are needed to improve performance. For example, I'm thinking about things like "datetime arithmetic requires the use of Span, and Span is a very bulky type." At time of writing, its size is 64 bytes. I presume that this means that there is some non-trivial amount of time being spent in various operations on Span memcpying around this enormous bag of bytes.

Getting rid of Span or splitting it up isn't really an option as far as I'm concerned. So in order to make things go faster, I think we really just need to be able to provide APIs that operate on more primitive data types. Like, for example, an i64 of nanoseconds. This would come with various trade-offs, so I'd expect these APIs to be a bit more verbosely named.

Here's a probably non-exhaustive list of API design choices that likely have an impact on performance:

Anyway, before we get too focused on API design, I would like to see concrete use cases first in the form of executable code and/or benchmarks.

Diggsey commented 2 months ago

I think Zoned being !Copy isn't too much of an issue given what the type is designed to do. I would expect the Timestamp type to be far more commonly used in practice.

Timezones only really come into play in the front-end (which Rust is not typically used for...) or for some specific kinds of applications such as calendars, where events need to be scheduled at a "civil time" in the future.

Overall this crate seems very well designed, but if I have one criticism it's that I think it over-emphasizes the use of Zoned. Clearly there are cases when you need to store a runtime timezone, but IME they are few and far between (at least in the domains where Rust is most popular). Having this functionality, and having it interoperate perfectly with timestamps is very useful, but starting out with "The primary type in this crate is Zoned." is going to mislead some users I think.

If you intend this crate to displace chrono and time as "the" general purpose time crate, I would like to suggest:

BurntSushi commented 2 months ago

I think (3) is #24.

I think (2) is probably #21.

I am unsure about (1). I'm not sure if I agree that Timestamp is the right "default" to emphasize. Perhaps time zone aware datetimes aren't being used as much as they should be. I'm not sure. With that said, I am open to a jiff-core-style crate. But I think the API design and split will be challenging. In particular, I think it will be hard to do without sacrifices to API comprehensibility, which I am loathe to do. But like I said, I am open to it. I think we just need to collect more data. Another option is making time zone support a crate feature. Then other libraries can depend on jiff with a stripped down set of features.

Diggsey commented 2 months ago

Perhaps time zone aware datetimes aren't being used as much as they should be.

Heh, my experience is the opposite - in both JS and Python I've experienced significant extra complexity and bugs from their datetime types coming with a "packaged" timezone being used in places where timezones were not relevant.

In Python the main issue has been use of datetime where a simple timestamp would be more appropriate, since 99% of cases I'm dealing with a point in time, and the timezone is only relevant in so far as displaying that instant to the user in their preferred timezone. Storage should (in this case) be done in UTC and that's what the database expects.

In JS, only a couple of days ago I encountered a bug at work where a Date was being used to display a date (no time) but because the user's local timezone was offset it ended up showing the wrong date in the UI, despite being stored correctly in the backend. (Why did it use a JS Date? Because that's what the 3rd party date picker component was built around...) Variants of this bug are everywhere in JS. In this case a civil::Date with no timezone would have been more appropriate.

In contrast, I haven't had any issues with chrono or time in Rust since they don't really have the capacity to mess up the timezone. They just don't support the rare case where you do care about it.

It's honestly quite hard to think of use-cases where I'd want to store a datetime + timezone. Storing a future event in a calendar is the go-to example, but even then the only motiviation for that (vs storing a timestamp) is to deal with the edge-case where timezone data changes between the event being scheduled and it actually occurring, and doing that comes at a cost: it becomes impossible to schedule events at "ambiguous" times (at least with the default serialization implementation of Zoned).

If I want to build a best-possible calendar app, what's more likely? That the timezone data changes in a meaningful way, or that someone wants to schedule an event during a DST change... Both seem similarly likely.

(As an aside: maybe there's a case for changing the serialization to disambiguate these cases... It doesn't help if the timezone data changes and the event lies on a DST change, but that's a legitimate ambiguity in that case.)

Another option is making time zone support a crate feature.

Yeah that also works, I only didn't suggest it because I assumed you'd want to make the timezone functionality be default-enabled, and I really hate default features in Rust, since nobody remembers to turn them off.

BurntSushi commented 2 months ago

It is had to disentangle your examples of bugs with the underlying datetime library. Both of the datetime libraries you mention have very serious deficiencies.

it becomes impossible to schedule events at "ambiguous" times (at least with the default serialization implementation of Zoned).

I don't see why this would be true. You can't schedule an event at 2am on March 10, 2024 in New York because that clock time didn't exist. That seems right to me?

Yeah that also works, I only didn't suggest it because I assumed you'd want to make the timezone functionality be default-enabled, and I really hate default features in Rust, since nobody remembers to turn them off.

Yes, it would be enabled by default.

I think we overall need to collect more data. That will take some time and require users. I don't know that you are wrong, but I think you are.

BurntSushi commented 2 months ago

My very high level idea is that datetime libraries have sucked for so long, and this may have been inhibiting use cases or causing other problems that are falsely attributed. This is just an opinion though. So basically, I am excited to see what, if anything, Jiff unlocks for folks. IMO, its killer feature is not just Zoned but also Span. I think Span is going to take more time for folks to discover because I don't think anything like it has existed yet (until Temporal).

Diggsey commented 2 months ago

I don't see why this would be true. You can't schedule an event at 2am on March 10, 2024 in New York because that clock time didn't exist. That seems right to me?

That's a non-existent time, not an ambiguous time. If I schedule an event at a time that occurs twice due to clocks being put backwards, then jiff's Zoned type appears to be able to correctly disambiguate the actual time, but if I serialize/deserialize it, then that information will be lost. I would expect the default serialization format to faithfully round-trip the value.

IMO, its killer feature is not just Zoned but also Span. I think Span is going to take more time for folks to discover because I don't think anything like it has existed yet (until Temporal).

It's nice, but it's definitely not new. The interval type in PostgreSQL has always worked this way.

BurntSushi commented 2 months ago

That's a non-existent time, not an ambiguous time.

Ah yeah, both Temporal and Jiff refer to it as ambiguous. The thing that is ambiguous is the offset.

but if I serialize/deserialize it, then that information will be lost

No, it is not! This is a key feature of Jiff:

use jiff::{civil::date, tz::TimeZone};

fn main() -> anyhow::Result<()> {
    // The ambiguous civil datetime:
    let dt = date(2024, 11, 3).at(1, 30, 0, 0);
    // By default, the "compatible" disambiguation strategy is used. Which
    // selects the later time for a gap and the earlier time for a fold:
    let zdt = date(2024, 11, 3).at(1, 30, 0, 0).intz("America/New_York")?;
    assert_eq!(zdt.to_string(), "2024-11-03T01:30:00-04:00[America/New_York]");
    // One can override and pick whichever strategy they want:
    let tz = TimeZone::get("America/New_York")?;
    let zdt = tz.to_ambiguous_zoned(dt).later()?;
    // Notice the offset change from above:
    assert_eq!(zdt.to_string(), "2024-11-03T01:30:00-05:00[America/New_York]");

    Ok(())
}

See jiff::tz::AmbiguousZoned for more details.

The fact that Jiff losslessly roundtrips Zoned values is a key comparison point between it and Chrono.

It's nice, but it's definitely not new. The interval type in PostgreSQL has always worked this way.

It doesn't though? I'm not intimately familiar with PostgreSQL's interval type, but from a quick glance I can immediately see several differences:

So for example, on that last point, this is what PostgreSQL does:

postgres=# SELECT INTERVAL '1 months 27 days' + INTERVAL '3 days';
   ?column?
---------------
 1 mon 30 days
(1 row)

postgres=# SELECT INTERVAL '1 months 27 days' + INTERVAL '4 days';
   ?column?
---------------
 1 mon 31 days
(1 row)

postgres=# SELECT INTERVAL '1 months 27 days' + INTERVAL '5 days';
   ?column?
---------------
 1 mon 32 days
(1 row)

Compare this with Jiff:

use jiff::{civil::date, ToSpan};

fn main() -> anyhow::Result<()> {
    let span = 1.month().days(27);
    // The span has non-uniform units (months), so a
    // reference date is required to do addition. Otherwise,
    // you get an error:
    assert!(span.checked_add(3.days()).is_err());
    // So pick a date:
    assert_eq!(
        span.checked_add((3.days(), date(2024, 6, 1)))?,
        1.month().days(30)
    );
    // So pick a different date and notice that the span
    // returned reflects the actual duration of the
    // corresponding months:
    assert_eq!(
        span.checked_add((3.days(), date(2024, 1, 1)))?,
        2.months().days(1)
    );

    Ok(())
}

This all comes from Temporal and RFC 5545.

Diggsey commented 2 months ago

No, it is not! This is a key feature of Jiff:

Ah, my mistake - I didn't realise it would use the timezone offset as a disambiguator like that, nice!

It doesn't keep track of each unit separately. So it can't distinguish between 120 minutes and 2 hours.

It keeps track of units with variable ratios separately:

PostgreSQL stores interval values as months, days, and microseconds.

It doesn't keep track of each unit separately. So it can't distinguish between 120 minutes and 2 hours.

Because 120 minutes is always 2 hours, so it makes sense for them to be equal?

It doesn't seem to support rounding/balancing.

I can't say I've taken exhaustive stock of all the PostgreSQL date/time related functions, so perhaps this doesn't exist anywhere? I would be surprised though.

It doesn't seem to support using a reference date for adding spans, or at least lets you add spans together even when they have non-uniform units.

I'm not quite sure I understand this? Months are added separately to days, which are added separately to microseconds, so it doesn't matter what start date you have, date + (interval + interval) is always the same as (data + interval) + interval.

BurntSushi commented 2 months ago

Because 120 minutes is always 2 hours, so it makes sense for them to be equal?

I didn't say that it didn't make sense. But the fact that Jiff can distinguish them (but also compare them as equal via Span::compare) is a feature. Sometimes you want to express intervals in smaller units even when they can be balanced up into bigger units. It's use case dependent and there are no universal rules. So Jiff, inspired by Temporal, lets the caller choose which units are used. For example, if you wanted to show the length of a film, that is by convention expressed in units of minutes, even when it exceeds 1 hour.

I'm not quite sure I understand this? Months are added separately to days, which are added separately to microseconds, so it doesn't matter what start date you have, date + (interval + interval) is always the same as (data + interval) + interval.

The example I showed above demonstrates the issue. PostgreSQL can't balance days up into months. It does balance months up into years, because as you say, there's always 12 months in a year. But it can't do it for days. So you end up with things like 1 month 32 days instead of 2 months N days.

You're focused on the correctness of the result. A Span isn't just about correctness. It's about being able to freely move between any units of a duration and do it correctly. I'm not saying PostgreSQL is incorrect, I'm saying that it isn't nearly as flexible as what Jiff provides.

My understanding of the Temporal design goal here was to 1) satisfy RFC 5545 (iCalendar) and 2) put all the hard logic of dealing with durations into the datetime library such that localizing a duration becomes a much simpler task. I asked more about it here: https://github.com/tc39/proposal-temporal/issues/2915

Diggsey commented 2 months ago

PostgreSQL can't balance days up into months.

reference_date + interval - reference_date will round sub-days up to days although not up to months.

But I think we can agree that: 1) The representation used by PostgreSQL does not prevent such functionality being provided. 2) Jiff provides more functionality out of the box on top of that representation.

BurntSushi commented 2 months ago

Yes. And we haven't even gotten to the time zone aware aspects of Span when used with a Zoned in Jiff. From the PostgreSQL docs:

The default time zone is specified as a constant numeric offset from UTC. It is therefore impossible to adapt to daylight-saving time when doing date/time arithmetic across DST boundaries.