chronotope / chrono-tz

TimeZone implementations for rust-chrono from the IANA database

Very large memory size for DateTime carrying Tz from chrono-tz #27

Open breezewish opened 5 years ago

breezewish commented 5 years ago
use chrono::prelude::*;
use chrono_tz::US::Pacific;
use std::mem::size_of_val;

fn main() {
    // DateTime<Tz> from chrono-tz
    let pacific_time = Pacific.ymd(1990, 5, 6).and_hms(12, 30, 45);
    println!("size_of date time with tz = {}", size_of_val(&pacific_time));

    // DateTime<Utc>
    let dt = Utc.ymd(2014, 7, 8).and_hms(9, 10, 11);
    println!("size_of date time = {}", size_of_val(&dt));

    // DateTime<FixedOffset>
    let fixed_dt = FixedOffset::east(9 * 3600).ymd(2014, 7, 8).and_hms_milli(18, 10, 11, 12);
    println!("size_of date time with fixed offset = {}", size_of_val(&fixed_dt));
}

gives output:

size_of date time with tz = 48
size_of date time = 12
size_of date time with fixed offset = 16

As you can see, a DateTime carrying chrono-tz's TimeZone is a very expensive structure, occupying 48 bytes. This is bad for cache utilization and makes copying the structure around expensive.

neoeinstein commented 3 years ago

I have an idea for fixing this up and potentially improving the speed of the timezone calculations. I'll see if I can put together a PR of some sort this week.

pitdicker commented 4 months ago

Last year I did some calculations on the minimum size needed for a TzOffset.

It currently has the definition:

pub struct TzOffset {
    tz: Tz, // ~600 variants so at least 16-bit
    offset: FixedTimespan,
}
pub struct FixedTimespan {
    pub utc_offset: i32,
    pub dst_offset: i32,
    pub name: &'static str, // pointer + usize
}

Combined, this type needs 26 bytes (2 for tz, 4 + 4 for the two offsets, 16 for the &'static str) with 8-byte alignment on 64-bit platforms. When added to the 12-byte date-time stored in DateTime it becomes 12 + 4 (padding) + 26 + 6 (padding) = 48 bytes.
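To double-check the arithmetic above, here is a minimal sketch that mirrors the layout using only the standard library. TzId, NaiveLike and DateTimeLike are hypothetical stand-ins (a u16 for the Tz enum, three 4-byte fields for the 12-byte date-time); they are not the real chrono or chrono-tz types. On a typical 64-bit target, sizes() should report 24, 32 and 48 bytes.

```rust
use std::mem::size_of;

// Stand-in for the `Tz` enum: ~600 variants need a 16-bit discriminant.
#[allow(dead_code)]
struct TzId(u16);

// Same fields as the FixedTimespan definition quoted above.
#[allow(dead_code)]
struct FixedTimespan {
    utc_offset: i32,
    dst_offset: i32,
    name: &'static str, // pointer + usize
}

#[allow(dead_code)]
struct TzOffset {
    tz: TzId,
    offset: FixedTimespan,
}

// Assumed stand-in for the 12-byte date-time with 4-byte alignment.
#[allow(dead_code)]
struct NaiveLike {
    date: i32,
    secs: u32,
    frac: u32,
}

// Assumed stand-in for DateTime<Tz>: date-time plus the offset type.
#[allow(dead_code)]
struct DateTimeLike {
    datetime: NaiveLike,
    offset: TzOffset,
}

/// Sizes in bytes of (FixedTimespan, TzOffset, DateTimeLike).
fn sizes() -> (usize, usize, usize) {
    (
        size_of::<FixedTimespan>(),
        size_of::<TzOffset>(),
        size_of::<DateTimeLike>(),
    )
}
```

The 26 bytes of payload in TzOffset round up to 32 because of the 8-byte alignment imposed by the &'static str, and the 44 bytes in the combined struct round up to 48 for the same reason.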

How many bits do we actually need?

Combined, that would put the minimum size for TzOffset at 42 bits.

Optimization with a medium-sized table

Currently we store the abbreviations as a large number of tiny slices, which adds considerable overhead to the binary. Concatenating them in a ~500-character string as I proposed above is one option.

Alternatively, we could make a table of all Tz enum variant and abbreviation combinations. Each abbreviation would be stored in a [u8; 6], with the first byte being the length. I estimate the table to have ~1500 entries and be ca. 13 kB. Compared to how we currently store the data, that might not even be an increase in binary size.
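A rough sketch of what one entry of such a table could look like, assuming a u16 stands in for the Tz enum variant. Abbr, Entry and TABLE are hypothetical names for illustration only, and the three rows are made-up sample data, not generated from the IANA database.

```rust
/// Abbreviation stored inline: byte 0 is the length, bytes 1..=5 the text.
#[derive(Clone, Copy)]
struct Abbr([u8; 6]);

impl Abbr {
    /// Builds an entry at compile time; at most 5 bytes of text fit.
    const fn new(s: &str) -> Abbr {
        let bytes = s.as_bytes();
        assert!(bytes.len() <= 5, "abbreviation too long for [u8; 6]");
        let mut buf = [0u8; 6];
        buf[0] = bytes.len() as u8;
        let mut i = 0;
        while i < bytes.len() {
            buf[i + 1] = bytes[i];
            i += 1;
        }
        Abbr(buf)
    }

    /// Returns the stored text without the length prefix.
    fn as_str(&self) -> &str {
        let len = self.0[0] as usize;
        std::str::from_utf8(&self.0[1..1 + len]).unwrap_or("")
    }
}

/// One (time zone, abbreviation) combination.
#[allow(dead_code)]
struct Entry {
    tz: u16, // assumed stand-in for the Tz enum variant
    abbr: Abbr,
}

// Made-up sample rows; a real table would be generated at build time.
static TABLE: [Entry; 3] = [
    Entry { tz: 0, abbr: Abbr::new("PST") },
    Entry { tz: 0, abbr: Abbr::new("PDT") },
    Entry { tz: 1, abbr: Abbr::new("CET") },
];
```

With this layout each entry takes 8 bytes, so ~1500 entries come to roughly 12 kB, in the same ballpark as the estimate above.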

Just 11 bits are enough to index into the table and get the Tz enum variant and an abbreviation. That would bring the bits needed in TzOffset down to 17 (offset from UTC) + 3 (dst) + 11 (table index) = 31 bits, i.e. 4 bytes.
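To illustrate the 17 + 3 + 11 split, a minimal sketch of packing the three fields into a u32. PackedTzOffset, the field order, and treating the 3 DST bits as an opaque code are all assumptions made for this example; none of it is an existing chrono-tz type or a settled design.

```rust
const OFFSET_BITS: u32 = 17; // UTC offset in seconds, two's complement
const DST_BITS: u32 = 3; // opaque DST code (encoding left unspecified here)
const INDEX_BITS: u32 = 11; // index into the (Tz, abbreviation) table

/// Hypothetical packed form: bits 0..=16 offset, 17..=19 DST, 20..=30 index.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct PackedTzOffset(u32);

impl PackedTzOffset {
    /// Returns None if any field does not fit in its bit budget.
    fn new(utc_offset_secs: i32, dst_code: u8, table_index: u16) -> Option<Self> {
        let min: i32 = -(1i32 << (OFFSET_BITS - 1));
        let max: i32 = (1i32 << (OFFSET_BITS - 1)) - 1;
        if utc_offset_secs < min || utc_offset_secs > max {
            return None;
        }
        if u32::from(dst_code) >= (1u32 << DST_BITS) {
            return None;
        }
        if u32::from(table_index) >= (1u32 << INDEX_BITS) {
            return None;
        }
        // Keep only the low 17 bits of the signed offset.
        let offset = (utc_offset_secs as u32) & ((1u32 << OFFSET_BITS) - 1);
        let dst = u32::from(dst_code) << OFFSET_BITS;
        let index = u32::from(table_index) << (OFFSET_BITS + DST_BITS);
        Some(PackedTzOffset(offset | dst | index))
    }

    /// Sign-extends the 17-bit offset back to an i32.
    fn utc_offset_secs(self) -> i32 {
        ((self.0 << (32 - OFFSET_BITS)) as i32) >> (32 - OFFSET_BITS)
    }

    fn dst_code(self) -> u8 {
        ((self.0 >> OFFSET_BITS) & ((1u32 << DST_BITS) - 1)) as u8
    }

    fn table_index(self) -> u16 {
        ((self.0 >> (OFFSET_BITS + DST_BITS)) & ((1u32 << INDEX_BITS) - 1)) as u16
    }
}
```

For example, PackedTzOffset::new(-28800, 1, 42) would round-trip a UTC-8 offset through the 17-bit field, while an offset of 70000 seconds would be rejected because 17 signed bits only cover about ±18 hours. The top bit of the u32 stays unused.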

DateTime<Tz> could then be 16 bytes, just like DateTime<FixedOffset>.

djc commented 4 months ago

I just want to warn that I don't think there's overwhelming evidence that the size of these types is causing problems for lots of people, so optimizations here should be carefully balanced against the amount of complexity they introduce.