daypack-dev / timere

OCaml date time handling and reasoning suite
MIT License
68 stars 7 forks

Better time zone workflow for JS target #46

Closed darrenldl closed 2 years ago

darrenldl commented 2 years ago

Right now Time_zone.local attempts construction using name guesses (strings) from Timedesc_tzlocal.local, but this would fail if the tzdb.none backend is picked.

Runtime retrieval of the tzdb JSON file without the result being typed as an Lwt promise was not possible, IIRC, so perhaps one way forward is to expose the string guesses as an alias (Timedesc.Time_zone.local_string, perhaps).
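
One possible shape for that alias, just to pin the idea down (the exact signature is a guess - Timedesc_tzlocal's guesses are a list of strings - and this is not part of the current API):

(* Hypothetical signature for the proposed alias, not an existing timedesc value. *)
module type Local_string_hint = sig
  (* Raw time zone name guesses from Timedesc_tzlocal, so a JS consumer can
     decide which tzdb JSON file to fetch (asynchronously) on its own. *)
  val local_string : unit -> string list
end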

darrenldl commented 2 years ago

@glennsl Any thoughts?

glennsl commented 2 years ago

Hmm, I'm not sure where that would get us. Once you have the guesses, what would you do with them?

Also, as I mentioned in #42, it is possible to query the JS environment's time zone database using Intl.DateTimeFormat, but that requires abstracting away the lookup table as you have to query the offset for every timestamp. And I'm not sure if this would be too limited for how the time zones are currently used.

darrenldl commented 2 years ago

Once you have the guesses, what would you do with them?

This would allow loading the appropriate JSON file of the time zone, instead of loading every single time zone in tzdb.

Also, as I mentioned in https://github.com/daypack-dev/timere/issues/42, it is possible to query the JS environment's time zone database using Intl.DateTimeFormat, but that requires abstracting away the lookup table as you have to query the offset for every timestamp. And I'm not sure if this would be too limited for how the time zones are currently used.

That would likely be too slow at least for timere, correct. For timedesc, it may just be easier to ask the user to construct a JS time and convert it into Timedesc.t, if desirable at all.

glennsl commented 2 years ago

This would allow loading the appropriate JSON file of the time zone, instead of loading every single time zone in tzdb.

Ah, so off-loading the async work to the consumer. Yeah that would be a nice option to have. It does have some significant architectural repercussions though, as the consumer will either have to pass this time zone around everywhere, or somehow deal with the possibility that the local timezone is not yet known.

That would likely be too slow at least for timere, correct.

It's definitely a trade-off, but apparently not so bad that the successor of the most widely used JS date time library makes any note of a noticeable performance impact in its documentation. It would be a nice option to have, and also nice to have some numbers to evaluate alternatives.

For timedesc, it may just be easier to ask the user to construct a JS time and convert it into Timedesc.t, if desirable at all.

Then you'll just have a fixed offset timezone with all the problems that come with that, right?

A third option might be to offer a reduced time zone database without all the historical baggage. Spacetime manages to get down to 47 kB that way, which is quite acceptable. Compared to moment-timezone at 479 kB and date-fns-timezone at 922 kB. And compared (unfairly) to timere's own unminified full tzdb at 5.4 MB.

darrenldl commented 2 years ago

It's definitely a trade-off, but apparently not so bad that the successor of the most widely used JS date time library makes any note of a noticeable performance impact in its documentation. It would be a nice option to have, and also nice to have some numbers to evaluate alternatives.

Timere needs access to the (full) transition table to resolve pattern matching queries.

Timedesc is comparable to Luxon yes, though I'll have to think about how this would work out - we'll have to replace the stored transition tables with an oracle that interacts with Intl from what I understand.

Then you'll just have a fixed offset timezone with all the problems that come with that, right?

True.

A third option might be to offer a reduced time zone database without all the historical baggage. Spacetime manages to get down to 47 kB that way, which is quite acceptable. Compared to moment-timezone at 479 kB and date-fns-timezone at 922 kB. And compared (unfairly) to timere's own unminified full tzdb at 5.4 MB.

Spacetime's approach (https://github.com/spencermountain/spacetime/blob/master/zonefile/iana.js) seems to be to just restrict itself to roughly the next 12 months - for Sydney we have

  "australia/sydney": {
    "offset": 10,
    "hem": "s",
    "dst": "04/03:03->10/02:02"
  },

which is just the next transition (without even the year).
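
For reference, that dst field is just a pair of local transition times in "MM/DD:HH->MM/DD:HH" form. A throwaway sketch of reading it (the function name is purely illustrative, not something Spacetime or timedesc provides):

(* Parse Spacetime's "MM/DD:HH->MM/DD:HH" dst string into two (month, day, hour)
   triples, e.g. "04/03:03->10/02:02" -> ((4, 3, 3), (10, 2, 2)). *)
let parse_spacetime_dst (s : string) : (int * int * int) * (int * int * int) =
  Scanf.sscanf s "%d/%d:%d->%d/%d:%d"
    (fun m1 d1 h1 m2 d2 h2 -> ((m1, d1, h1), (m2, d2, h2)))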

Similarly for https://github.com/vvo/tzdb. Both libraries seem to assume the application will always load the latest rebuilt library from NPM, and that you don't compute anything more than one year into the future. I'll add that I find their exact approach very fragile...trimming too much to save space.

If that is indeed what suffices for JS usage, I am happy to add a tzdb-js backend that only includes very recent months of data.

glennsl commented 2 years ago

Timere needs access to the (full) transition table to resolve pattern matching queries.

I see. Yeah that certainly makes this trickier.

Timedesc is comparable to Luxon yes, though I'll have to think about how this would work out - we'll have to replace the stored transition tables with an oracle that interacts with Intl from what I understand.

Yeah, for using Intl it seems to be a function tz -> timestamp -> offset.
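
In OCaml terms such an oracle could be as small as the type below; this is only a sketch to pin down the shape, not an existing timedesc type:

(* Hypothetical Intl-backed oracle: IANA zone name -> timestamp (seconds since
   the unix epoch) -> UTC offset in seconds at that instant. *)
type offset_oracle = string -> int64 -> int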

which is just the next transition (without even the year).

Does the transition really happen at different times every year? Or is it just that there's a (very small) possibility that it will?

If that is indeed what suffices for JS usage, I am happy to add a tzdb-js backend that only includes very recent months of data.

I think it's at least very unlikely that most front-end applications will deal with datetimes before 1970 and after, say, 10 years into the future. So filtering out anything outside that should already reduce the size quite a bit.

For our specific use case, which is quite time-centric, we have some historical data going back to the 90s I believe. But that's just used for machine learning on the back end. On the front end we don't show historical data from before 2015, and only look about a year into the future. And if we were to display historical data somehow it would still just be with dates (or more likely weeks at the smallest granularity), so no offset is really needed anyway.

I can also imagine other formats for the time zone database that would allow more flexibility with a smaller footprint, such as a rule-based format that enables computing the table on demand. A rule could be in the form { from: 1491062400, to: 1506787200, offset: 36000, transitions: ["04/03:03->10/02:02"] }. And even that could be compressed further by inferring from or to from the next or previous entry. But this probably requires both an encoder and decoder, as I assume the source you use for the time zone database is in table form already.
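
Sketching that rule format as an OCaml type, just to make the idea concrete (field names mirror the JSON above and are not taken from any existing library; to is renamed because it is an OCaml keyword):

(* One rule covers the span [from, until) and lists the recurring local-time
   transitions within it, from which concrete table entries can be expanded. *)
type rule = {
  from : int64;               (* start of validity, seconds since the unix epoch *)
  until : int64;              (* "to" in the JSON above *)
  offset : int;               (* base UTC offset in seconds *)
  transitions : string list;  (* e.g. ["04/03:03->10/02:02"] *)
}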

darrenldl commented 2 years ago

Does the transition really happen at different times every year? Or is it just that there's a (very small) possibility that it will?

For most places afaict, yes - I think government bodies tend to define it around "nth weekday of some month" or something like that? But no strict rule on a specific pattern being followed.

I actually don't know if the date then becomes fixed if we switch to ISO week calendar - I'll have to check.

I think it's at least very unlikely that most front-end applications will deal with datetimes before 1970 and after, say, 10 years into the future. So filtering out anything outside that should already reduce the size quite a bit.

For our specific use case, which is quite time-centric, we have some historical data going back to the 90s I believe. But that's just used for machine learning on the back end. On the front end we don't show historical data from before 2015, and only look about a year into the future. And if we were to display historical data somehow it would still just be with dates (or more likely weeks at the smallest granularity), so no offset is really needed anyway.

Fair enough - sounds like a tzdb-slim (or some better name) will be a good addition.

I can also imagine other formats for the time zone database that would allow more flexibility with a smaller footprint, such as a rule-based format that enables computing the table on demand. A rule could be in the form { from: 1491062400, to: 1506787200, offset: 36000, transitions: ["04/03:03->10/02:02"] }. And even that could be compressed further by inferring from or to from the next or previous entry. But this probably requires both an encoder and decoder, as I assume the source you use for the time zone database is in table form already.

Transitions in the general case cannot be algorithmically generated, though it might be possible for contemporary periods using the ISO week calendar as suggested above - I don't know for certain yet.

darrenldl commented 2 years ago

I checked and it seems to be the case for Australia/Sydney for 2021/2022 that you can state the split using a fixed ISO week date (varying year), but I don't have time to write code to check if that's always the case for all time zones in recent +/- N years at the moment.

Another (simpler) idea is to simply pass the table through a compression pass, which might already yield a fair bit of savings.

glennsl commented 2 years ago

I think government bodies tend to define it around "nth weekday of some month" or something like that?

Ah, of course they do :roll_eyes:

Transitions in the general case cannot be algorithmically generated,

I don't see why not. In the worst case you'd just have a "rule" for every transition, which is effectively the status quo (with a bit more overhead). Anyway, I'm not trying to say that the specific rule format I suggested would be optimal, or even better than a table, but just to seed the idea that the database doesn't have to be stored as a bulky lookup table, but could be (partly) computed on demand based on some kind of rules. Coming up with the actual rules, format and encoding process is of course still a non-trivial task.

I checked and it seems to be the case for Australia/Sydney for 2021/2022 that you can state the split using a fixed ISO week date (varying year), but I don't have time to write code to check if that's always the case for all time zones in recent +/- N years at the moment.

Promising. Thanks for checking! This is just optimization, and so not very time sensitive, but I might take this on as a fun challenge if and when I get some time.

Another (simpler) idea is to simply pass the table through a compression pass, which might already yield a fair bit of savings.

There's definitely lots of potential for compression, but they ought to be transferred in gzipped form already, and so I don't imagine traditional compression would do much for transfer size. It could help with memory use though, which is especially important on mobile.

It would also be easy to remove quite a bit of bloat just by transforming the entries from

[ "3920576400", { "is_dst": false, "offset": 3600 } ]

to

[ 3920576400, 0, 3600 ]

darrenldl commented 2 years ago

Re compression: I was primarily thinking of compressing at the marshalled level (so it'd be without the clutter of JSON already).

Also might be even better to switch to hashtable as the underlying store...

EDIT for context: the install process looks like sexp -> map -> marshalled string -> insert into .ml file, Timedesc unmarshals the string upon loading.
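
For anyone unfamiliar with that pipeline, the marshalling step is essentially a Marshal round-trip over the table map. A minimal sketch, with a placeholder type standing in for the real db (the actual generator and types differ):

(* Generator side: serialise the in-memory db, then splice the resulting string
   into a generated .ml file as a string literal. *)
let to_embedded_string (db : (string * (int64 * int) list) list) : string =
  Marshal.to_string db []

(* Library side: recover the db from the embedded string at load time. *)
let of_embedded_string (s : string) : (string * (int64 * int) list) list =
  Marshal.from_string s 0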

glennsl commented 2 years ago

Right, yeah I had no idea what that process was. Thanks for the context!

Here's another interesting idea. Try running this JS code on one of the JSON time zone objects:

{
  // delta-encode each transition time relative to the previous one;
  // the timestamps are strings in the JSON, so JS coerces them to numbers here
  var last = 0;
  tz.table.map(([time, {offset}]) => {
    let delta = time - last;
    last = time;
    return [delta, offset];
  })
}

It's not hard to see the recurring patterns, but also that there's a very small set of unique entries. Which means you could encode this in a two-level lookup table. E.g.:

{
  "L1": {
    1: [ 16934400, 3600 ],
    2: [ 14515200, 0 ],
    ...
  },
  "L2": [1, 2, 1, 2, 3, 2, 3, 4, 1, 2, 1, ...]
}

with the entries in L2 indexing into L1. And from this you could easily build the full lookup table on demand.

I imagine you could do much better than this by taking the recurring patterns into account, but this could already reduce the database size by 80% if we're lucky, I think.

Edit: Also, what's the significance of the is_dst flag? I've ignored it here, but assume it would line up pretty well regardless.
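
To make the decode-on-demand step concrete, here is a rough standalone sketch of the two-level scheme (illustrative types only, with is_dst left out as per the edit above; this is not timedesc's actual representation):

(* L1: unique (delta, offset) pairs; L2: one index into L1 per transition.
   Rebuilding the full table is a single pass that re-accumulates the deltas. *)
type encoded = {
  l1 : (int64 * int) array;  (* unique (delta to previous transition, offset) *)
  l2 : int array;            (* per-transition index into l1 *)
}

let encode (table : (int64 * int) array) : encoded =
  let seen = Hashtbl.create 16 in
  let uniq = ref [] in
  let n = ref 0 in
  let last = ref 0L in
  let l2 = Array.make (Array.length table) 0 in
  Array.iteri
    (fun i (time, offset) ->
      let key = (Int64.sub time !last, offset) in
      last := time;
      let idx =
        match Hashtbl.find_opt seen key with
        | Some j -> j
        | None ->
          let j = !n in
          Hashtbl.add seen key j;
          uniq := key :: !uniq;
          incr n;
          j
      in
      l2.(i) <- idx)
    table;
  { l1 = Array.of_list (List.rev !uniq); l2 }

let decode ({ l1; l2 } : encoded) : (int64 * int) array =
  let out = Array.make (Array.length l2) (0L, 0) in
  let last = ref 0L in
  Array.iteri
    (fun i j ->
      let delta, offset = l1.(j) in
      last := Int64.add !last delta;
      out.(i) <- (!last, offset))
    l2;
  out

As long as transition timestamps are strictly increasing, decode (encode table) returns the original table, and l1 stays small because the same (delta, offset) pairs recur year after year.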

darrenldl commented 2 years ago

Hm...adjusting to a different binary encoding would indeed save us a lot of space, but then the initialisation of timedesc could be significantly slower compared to just unmarshalling the string - I don't know if this is the case.

is_dst is largely for completeness - I don't really recall if it's actively used in the suite, but don't think it's used to any meaningful extent.

glennsl commented 2 years ago

Would it be hard to unmarshal on demand instead?

glennsl commented 2 years ago

A few examples of the potential effectiveness of this encoding:

darrenldl commented 2 years ago

Come to think of it, the main table is just a pair of arrays, so it's really about as efficient as one can be when maximising for both space efficiency and initialisation efficiency.

Would it be hard to unmarshal on demand instead?

The change would be transparent as type Timedesc.Time_zone.db is abstract, so we're free to do the on demand tricks internally.

darrenldl commented 2 years ago

A few examples of the potential effectiveness of this encoding: ...

Ah right, you're mapping that way...right, we can then use a byte/char as the index, with dynamic unmarshalling...yeah okay this sounds like a good idea.

darrenldl commented 2 years ago

Yeah okay, this is very doable, just that adding the relevant testing will take some time (on top of me lacking a functional desktop right now for fuzzing).

glennsl commented 2 years ago

Cool! Let me know if you want me to pitch in with something. I could for example write the encoding script. I don't know what you use as a source though.

glennsl commented 2 years ago

Actually, this would probably be useful for many other projects as well, across languages, so perhaps we should set up a separate repository for the time zone database, with ocaml and js packages to encode and decode. That would give it more real world testing, some help with keeping it up to date etc.

darrenldl commented 2 years ago

Cool! Let me know if you want me to pitch in with something. I could for example write the encoding script. I don't know what you use as a source though.

We need a Timedesc.Time_zone.t -> string serialiser and a string -> Timedesc.Time_zone.t deserialiser with roughly the following format:

| uniq offset count (int8) | uniq offset #0 in minutes (int16) | uniq offset #1 in minutes (int16) | ...
| timestamp in seconds since unix epoch (int64) | index into uniq offset table (int8) | is DST (int8) | ...

We can then just use string String_map.t to represent the encoded db and unmarshal into an entry for Time_zone.Db.entry String_map.t on the fly (or something like that).
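
A very rough sketch of the serialiser side of that layout, using Buffer from the standard library (illustrative only - the argument types, little-endian choice and helper name are mine, and the format timedesc actually ended up with differs):

(* Encode one zone: a leading table of unique offsets (in minutes, int16),
   then one (timestamp, offset index, is_dst) triple per transition.
   Assumes at most 256 unique offsets per zone. *)
let encode_zone
    ~(uniq_offsets_min : int array)
    ~(transitions : (int64 * int * bool) array) : string =
  let buf = Buffer.create 1024 in
  Buffer.add_uint8 buf (Array.length uniq_offsets_min);
  Array.iter (fun m -> Buffer.add_int16_le buf m) uniq_offsets_min;
  Array.iter
    (fun (time, offset_index, is_dst) ->
      Buffer.add_int64_le buf time;
      Buffer.add_uint8 buf offset_index;
      Buffer.add_uint8 buf (if is_dst then 1 else 0))
    transitions;
  Buffer.contents buf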

darrenldl commented 2 years ago

Actually, this would probably be useful for many other projects as well, across languages, so perhaps we should set up a separate repository for the time zone database, with ocaml and js packages to encode and decode. That would give it more real world testing, some help with keeping it up to date etc.

Hm...yeah interesting idea - our current representation is copied from the Rust chrono-tz crate (I hope I got the name right). I'm curious if they had similar concerns since they could be targeting use cases in WASM.

One issue with that is then we'll have to put extra time into making it future proof etc (which uh would exceed my free time budget : D).

glennsl commented 2 years ago

We can then just use string String_map.t to represent the encoded db and unmarshal into an entry for Time_zone.Db.entry String_map.t on the fly (or something like that).

Sounds like a good plan. I don't quite understand the rationale behind this format though. A lookup table for offsets would offer some savings, but only 50% of the offset field at most, since it's replacing an int16 with an int8. The offset is only 2/7ths the size of the entry as a whole, however, which means it would amount to a 15% reduction in total size at most, even disregarding the overhead of the L1 table.

What I show above is that, assuming is_dst is always in sync with offset, the number of unique entries converted to relative time is only 1/10th the full number of entries. So you could have an L1 table that is 10% the size of the full table, and an L2 table that is 15% the size of the full table (because we replace 7 bytes with a 1 byte index into L1). Which in total is a 75% reduction in size. Quite an improvement from 15%!

That would be something more like this:

| uniq entry count (int8) | delta #0 (int64) | is_dst #0 (int8) | offset #0 (int8) | delta #1 (int64) | ...
| index into uniq entry #0 (int8) | index into uniq entry #1 (int8) | index into uniq entry #2 (int8) | ...

I'm curious if they had similar concerns since they could be targeting use cases in WASM.

Rust WASM is already pretty bloated (I say as my jsoo bundle size just passed 10MB, unminified and whatnot, but still), so my guess would be no :laughing: It's a good question, and definitely something that would be worth looking into. But from a quick search of their issues I can't find anything.

One issue with that is then we'll have to put extra time into making it future proof etc (which uh would exceed my free time budget : D).

Ah, true! Maybe not at first then, but something we could have in mind for the future!

glennsl commented 2 years ago

But from a quick search of their issues I can't find anything.

Oh, searched the wrong repo! I found two issues mentioning size, neither of them related to WASM. The solutions they focus on seem to be various ways of filtering which time zones are included.

glennsl commented 2 years ago

The original time zone database actually seems to be in a rule-based format already.

Definitely not a simple format though. I think I prefer the compression scheme we've come up with here, even if it ends up a bit bulkier.

darrenldl commented 2 years ago

What I show above is that, assuming is_dst is always in sync with offset, the number of unique entries converted to relative time is only 1/10th the full number of entries. So you could have an L1 table that is 10% the size of the full table, and an L2 table that is 15% the size of the full table (because we replace 7 bytes with a 1 byte index into L1). Which in total is a 75% reduction in size. Quite an improvement from 15%!

Right, relative time - okay. This requires investigating the exact upper bound of the gaps...which will need another set of code...though I think 32-bit is a safe bet, and we can just make the code gen raise an exception when a delta doesn't fit in 32 bits.

glennsl commented 2 years ago

I was thinking we might just keep using int64, for simplicity. That will also allow using 0 as the starting point while still starting at "the beginning of history" (i.e. the first delta being Int64.min). Otherwise we'll also need a starting time field I think. Going from int64 to int32 will also just yield 1.5 percentage points or so additional reduction, because we've already reduced the total number of timestamps by 90%.

darrenldl commented 2 years ago

I added the serialisation side (with some more conservative choices in the layout for now), and the size of tzdb_marshalled.ml shrunk from 2.9M to 929K : D (on branch alt-tzdb-encoding)

Deserialisation is not implemented yet.

darrenldl commented 2 years ago

Swapping from 16-bit to 8-bit for the index shrinks it down to 667K

darrenldl commented 2 years ago

Made some further adjustments to reduce the width of delta and offset; size is now at 485K

darrenldl commented 2 years ago

The format should be relatively obvious from Time_zone.Compressed_table.add_to_buffer should you want to make further adjustments or roll a deserialiser (I personally prefer using angstrom).

glennsl commented 2 years ago

That's awesome! Almost 85% reduction in size! I'm surprised the encoding of offset needs to be this involved though, but I guess that makes sense for arcane historical reasons.

I see you've added a deserializer too now, so seems you have full control! I continue to be impressed about how fast you manage to make things happen after I suggest something. Thanks so much for this!

darrenldl commented 2 years ago

That's awesome! Almost 85% reduction in size! I'm surprised the encoding of offset needs to be this involved though, but I guess that makes sense for arcane historical reasons.

Yeah, always some oddities in tzdb sadly.

I'll make a different PR for reducing the number of years we include in tzdb (reorganising the tzdb backends a bit).

I see you've added a deserializer too now, so seems you have full control! I continue to be impressed about how fast you manage to make things happen after I suggest something. Thanks so much for this!

: D thanks!

Right now I'm doing the most boring part - testing the (de)serialisation (laptop not very good at running ever so slightly heavy tests...)

darrenldl commented 2 years ago

Okay, finished debugging - quickcheck catches a lot of errors as usual, namely that Int64.of_int32 sign-extends (i.e. repeats the sign bit - forgot this entirely) and overflow/underflow in some tables.

glennsl commented 2 years ago

I see it's been merged now. Nice work! It seems that the whole database is decompressed at the same time though, which means you'll still have to pay the memory and cpu cost of all the time zones even if you most likely just need one of them. This is especially unfortunate on mobile, which is generally more memory and cpu-constrained, but also power-constrained. And just having things in memory consumes quite a bit of power, as I understand it. Would it be possible to have a map with time zones compressed individually, and only decompress them as needed?

Also looking forward to seeing how much removing historic and far-future time zone transitions will affect the size now!

darrenldl commented 2 years ago

I see it's been merged now. Nice work!

: D

It seems that the whole database is decompressed at the same time though, which means you'll still have to pay the memory and cpu cost of all the time zones even if you most likely just need one of them. This is especially unfortunate on mobile, which is generally more memory and cpu-constrained, but also power-constrained. And just having things in memory consumes quite a bit of power, as I understand it. Would it be possible to have a map with time zones compressed individually, and only decompress them as needed?

Yep that's the intended direction, just wanted to make that a separate PR.

Also looking forward to seeing how much removing historic and far-future time zone transitions will affect the size now!

Indeed, what's a good default you reckon? +- 20 years?

glennsl commented 2 years ago

Yep that's the intended direction, just wanted to make that a separate PR.

Awesome!

Indeed, what's a good default you reckon? +- 20 years?

I think most applications' needs are likely asymmetric, and my gut feeling is +10/-30 years. That'd cover the nineties, and therefore the whole mainstream digital/internet era. I also can't imagine most applications needing to project more than 10 years into the future.

darrenldl commented 2 years ago

Hey @glennsl I've adjusted the generator pipeline and shrunk the year range to 1990 to 2040; tzdb_compressed.ml is now at 242K, and the compiled size is 95K if I'm reading it right (I'm looking at tzdb_full.cmxs - is that correct?)

glennsl commented 2 years ago

Hey. I'm not sure how I'd be able to confirm those numbers independently, but it certainly seems plausible, as I would expect most of the space to be taken up by irregularities, and most irregularities to occur in early history.

I wonder how much it would actually increase if you include more recent history. As long as the pattern remains the same, it would just add a few more entries to the L2 lookup table. Would it be easy to check how much difference going back to 1970 would make, for example?

darrenldl commented 2 years ago

Yeah I streamlined the year range specification, so you can do the adjustment in the Makefile gen section

.PHONY: gen
gen :
        cd gen/ && dune build gen_time_zone_data.exe
        dune exec gen/gen_time_zone_data.exe -- 1990 2040 full

Do a make gen to run and parse the zdump output, then make to make sure everything's rebuilt properly.

Finally do ls -lh _build/default/tzdb-full to check the size (I believe that's the right place):

(The order here is different since I was doing the ls outside of the container)

$ ls -lh _build/default/tzdb-full/
total 532K
-r--r--r-- 1 darren darren  381 Mar  5 02:33 dune
-r--r--r-- 1 darren darren  84K Mar  6 00:48 timedesc_tzdb_full.a
-r--r--r-- 1 darren darren  89K Mar  6 00:48 timedesc_tzdb_full.cma
-r--r--r-- 1 darren darren  914 Mar  6 00:48 timedesc_tzdb_full.cmxa
-r-xr-xr-x 1 darren darren  95K Mar  6 00:48 timedesc_tzdb_full.cmxs
-r--r--r-- 1 darren darren  231 Mar  4 03:39 timedesc_tzdb.ml
-r--r--r-- 1 darren darren  104 Mar  4 03:39 timedesc_tzdb__timedesc_tzdb_full__.ml-gen
-r--r--r-- 1 darren darren 242K Mar  6 00:48 tzdb_compressed.ml

For 1850 - 2040, make gen && make && ls -lh _build/default/tzdb-full/ yields

total 848K
-r--r--r-- 1 root root  381 Mar  4 15:33 dune
-r--r--r-- 1 root root  231 Mar  3 16:39 timedesc_tzdb.ml
-r--r--r-- 1 root root  104 Mar  3 16:39 timedesc_tzdb__timedesc_tzdb_full__.ml-gen
-r--r--r-- 1 root root 134K Mar  6 13:15 timedesc_tzdb_full.a
-r--r--r-- 1 root root 139K Mar  6 13:15 timedesc_tzdb_full.cma
-r--r--r-- 1 root root  914 Mar  6 13:15 timedesc_tzdb_full.cmxa
-r-xr-xr-x 1 root root 145K Mar  6 13:15 timedesc_tzdb_full.cmxs
-r--r--r-- 1 root root 408K Mar  6 13:15 tzdb_compressed.ml

For 1850 - 2100, the command yields

total 1012K
-r--r--r-- 1 root root  381 Mar  4 15:33 dune
-r--r--r-- 1 root root  231 Mar  3 16:39 timedesc_tzdb.ml
-r--r--r-- 1 root root  104 Mar  3 16:39 timedesc_tzdb__timedesc_tzdb_full__.ml-gen
-r--r--r-- 1 root root 159K Mar  6 13:08 timedesc_tzdb_full.a
-r--r--r-- 1 root root 164K Mar  6 13:08 timedesc_tzdb_full.cma
-r--r--r-- 1 root root  914 Mar  6 13:08 timedesc_tzdb_full.cmxa
-r-xr-xr-x 1 root root 169K Mar  6 13:08 timedesc_tzdb_full.cmxs
-r--r--r-- 1 root root 498K Mar  6 13:08 tzdb_compressed.ml

For 1970 - 2040, the command yields

total 628K
-r--r--r-- 1 root root  381 Mar  4 15:33 dune
-r--r--r-- 1 root root  231 Mar  3 16:39 timedesc_tzdb.ml
-r--r--r-- 1 root root  104 Mar  3 16:39 timedesc_tzdb__timedesc_tzdb_full__.ml-gen
-r--r--r-- 1 root root  99K Mar  6 13:09 timedesc_tzdb_full.a
-r--r--r-- 1 root root 104K Mar  6 13:09 timedesc_tzdb_full.cma
-r--r--r-- 1 root root  914 Mar  6 13:09 timedesc_tzdb_full.cmxa
-r-xr-xr-x 1 root root 110K Mar  6 13:09 timedesc_tzdb_full.cmxs
-r--r--r-- 1 root root 293K Mar  6 13:09 tzdb_compressed.ml

For 1970 - 2100, the command yields

total 796K
-r--r--r-- 1 root root  381 Mar  4 15:33 dune
-r--r--r-- 1 root root  231 Mar  3 16:39 timedesc_tzdb.ml
-r--r--r-- 1 root root  104 Mar  3 16:39 timedesc_tzdb__timedesc_tzdb_full__.ml-gen
-r--r--r-- 1 root root 125K Mar  6 13:18 timedesc_tzdb_full.a
-r--r--r-- 1 root root 130K Mar  6 13:18 timedesc_tzdb_full.cma
-r--r--r-- 1 root root  914 Mar  6 13:18 timedesc_tzdb_full.cmxa
-r-xr-xr-x 1 root root 135K Mar  6 13:18 timedesc_tzdb_full.cmxs
-r--r--r-- 1 root root 384K Mar  6 13:18 tzdb_compressed.ml

For 1990 - 2040, the command yields

total 532K
-r--r--r-- 1 root root  381 Mar  4 15:33 dune
-r--r--r-- 1 root root  231 Mar  3 16:39 timedesc_tzdb.ml
-r--r--r-- 1 root root  104 Mar  3 16:39 timedesc_tzdb__timedesc_tzdb_full__.ml-gen
-r--r--r-- 1 root root  84K Mar  6 13:12 timedesc_tzdb_full.a
-r--r--r-- 1 root root  89K Mar  6 13:12 timedesc_tzdb_full.cma
-r--r--r-- 1 root root  914 Mar  6 13:12 timedesc_tzdb_full.cmxa
-r-xr-xr-x 1 root root  95K Mar  6 13:12 timedesc_tzdb_full.cmxs
-r--r--r-- 1 root root 242K Mar  6 13:12 tzdb_compressed.ml

For 1990 - 2100, the command yields

total 708K
-r--r--r-- 1 root root  381 Mar  4 15:33 dune
-r--r--r-- 1 root root  231 Mar  3 16:39 timedesc_tzdb.ml
-r--r--r-- 1 root root  104 Mar  3 16:39 timedesc_tzdb__timedesc_tzdb_full__.ml-gen
-r--r--r-- 1 root root 109K Mar  6 13:13 timedesc_tzdb_full.a
-r--r--r-- 1 root root 114K Mar  6 13:13 timedesc_tzdb_full.cma
-r--r--r-- 1 root root  914 Mar  6 13:13 timedesc_tzdb_full.cmxa
-r-xr-xr-x 1 root root 120K Mar  6 13:13 timedesc_tzdb_full.cmxs
-r--r--r-- 1 root root 341K Mar  6 13:13 tzdb_compressed.ml

glennsl commented 2 years ago

Hmm, interesting. The size increases with about 50% of the increase in time for 1990 vs 1970. Much bigger than I thought it would be. Compared to ~33% for 1990 vs 1850 and ~25% for 2040 vs 2100. Much better than 100% (or more) of course, but still quite expensive to do so just in case. Unless there are good reasons for extending the range I think 1990 is a good default.

Thanks for getting the numbers and explaining the procedure!

darrenldl commented 2 years ago

The size increases with about 50% of the increase in time for 1990 vs 1970

I don't think I follow this

glennsl commented 2 years ago

The number of years between 1990 and 2040 is 50, and between 1970 and 2040 is 70. That's a 40% increase in time covered. And the size for those ranges is 532K and 628K respectively, which is an 18% increase. Therefore the increase in size is about half that of the increase in time covered.

Hope that's easier to follow (and also free of mistakes!)

darrenldl commented 2 years ago

The total is all files combined though - we're only interested in comparing the sizes of one of the compiled object files or tzdb_compressed.ml.

So for say 1970 to 2040, my understanding is that it's roughly 110K when compiled.

glennsl commented 2 years ago

Ah yes, sorry. The difference is roughly the same though. For tzdb_compressed.ml it's 21%, and for the .a and .cmxs files it's 17%

darrenldl commented 2 years ago

I am leaning toward 1970 - 2040 for the final publish, what do you think?

I believe the tests cover everything we've added

lookup_record in desc/time_zone.ml is made more complicated for lazy loading:

let lookup_record name : record option =
  match M.find_opt name !db with
  | Some table ->
    (* already decompressed on an earlier lookup - serve from the cache *)
    assert (check_table table);
    Some (process_table table)
  | None ->
    match M.find_opt name compressed with
    | Some compressed_table ->
      (* first access: decompress the stored form, then memoise it in db *)
      let table =
        Compressed_table.of_string_exn compressed_table
      in
      assert (check_table table);
      db := M.add name table !db;
      Some (process_table table)
    | None -> None

but implicitly tested by the tzdb_make_all test.

If you don't spot anything missing I am going to finalise timedesc.0.7.0 and submit later.

glennsl commented 2 years ago

Sweet! Can't think of anything, no.

I am leaning toward 1970 - 2040 for the final publish, what do you think?

1970 is a very natural point in time that is likely to conform to users' expectations thanks to Unix time. I still think 1990 is sufficient for 99% of use cases though, and that it's relatively expensive to extend it to 1970 given that. Then again, we've already reduced it by quite a lot, so perhaps it's time to relax a bit :smile:

darrenldl commented 2 years ago

Completed with timedesc (and timedesc-tzlocal-js) 0.8.0

One additional adjustment since then is the removal of Marshal entirely, to make timedesc-tzdb a standalone package. The compressed form is stored entirely in the repo now, so construction of a marshalled tzdb through timedesc is no longer needed.