JuliaTime / TimeZones.jl

IANA time zone database access for the Julia programming language
Other
86 stars 51 forks source link

Reduce Unicode CLDR artifact size #373

Closed jaakkor2 closed 3 months ago

jaakkor2 commented 2 years ago

On my local drive size of .julia\artifacts\8d7201f06ff8061c1b5bff0f661b5297ec1d3158\cldr-release-40 is 318 MB.

If I searched code correctly, only one 49 kB file is actually used https://github.com/unicode-org/cldr/blob/release-40/common/supplemental/windowsZones.xml here https://github.com/JuliaTime/TimeZones.jl/blob/master/src/winzone/WindowsTimeZoneIDs.jl

Could the artifact point directly to the windowsZones.xml file?

jaakkor2 commented 1 year ago

..\.julia\artifacts\768b8b82edd878449348f31bf989867bb13e42a9\cldr-release-43 folder size is 335 MB, and this package is using one file of it (IIUC) ..\.julia\artifacts\768b8b82edd878449348f31bf989867bb13e42a9\cldr-release-43\common\supplemental\windowsZones.xml has the size of 49 kB.

omus commented 1 year ago

You are correct that TimeZones.jl only requires the windowsZones.xml file from the CLDR artifacts. This is an unfortunate side effect of this package switching to use Julia artifacts which requires an archive to be referenced and not just a flat file.

In order to address this problem with the artifacts system we'd need to generate our own permanent artifact archive for each CLDR release. Probably the right approach would be to have these artifacts as GitHub release assets which follows what JLLs do. It also probably makes sense to have this not be a part of the TimeZones.jl repo but probably a CLDRWindowsZones_jll repo so that the tags/releases make more sense. Additionally, it may make more sense to make a CLDR_jll.jl and have just the windowsZones.xml included in a special release asset so we can retrieve it separately.

The end result is having a much smaller artifact footprint as well as being able to update CLDR versions without having to make a new TimeZones.jl release. Unfortunately, this is a bunch of work but I do want to tackle this before or around JuliaCon 2023.

omus commented 11 months ago

The PR https://github.com/JuliaTime/TimeZones.jl/pull/439 changed the Unicode CLDR artifacts to no longer be non-lazy and platform specific. Although this doesn't impact the download size on Windows machines it does limit the downloads to only be on Windows.

It still makes sense to make separate data package with this information akin to what was done in #441. I do want to have #441 be out in the wild for a little while to ensure this artifact model works well.

visr commented 11 months ago

Besides the artifact size, the many nested directories also can cause issues with the maximum file path on Windows.

In this case using PackageCompiler to create a library with the cldr artifact led to an error like:

ERROR: LoadError: IOError: open("libribasim\\share\\julia\\artifacts\\40b35727ea0aff9a9f28b7454004b68849caf67b\\cldr-release-43-1\\tools\\cldr-code\\src\\main\\resources\\org\\unicode\\cldr\\util\\data\\transforms\\internal_English-IPA-backwards.txt", 769, 33060): no such file or directory (ENOENT)

Enabling long paths in the Windows Registry fixed the issue.

https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation

visr commented 9 months ago

It still makes sense to make separate data package with this information akin to what was done in https://github.com/JuliaTime/TimeZones.jl/pull/441.

@omus if I understand correctly windowsZones.xml is only used to fill const WINDOWS_TRANSLATION = Dict{String, String}(). Would you consider a PR that switches from the cldr artifact to an auto-generated Dict literal? It's only about 140 entries.

I had a look at the tzlocal Python package and saw they do the same. This is their dict literal, which they generate from this script. The dict lookup is done here after reading the registry (rather than tzinfo).