JuliaTime / TimeZones.jl

IANA time zone database access for the Julia programming language

Thread-safe time zone cache via ReentrantLock #345

Closed: omus closed this pull request 3 years ago

omus commented 3 years ago

Experimenting with an alternative approach to #344. Part of the reason for having the time zone cache is to avoid having to read from disk. The approach in #344 results in each thread having to load data from disk which I think is slower than this approach (performance tests to come).

Fixes #342
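As I understand it, the approach here boils down to the following sketch (hypothetical names, not the PR's actual code): a shared `Dict` guarded by a single `ReentrantLock`, so the expensive disk load happens at most once per key.

```julia
# Hypothetical sketch of a cache behind one ReentrantLock.
# `load_from_disk` is a stand-in for the real tzdata read.
const CACHE = Dict{String,Int}()
const CACHE_LOCK = ReentrantLock()

load_from_disk(name::String) = length(name)  # stand-in for the expensive read

function cached_lookup(name::String)
    lock(CACHE_LOCK) do
        # get! only invokes the loader on a cache miss
        get!(() -> load_from_disk(name), CACHE, name)
    end
end
```

Note that every lookup, hit or miss, acquires the lock in this sketch; whether that is acceptable is exactly the contention question discussed in this thread.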

codecov-commenter commented 3 years ago

Codecov Report

Merging #345 (8cf4295) into master (71178fe) will increase coverage by 0.01%. The diff coverage is 92.30%.

:exclamation: Current head 8cf4295 differs from pull request most recent head 6522f7a. Consider uploading reports for the commit 6522f7a to get more accurate results.

@@            Coverage Diff             @@
##           master     #345      +/-   ##
==========================================
+ Coverage   93.60%   93.61%   +0.01%     
==========================================
  Files          32       32              
  Lines        1532     1536       +4     
==========================================
+ Hits         1434     1438       +4     
  Misses         98       98              
Impacted Files Coverage Δ
src/types/timezone.jl 89.65% <92.30%> (+1.65%) :arrow_up:

Continue to review full report at Codecov.


NHDaly commented 3 years ago

@omus: Yeah, this seems like a valid approach as well.

The problem is that with this approach, only one task can access the cache at a time, which introduces a lot of contention. In particular, we may have as many as 64 threads running that are constructing TimeZones in parallel, and we don't want to have to synchronize on every access to the cache.

Using a Read-Write Lock here would probably work, if you wanted to go with a locking approach.


The idea in #344 is that each thread would have to build the cache once, but since there are only a small number of threads (note: OS threads, not Julia Tasks), this cache initialization cost should only have to be paid a small handful of times, probably 4-8 for most systems, and as many as 32 or 64 on very wide systems.

So there will be more warm-up cost than with a shared cache, but having separate caches gives the least contention. A read-write lock might be a happy medium, if we want to investigate that together.
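A minimal readers-writer lock along these lines could be sketched in Julia like this (a simplified illustration, not ReadWriteLocks.jl's implementation; among other things it admits writer starvation):

```julia
# Minimal readers-writer lock sketch: many concurrent readers, one exclusive
# writer. Simplified for illustration; writers can starve under heavy reads.
mutable struct RWLock
    m::ReentrantLock
    c::Threads.Condition
    readers::Int
    writer::Bool
    function RWLock()
        m = ReentrantLock()
        new(m, Threads.Condition(m), 0, false)
    end
end

function lock_read(rw::RWLock)
    lock(rw.m)
    while rw.writer              # wait out any active writer
        wait(rw.c)
    end
    rw.readers += 1
    unlock(rw.m)
end

function unlock_read(rw::RWLock)
    lock(rw.m)
    rw.readers -= 1
    rw.readers == 0 && notify(rw.c)  # last reader wakes pending writers
    unlock(rw.m)
end

function lock_write(rw::RWLock)
    lock(rw.m)
    while rw.writer || rw.readers > 0
        wait(rw.c)
    end
    rw.writer = true
    unlock(rw.m)
end

function unlock_write(rw::RWLock)
    lock(rw.m)
    rw.writer = false
    notify(rw.c)
    unlock(rw.m)
end
```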

NHDaly commented 3 years ago

Lemme copy over the discussion from our internal (closed source) repo:

We talked about using our internal SynchronizedCache (which is basically just a wrapper around a dict and a ReentrantLock, like you're doing here), and @comnik pointed out the issue I mentioned above.

@rai-nhdaly (this is my work account):

I think we should:

  1. Report the issue on TimeZones.jl, and mention that we have a SynchronizedCache that is perfect for this, which we're in the middle of open sourcing.
  2. Open source our multithreading utilities, which wouldn't be very hard; I estimate it might take about a day.
  3. Send a PR to fix this issue if they haven't already, by switching their dict to a SynchronizedCache.

@comnik:

@rai-nhdaly after a discussion with Cliff and Sacha about related issues, I think SynchronizedCache is not well suited to this use case, because it synchronizes readers as well. This looks to be something that we're doing for each row in a file during ingest, so acquiring a lock for something that (I think) will virtually always be a successful lookup of the cached timezone seems a bit prohibitive.

What they want is rather something like Java's concurrent map. Something like that doesn't yet seem to exist for Julia, right?

@rai-nhdaly:

Mmm I see. The other pattern that we've used for caches like this is to just replicate via thread-local caches, so that you have one cache per thread, so that you never need to lock. That's probably the best thing to do for a situation like this?

I don't know if anyone has that pattern encapsulated into a data structure, but it's quite similar to the pattern we used for our thread-local caches in Salsa (though those ended up being fairly complicated as we added more details that they probably don't need).
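The thread-local pattern described above could be sketched like this (hypothetical names; note that on recent Julia versions tasks can migrate between threads mid-execution, so `threadid()`-indexed caches need extra care):

```julia
using Base.Threads

# One cache per OS thread, indexed by threadid(); no lock is needed because
# each thread only ever touches its own Dict. `expensive_compute` stands in
# for the real work (e.g. loading time zone data from disk).
const THREAD_CACHES = [Dict{String,Int}() for _ in 1:nthreads()]

expensive_compute(name::String) = length(name)  # stand-in for the real load

function cached_compute(name::String)
    cache = THREAD_CACHES[threadid()]
    get!(() -> expensive_compute(name), cache, name)
end
```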

What does Java's concurrent map do that would be better in this situation? Does it somehow avoid locking on accesses?

@comnik:

Ah, the thread-local option is a good idea, thanks!

I'm not sure about the details, but concurrent collections in Java land do some combination of locking at a higher granularity, read-write locks, or granular atomic operations. So most of the time readers would indeed not block, unless they're trying to read a key that is being written.

Our approach with a single lock per data structure is referred to as "synchronized" collections, because any concurrent uses (read or write) are synchronized.

@rai-nhdaly:

Gotcha, cool! It'd be nice to switch SynchronizedCache to use read/write locks!

Invenia has one we could look at and potentially use: https://github.com/invenia/ReadWriteLocks.jl - apparently it's a translation from the java source for read/write locks, which is encouraging!

EDIT: It's not actually registered, so it might not be very robust... but we could potentially help with that i guess... EDIT AGAIN: But reading through the source it seems quite straightforward!

@rai-nhdaly:

I have filed an issue on the TimeZones.jl repo with our suggested fix. JuliaTime/TimeZones.jl#342

NHDaly commented 3 years ago

Unfortunately we still haven't gotten around to open-sourcing our multithreading utilities, which we really should do :(

omus commented 3 years ago

The problem is that with this approach, only one task can access the cache at a time, which introduces a lot of contention

@NHDaly please correct me if I am wrong but the way I think I've set this up should only result in lock contention in the case of a cache miss. In the case of a cache hit no locking is required.

NHDaly commented 3 years ago

Ah, I hadn't actually read through your code yet 😅

I don't think that will work, since concurrent reads and writes of the dictionary itself also need to be locked: if e.g. someone is in the middle of looking up a value in the dictionary when someone else inserts a new value into the dictionary, and that new insert causes the hash table to be resized, they may end up reading garbage and/or it could crash.

All accesses to the dictionary need to be synchronized, including cache reads.

If we used a read/write lock, then you could do something more like what you have here, where at least all of the reads could proceed in parallel until there is a pending write.

Does that match your understanding?

omus commented 3 years ago

I did a bunch of experimentation here to cover the essentials. One promising avenue was a readers-writer lock, which allows concurrent read-only access and exclusive write access. I implemented both versions mentioned on the Wikipedia page and found the main bottleneck to be using locks at all for concurrent readers (which should be the most common case). To avoid that problem I tried an atomics-only implementation, which was faster (though there still seems to be a race condition in my implementation) but not quite as fast as #344.
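For reference, an atomics-only readers-writer lock might look roughly like the following (a spinning sketch using Julia 1.7+ `@atomic` fields, hypothetical and not the exact implementation benchmarked here):

```julia
# Spinning readers-writer lock using only atomics (sketch, not production
# code). state >= 0 counts active readers; state == -1 means a writer holds
# the lock. Writers can starve if readers keep arriving.
mutable struct AtomicRWLock
    @atomic state::Int
end
AtomicRWLock() = AtomicRWLock(0)

function read_lock!(l::AtomicRWLock)
    while true
        s = @atomic l.state
        if s >= 0
            # try to bump the reader count; retry if another thread raced us
            _, ok = (@atomicreplace l.state s => s + 1)
            ok && return
        end
        GC.safepoint()  # spin; a real implementation would back off
    end
end
read_unlock!(l::AtomicRWLock) = (@atomic l.state -= 1)

function write_lock!(l::AtomicRWLock)
    while true
        _, ok = (@atomicreplace l.state 0 => -1)  # acquire only when idle
        ok && return
        GC.safepoint()
    end
end
write_unlock!(l::AtomicRWLock) = (@atomic l.state = 0)
```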

Here are some benchmark results:

using Base.Threads, TimeZones, BenchmarkTools

function clear_cache()
    if isdefined(TimeZones, :_tz_cache_init)
        TimeZones._tz_cache_init()
    else
        empty!(TimeZones.TIME_ZONE_CACHE)
    end
end

function f(v)
    @threads for i in 1:length(v)
        v[i] = TimeZone("America/Winnipeg")
    end
    return v
end

function g(v)
    names = timezone_names()
    @threads for i in 1:length(v)
        name = rand(names)
        tz = TimeZone(name, TimeZones.Class(:ALL))
        @assert tz.name == name
        v[i] = tz
    end
    return v
end

ReadersWriterLock implementation using only atomics:

julia> clear_cache(); @btime f($(Vector{TimeZone}(undef, 100)));
  14.000 μs (320 allocations: 12.52 KiB)

julia> clear_cache(); @btime f($(Vector{TimeZone}(undef, 1000)));
  101.541 μs (3510 allocations: 118.62 KiB)

julia> clear_cache(); @btime f($(Vector{TimeZone}(undef, 10000)));
  1.061 ms (39510 allocations: 1.21 MiB)

julia> clear_cache(); @btime g($(Vector{TimeZone}(undef, 100)));
  2.256 ms (7266 allocations: 515.02 KiB)

julia> clear_cache(); @btime g($(Vector{TimeZone}(undef, 1000)));
  2.335 ms (11355 allocations: 658.98 KiB)

julia> clear_cache(); @btime g($(Vector{TimeZone}(undef, 1000)));  # race-condition

Implementation from #344:

julia> clear_cache(); @btime f($(Vector{TimeZone}(undef, 100)));
  7.625 μs (321 allocations: 12.55 KiB)

julia> clear_cache(); @btime f($(Vector{TimeZone}(undef, 1000)));
  30.500 μs (3509 allocations: 118.59 KiB)

julia> clear_cache(); @btime f($(Vector{TimeZone}(undef, 10000)));
  273.083 μs (39510 allocations: 1.21 MiB)

julia> clear_cache(); @btime g($(Vector{TimeZone}(undef, 100)));
  2.245 ms (7266 allocations: 514.94 KiB)

julia> clear_cache(); @btime g($(Vector{TimeZone}(undef, 1000)));
  2.340 ms (11355 allocations: 659.00 KiB)

julia> clear_cache(); @btime g($(Vector{TimeZone}(undef, 10000)));
  3.092 ms (56355 allocations: 2.12 MiB)

NHDaly commented 3 years ago

Thanks Curtis! Great analysis. In #344, we're trading off time for space, but I think for a small, bounded-size cache like this one, that's a reasonable tradeoff. 👍