Make String more GC friendly

bkamins commented 3 years ago

I am opening this issue as a follow-up to a discussion on Slack not to loose track of it.

Rationale: in data-science workflows it is very common to have very large tables that hold columns that consist of many unique strings (e.g. product ID that is non-numeric character sequence).

In such cases the current design of combination of GC and String type cause us to create a lot of small strings. The effect is described e.g. here https://github.com/h2oai/db-benchmark/issues/210 (in these benchmarks the high-count string column is not taking part in any computations - it just sits there using-up memory and causing GC strain). The issue is especially apparent in multi-threading contexts (i.e. when the operation you want to do is parallelized and fast in general, but is paused by triggered GC collection cycles).

I think - given we want Julia to be fast in data science workflows - this issue critically needs to be resolved (it is apparent in H2O benchmarks, but I get this problem constantly reported by users of DataFrames.jl).

As this issue touches deep Julia Base internals, I am probably not the best person to decide what should be done (as there are for sure many considerations that have to be made before making a decision), but once the decision on what to do is made I can help implementing the changes (unless of course core devs would be willing to do them). Here is a list of options I can see (some of them might immediately make no sense for Julia core devs - in such case please comment, but I do not want to limit myself at this stage of thinking about the issue):

improve the "generational" aspect of GC (related: https://github.com/JuliaLang/julia/issues/40644)
have a special handling of String type in GC (related to the above, but we might e.g. decide to always treat String as very old; possibly this could be enabled/disabled by some run-time option)
have a run-time option to turn on/off String interning (thus fully disabling GC for them when interning is on) - this would have an additional benefit of faster comparisons at the expense of creation time
have a special representation of short strings that would be non-allocating (if you have very many strings most likely they are short)

In the mean time @quinnj is working on improving the handling of this issue on CSV.jl side (to avoid allocation of strings at all), but I think it is kind of a second-best and we should have a good solution in Julia Base.

bkamins commented 3 years ago

As a reference. What I am aiming to achieve is to have things like https://discourse.julialang.org/t/the-state-of-dataframes-jl-h2o-benchmark/43081/21 working with just Julia Base as much as possible.

JeffBezanson commented 3 years ago

have a special representation of short strings that would be non-allocating

If I understand the main issues, this would not fix it, since in a Vector{String} the GC would still have to look at every element to see which ones are references. It would save memory but not marking time. I can imagine fancy things like keeping a flag in the Vector to track whether every element is short, but that is probably too special-case and we'd be better off doing general card-marking.

bkamins commented 3 years ago

track whether every element is short

This is what I have imagined. However, I understand your point of not over-specializing code. Also saving memory will also help GC as it will occur less frequently.

Let me give an example of the pattern in which I want to ensure GC does not get super slow:

we have a large Vector{String}; for simplicity assume all its entries are unique; the vector has 10^9 elements; call this vector x
using this vector we create a large temporary object, e.g. d = Dict(x .=> axes(x, 1))
we do some operations using d but eventually we drop it
x is still alive, but d needs to be garbage collected as it is large and GC needs to reclaim the memory occupied by it

(this is not an artificial example - something like this happens e.g. in when doing joins to create left to right table mapping which is needed only temporarily, but can be large)

In such a scenario I would prefer to avoid high cost of GC due to the fact that we have 10^9 strings that potentially have to be marked (or at least to have to pay this cost very infrequently, and most of the time not have to pay it). Also - what I am advocating - is that if ensuring this would mean that String type gets a special treatment by GC then that the core Julia team could consider allowing for specially handling this case. Thank you!

bkamins commented 3 years ago

A related question is: if we have a collection of Symbol - would it be possible to avoid full GC scan of its entries (the same when eltype of the collection is e.g Union{Symbol, Missing})? The issue is that in such cases it is clear by design that marking is not required however we have (fresh Julia session):

julia> x = Symbol.(1:10^7);

julia> GC.gc(true);

julia> @time GC.gc(true);
  0.069542 seconds (99.98% gc time)

julia> @time GC.gc(true);
  0.071290 seconds (99.99% gc time)

which to my understanding shows that every element of x is investigated by GC.

JuliaLang / julia

Make String more GC friendly #40840