Open bkamins opened 3 years ago
As a reference. What I am aiming to achieve is to have things like https://discourse.julialang.org/t/the-state-of-dataframes-jl-h2o-benchmark/43081/21 working with just Julia Base as much as possible.
have a special representation of short strings that would be non-allocating
If I understand the main issues, this would not fix it, since in a Vector{String}
the GC would still have to look at every element to see which ones are references. It would save memory but not marking time. I can imagine fancy things like keeping a flag in the Vector to track whether every element is short, but that is probably too special-case and we'd be better off doing general card-marking.
track whether every element is short
This is what I have imagined. However, I understand your point of not over-specializing code. Also saving memory will also help GC as it will occur less frequently.
Let me give an example of the pattern in which I want to ensure GC does not get super slow:
Vector{String}
; for simplicity assume all its entries are unique; the vector has 10^9 elements; call this vector x
d = Dict(x .=> axes(x, 1))
d
but eventually we drop itx
is still alive, but d
needs to be garbage collected as it is large and GC needs to reclaim the memory occupied by it(this is not an artificial example - something like this happens e.g. in when doing joins to create left to right table mapping which is needed only temporarily, but can be large)
In such a scenario I would prefer to avoid high cost of GC due to the fact that we have 10^9 strings that potentially have to be marked (or at least to have to pay this cost very infrequently, and most of the time not have to pay it).
Also - what I am advocating - is that if ensuring this would mean that String
type gets a special treatment by GC then that the core Julia team could consider allowing for specially handling this case. Thank you!
A related question is: if we have a collection of Symbol
- would it be possible to avoid full GC scan of its entries (the same when eltype
of the collection is e.g Union{Symbol, Missing}
)? The issue is that in such cases it is clear by design that marking is not required however we have (fresh Julia session):
julia> x = Symbol.(1:10^7);
julia> GC.gc(true);
julia> @time GC.gc(true);
0.069542 seconds (99.98% gc time)
julia> @time GC.gc(true);
0.071290 seconds (99.99% gc time)
which to my understanding shows that every element of x
is investigated by GC.
I am opening this issue as a follow-up to a discussion on Slack not to loose track of it.
Rationale: in data-science workflows it is very common to have very large tables that hold columns that consist of many unique strings (e.g. product ID that is non-numeric character sequence).
In such cases the current design of combination of GC and
String
type cause us to create a lot of small strings. The effect is described e.g. here https://github.com/h2oai/db-benchmark/issues/210 (in these benchmarks the high-count string column is not taking part in any computations - it just sits there using-up memory and causing GC strain). The issue is especially apparent in multi-threading contexts (i.e. when the operation you want to do is parallelized and fast in general, but is paused by triggered GC collection cycles).I think - given we want Julia to be fast in data science workflows - this issue critically needs to be resolved (it is apparent in H2O benchmarks, but I get this problem constantly reported by users of DataFrames.jl).
As this issue touches deep Julia Base internals, I am probably not the best person to decide what should be done (as there are for sure many considerations that have to be made before making a decision), but once the decision on what to do is made I can help implementing the changes (unless of course core devs would be willing to do them). Here is a list of options I can see (some of them might immediately make no sense for Julia core devs - in such case please comment, but I do not want to limit myself at this stage of thinking about the issue):
String
type in GC (related to the above, but we might e.g. decide to always treatString
as very old; possibly this could be enabled/disabled by some run-time option)String
interning (thus fully disabling GC for them when interning is on) - this would have an additional benefit of faster comparisons at the expense of creation timeIn the mean time @quinnj is working on improving the handling of this issue on CSV.jl side (to avoid allocation of strings at all), but I think it is kind of a second-best and we should have a good solution in Julia Base.