JuliaStats / NullableArrays.jl

DEPRECATED Prototype of the new JuliaStats NullableArrays package
Other
35 stars 21 forks source link

`copy` and `deepcopy` do not copy the `parent` field #199

Closed spurll closed 7 years ago

spurll commented 7 years ago

I noticed this issue when sorting a DataTable that contained a column of type NullableArray{WeakRefString{UInt8}}. Ultimately sort calls copy, and after the smoke cleared I was left with WeakRefStrings that would occasionally become gibberish because garbage collection happened and there were no longer any open references to the block of data used by the WeakRefStrings.

I'm putting together a PR that I hope will fix the issue with copy, but the issues with deepcopy (if you'll excuse the expression) run deeper.

As the parent field is really just a reference, we don't want a deep copy of it. Currently, if you deepcopy, you just get an empty parent, which won't prevent garbage collection of any unsafe references (like WeakRefStrings). If we wanted deepcopy to work in this case, it might end up looking something like this:

function Base.deepcopy{T, N}(X::NullableArray{T, N})
    return NullableArray{T, N}(deepcopy(X.values), deepcopy(X.isnull), X.parent)
end

In addition to not really being a deep copy, this has the unfortunate consequence that if you're copying a NullableArray that doesn't care about its parent field (which seems to be most of them), you end up with an original and a copy that share a reference to a single parent (which is just going to be an empty Vector{UInt8}). It's maybe not a really big deal, but it certainly seems wrong.

So my plan is to attempt to fix copy, and to leave deepcopy alone. I don't know if the long-term plan for NullableArrays is to keep parent around (my understanding is that it is not), but the fact that it isn't being copied is causing problems for us at the moment.

spurll commented 7 years ago

Here's an illustration of the issue that I'm talking about:

julia> str = "something"
"something"

julia> arr = [WeakRefString(pointer(str, 1), 4, 1), WeakRefString(pointer(str, 5), 5, 5)]
2-element Array{WeakRefString{UInt8},1}:
 "some" 
 "thing"

julia> na = NullableArray{WeakRefString{UInt8}, 1}(arr, [false, false], Vector{UInt8}(str))
2-element NullableArrays.NullableArray{WeakRefString{UInt8},1}:
 "some" 
 "thing"

julia> scp = copy(na)
2-element NullableArrays.NullableArray{WeakRefString{UInt8},1}:
 "some" 
 "thing"

julia> na.parent
9-element Array{UInt8,1}:
 0x73
 0x6f
 0x6d
 0x65
 0x74
 0x68
 0x69
 0x6e
 0x67

julia> scp.parent
0-element Array{UInt8,1}

Once we no longer have the original na (and str), the WeakRefStrings contained in the copy scp will be in danger of garbage collection.

The fix I'm testing replaces the existing copy method with:

function Base.copy{T, N}(X::NullableArray{T, N})
    return NullableArray{T, N}(copy(X.values), copy(X.isnull), X.parent)
end

With the fix, we can see that copying a NullableArray now results in a copy that has a parent field. Additionally, we can see that the parent field references the same memory block as the original.

julia> str = "something"
"something"

julia> arr = [WeakRefString(pointer(str, 1), 4, 1), WeakRefString(pointer(str, 5), 5, 5)]
2-element Array{WeakRefString{UInt8},1}:
 "some" 
 "thing"

julia> na = NullableArray{WeakRefString{UInt8}, 1}(arr, [false, false], Vector{UInt8}(str))
2-element NullableArrays.NullableArray{WeakRefString{UInt8},1}:
 "some" 
 "thing"

julia> scp = copy(na)
2-element NullableArrays.NullableArray{WeakRefString{UInt8},1}:
 "some" 
 "thing"

julia> na.parent
9-element Array{UInt8,1}:
 0x73
 0x6f
 0x6d
 0x65
 0x74
 0x68
 0x69
 0x6e
 0x67

julia> scp.parent
9-element Array{UInt8,1}:
 0x73
 0x6f
 0x6d
 0x65
 0x74
 0x68
 0x69
 0x6e
 0x67

julia> pointer(na.parent)
Ptr{UInt8} @0x0000000116aa9f98

julia> pointer(scp.parent)
Ptr{UInt8} @0x0000000116aa9f98

julia> pointer(na.values)
Ptr{WeakRefString{UInt8}} @0x0000000109baccd0

julia> pointer(scp.values)
Ptr{WeakRefString{UInt8}} @0x0000000116ac8d50

I'm running some tests now to ensure that this change doesn't have any unexpected repercussions.

spurll commented 7 years ago

getindex also fails to maintain the parent field:

julia> str = "something"
"something"

julia> arr = [WeakRefString(pointer(str, 1), 4, 1), WeakRefString(pointer(str, 5), 5, 5)]
2-element Array{WeakRefString{UInt8},1}:
 "some" 
 "thing"

julia> na = NullableArray{WeakRefString{UInt8}, 1}(arr, [false, false], Vector{UInt8}(str))
2-element NullableArrays.NullableArray{WeakRefString{UInt8},1}:
 "some" 
 "thing"

julia> na.parent
9-element Array{UInt8,1}:
 0x73
 0x6f
 0x6d
 0x65
 0x74
 0x68
 0x69
 0x6e
 0x67

julia> na[:].parent
0-element Array{UInt8,1}

The PR I'm working on also solves this problem. All existing tests still pass on 0.5 and 0.6 (as well as the new ones I've added). Just making sure that I haven't missed anything else important, and that the changes don't interfere with anything else.