Open alex-s-gardner opened 4 days ago
That's odd. Can you test a similar thing with just DataFrames? Adding the new column is a DataFrames-specific operation, so GeoDataFrames no longer plays a role at that point.
The shared file has been deleted, so I don't know what RiverIDTrace is. Is it also a column of vectors? Can you do an elementwise comparison, and check why foo1 and foo2 are not equal anymore?
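For reference, a minimal sketch of such an elementwise comparison, assuming foo1 and foo2 are two snapshots of the same column of vectors. The values below are made up for illustration and are not taken from the shared file:

```julia
# Hypothetical stand-ins for the column before and after the new column is added.
foo1 = [[1, 2, 3], [4, 5], [6]]
foo2 = [[1, 2, 3], [0, 5], [6]]

# Rows whose vectors differ (broadcasting compares the inner vectors pairwise).
changed_rows = findall(foo1 .!= foo2)

# Within the first changed row, the positions whose values differ.
r = first(changed_rows)
changed_positions = findall(foo1[r] .!= foo2[r])

println("changed rows: ", changed_rows)
println("changed positions in row ", r, ": ", changed_positions)
```

Looking at which rows and positions changed, and what the new values are, should show whether the corruption follows a pattern.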
Worst case, you generate a lot of memory pressure with your vector of vectors, and something is garbage collected. Then again, you also say deepcopy doesn't work, so something else is happening (or deepcopy on DataFrames is not correctly implemented).
RiverIDTrace is a column of Vector{Int}, yes. The values inside were randomly overwritten with zeros, seemingly no correlation to row number. (I don't currently have access to the file but saw the bug being reproduced)
@evetion @asinghvi17 please see updated path to file. I am working on a DataFrames only replication of the issue but have not succeeded yet. I will keep working at it.
I tried a similar thing in pure DataFrames, and it does not pose an issue there:
https://github.com/yeesian/ArchGDAL.jl/blob/a322ce6eb8a811b6ec053608c95c385464214d92/src/ogr/feature.jl#L345 looks to be where int arrays are moved from GDAL to Julia.
It looks like this is an unsafe_wrap, but without own = true (which defaults to false). I'm now going to see if setting own = true changes anything here. Maybe the GDAL dataset scoping also contributes to this, but I don't imagine so...
Good catch! That's the culprit, and kudos to @alex-s-gardner for actually spotting it in real life (sorry for that). But you shouldn't set own=true, as:
the field value. This list is internal, and should not be modified, or freed. Its lifetime may be very brief. If *pnCount is zero on return the returned pointer may be NULL or non-NULL. (https://gdal.org/en/latest/doxygen/classOGRFeature.html)
So the fix would be to at least copy the unsafe_wrap.
Ah, I missed that we were wrapping the pointer returned directly. Yeah, in that case copy seems like the way to go.
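A minimal sketch of the copy-on-wrap idea, using only Base functions. The malloc'd buffer is a hypothetical stand-in for memory that a C library owns and may reuse; this is not the ArchGDAL or GDAL API:

```julia
# Hypothetical stand-in for a library-owned buffer (NOT the GDAL API).
n = 5
buf = convert(Ptr{Int32}, Libc.malloc(n * sizeof(Int32)))
for i in 1:n
    unsafe_store!(buf, Int32(i), i)
end

# Borrowed view: no data is copied, and with own = false (the default)
# Julia will not free the memory.
borrowed = unsafe_wrap(Vector{Int32}, buf, n)

# Snapshot: copy allocates Julia-owned memory, independent of the buffer.
snapshot = copy(borrowed)

# Simulate the library overwriting its internal buffer.
for i in 1:n
    unsafe_store!(buf, Int32(0), i)
end

# Reading through the borrowed view now yields the overwritten values.
borrowed_after = collect(borrowed)

# Release the buffer; `borrowed` must not be used after this point.
Libc.free(buf)

println("snapshot:       ", snapshot)
println("borrowed after: ", borrowed_after)
```

The snapshot keeps the original values because it no longer shares memory with the buffer, while the borrowed view reflects whatever the buffer holds at the time it is read.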
Just for my satisfaction, and to document this, copy does make the array robust to any underlying mutation of the parent pointer's memory.
julia> A = rand(10)
10-element Vector{Float64}:
0.8487556697809062
0.4530489648028254
0.7915015228101486
0.5249905434671536
0.011043884362292533
0.8092927336663542
0.2807079717859139
0.5462812200563412
0.7293837731721518
0.8515677666121682
julia> Ap = pointer(A)
Ptr{Float64} @0x00000001ccca4f70
julia> Au = unsafe_wrap(Vector{Float64}, Ap, size(A))
10-element Vector{Float64}:
0.8487556697809062
0.4530489648028254
0.7915015228101486
0.5249905434671536
0.011043884362292533
0.8092927336663542
0.2807079717859139
0.5462812200563412
0.7293837731721518
0.8515677666121682
julia> Auc = copy(Au)
10-element Vector{Float64}:
0.8487556697809062
0.4530489648028254
0.7915015228101486
0.5249905434671536
0.011043884362292533
0.8092927336663542
0.2807079717859139
0.5462812200563412
0.7293837731721518
0.8515677666121682
julia> Auc[1] = 1
1
julia> A
10-element Vector{Float64}:
0.8487556697809062
0.4530489648028254
0.7915015228101486
0.5249905434671536
0.011043884362292533
0.8092927336663542
0.2807079717859139
0.5462812200563412
0.7293837731721518
0.8515677666121682
julia> Au
10-element Vector{Float64}:
0.8487556697809062
0.4530489648028254
0.7915015228101486
0.5249905434671536
0.011043884362292533
0.8092927336663542
0.2807079717859139
0.5462812200563412
0.7293837731721518
0.8515677666121682
julia> Auc
10-element Vector{Float64}:
1.0
0.4530489648028254
0.7915015228101486
0.5249905434671536
0.011043884362292533
0.8092927336663542
0.2807079717859139
0.5462812200563412
0.7293837731721518
0.8515677666121682
julia> Au[1] = 1
1
julia> Au
10-element Vector{Float64}:
1.0
0.4530489648028254
0.7915015228101486
0.5249905434671536
0.011043884362292533
0.8092927336663542
0.2807079717859139
0.5462812200563412
0.7293837731721518
0.8515677666121682
julia> A
10-element Vector{Float64}:
1.0
0.4530489648028254
0.7915015228101486
0.5249905434671536
0.011043884362292533
0.8092927336663542
0.2807079717859139
0.5462812200563412
0.7293837731721518
0.8515677666121682
So just for my own edification, why did deepcopy not prevent this issue?
It could be that GDAL overwrote the memory before / during the deepcopy, so what deepcopy saw was already incorrect.
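A small sketch of that timing effect, again with a hypothetical malloc'd buffer standing in for library-owned memory (not the GDAL API). A copy taken before the buffer is reused keeps the original values, while a deepcopy taken afterwards faithfully duplicates the already-overwritten ones:

```julia
# Hypothetical stand-in for a library-owned buffer (NOT the GDAL API).
n = 4
buf = convert(Ptr{Int32}, Libc.malloc(n * sizeof(Int32)))
for i in 1:n
    unsafe_store!(buf, Int32(i), i)
end
borrowed = unsafe_wrap(Vector{Int32}, buf, n)

# Copy taken immediately, while the buffer still holds the original values.
early = copy(borrowed)

# Simulate the library reusing part of its buffer.
unsafe_store!(buf, Int32(0), 2)

# deepcopy taken afterwards duplicates whatever the buffer holds now.
late = deepcopy(borrowed)

# Release the buffer; `borrowed` must not be used after this point.
Libc.free(buf)

println("early copy:    ", early)
println("late deepcopy: ", late)
```

This only illustrates the ordering problem. Whether GDAL actually reused the buffer before the deepcopy in the reported case is not confirmed in this thread.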
This one caught me off guard. Large tables seem to be unsafe when manipulating this example geo parquet file (using GeoDataFrames v0.3.10 with Julia v"1.11.1"):
https://drive.google.com/file/d/1FJUbk_Smj3VoMhGeR790AtEEaEwZPiFY/view?usp=sharing
In this case adding a new column with 'vector length = 100' does not modify existing columns
adding a large vector, 'vector length = 1000000', DOES MODIFY existing columns
adding deepcopy fixes the problem in this instance, but after more testing deepcopy does not work in all cases