evetion / GeoDataFrames.jl

Simple geographical vector interaction built on top of ArchGDAL
https://www.evetion.nl/GeoDataFrames.jl/dev/
MIT License
68 stars 6 forks source link

!!!! erroneous table mutation when adding a large column !!!! #86

Open alex-s-gardner opened 4 days ago

alex-s-gardner commented 4 days ago

This one caught me off guard. Large tables seem to be unsafe when manipulating this example geo parquet file (using GeoDataFrames v0.3.10 with Julia v"1.11.1"):

https://drive.google.com/file/d/1FJUbk_Smj3VoMhGeR790AtEEaEwZPiFY/view?usp=sharing

In this case adding a new column with 'vector length = 100' does not modify existing columns

using GeoDataFrames
vector_length = 100
test = GeoDataFrames.read("test.parquet")
foo1 = copy(test.RiverIDTrace[1])

test[!, :new_column] = [zeros(vector_length ) for r in 1:nrow(test)]#
foo2 = copy(test.RiverIDTrace[1])
println(foo1 == foo2)
true

adding a large vector, 'vector length = 1000000', DOES MODIFY existing columns

using GeoDataFrames
vector_length = 1000000
test = GeoDataFrames.read("test.parquet")
foo1 = copy(test.RiverIDTrace[1])

test[!, :new_column] = [zeros(vector_length ) for r in 1:nrow(test)]#
foo2 = copy(test.RiverIDTrace[1])
println(foo1 == foo2)
false

adding deepcopy fixes the problem in this instance but after more testing deepcopy does not work in all cases

using GeoDataFrames
vector_length = 1000000
test = deepcopy(GeoDataFrames.read("test.parquet"))
foo1 = copy(test.RiverIDTrace[1])

test[!, :new_column] = [zeros(vector_length ) for r in 1:nrow(test)]#
foo2 = copy(test.RiverIDTrace[1])
println(foo1 == foo2)
true
evetion commented 4 days ago

That's odd. Can you test a similar thing with just DataFrames? Since your adding of the new column is a DataFrame specific thing, GeoDataFrames doesn't play a role there anymore.

evetion commented 4 days ago

The shared file has been deleted, so I don't know what RiverIDTrace is? Is it also a column of vectors? Can you do an elementwise comparison, and check why foo1 and foo2 are not equal anymore?

Worst case, you generate a lot of memory pressure with your vector of vectors, and something is garbage collected. Then again, you also say deepcopy doesn't work, so something else is happening (or deepcopy on DataFrames is not correctly implemented).

asinghvi17 commented 3 days ago

RiverIDTrace is a column of Vector{Int}, yes. The values inside were randomly overwritten with zeros, seemingly no correlation to row number. (I don't currently have access to the file but saw the bug being reproduced)

alex-s-gardner commented 3 days ago

@evetion @asinghvi17 please see updated path to file. I am working on a DataFrames only replication of the issue but have not succeeded yet. I will keep working at it.

asinghvi17 commented 3 days ago

I tried a similar thing in pure DataFrames, and it does not pose an issue there:

MWE ```julia using DataFrames df = DataFrame(); int_data = [rand(Int, rand(1:1_000)) for i in 1:215_000] df.col1 = deepcopy(int_data) df.col2 = [zeros(1000000) for i in 1:size(df, 1)] all(df.col1 .== int_data) # true ```

https://github.com/yeesian/ArchGDAL.jl/blob/a322ce6eb8a811b6ec053608c95c385464214d92/src/ogr/feature.jl#L345 looks to be where int arrays are moved from GDAL to Julia.

It looks like this is an unsafe_wrap, but without own = true (which defaults to false). I'm now going to see if setting own = true changes anything here. Maybe the GDAL dataset scoping also contributes to this, but I don't imagine so...

evetion commented 3 days ago

yeesian/ArchGDAL.jl@a322ce6/src/ogr/feature.jl#L345 looks to be where int arrays are moved from GDAL to Julia.

It looks like this is an unsafe_wrap, but without own = true (which defaults to false). I'm now going to see if setting own = true changes anything here. Maybe the GDAL dataset scoping also contributes to this, but I don't imagine so...

Good catch! That's the culprit, and kudos for @alex-s-gardner for actually spotting it in real life (sorry for that). But you shouldn't own=true, as:

the field value. This list is internal, and should not be modified, or freed. Its lifetime may be very brief. If *pnCount is zero on return the returned pointer may be NULL or non-NULL. (https://gdal.org/en/latest/doxygen/classOGRFeature.html)

So the fix would be to at least copy the unsafe_wrap.

asinghvi17 commented 3 days ago

Ah, I missed that we were wrapping the pointer returned directly. Yeah in that case copy seems like the way to go.

Just for my satisfaction, and to document this, copy does make the array robust to any underlying mutation of the parent pointer's memory.


julia> A = rand(10)
10-element Vector{Float64}:
 0.8487556697809062
 0.4530489648028254
 0.7915015228101486
 0.5249905434671536
 0.011043884362292533
 0.8092927336663542
 0.2807079717859139
 0.5462812200563412
 0.7293837731721518
 0.8515677666121682

julia> Ap = pointer(A)
Ptr{Float64} @0x00000001ccca4f70

julia> Au = unsafe_wrap(Vector{Float64}, Ap, size(A))
10-element Vector{Float64}:
 0.8487556697809062
 0.4530489648028254
 0.7915015228101486
 0.5249905434671536
 0.011043884362292533
 0.8092927336663542
 0.2807079717859139
 0.5462812200563412
 0.7293837731721518
 0.8515677666121682

julia> Auc = copy(Au)
10-element Vector{Float64}:
 0.8487556697809062
 0.4530489648028254
 0.7915015228101486
 0.5249905434671536
 0.011043884362292533
 0.8092927336663542
 0.2807079717859139
 0.5462812200563412
 0.7293837731721518
 0.8515677666121682

julia> Auc[1] = 1
1

julia> A
10-element Vector{Float64}:
 0.8487556697809062
 0.4530489648028254
 0.7915015228101486
 0.5249905434671536
 0.011043884362292533
 0.8092927336663542
 0.2807079717859139
 0.5462812200563412
 0.7293837731721518
 0.8515677666121682

julia> Au
10-element Vector{Float64}:
 0.8487556697809062
 0.4530489648028254
 0.7915015228101486
 0.5249905434671536
 0.011043884362292533
 0.8092927336663542
 0.2807079717859139
 0.5462812200563412
 0.7293837731721518
 0.8515677666121682

julia> Auc
10-element Vector{Float64}:
 1.0
 0.4530489648028254
 0.7915015228101486
 0.5249905434671536
 0.011043884362292533
 0.8092927336663542
 0.2807079717859139
 0.5462812200563412
 0.7293837731721518
 0.8515677666121682

julia> Au[1] = 1
1

julia> Au
10-element Vector{Float64}:
 1.0
 0.4530489648028254
 0.7915015228101486
 0.5249905434671536
 0.011043884362292533
 0.8092927336663542
 0.2807079717859139
 0.5462812200563412
 0.7293837731721518
 0.8515677666121682

julia> A
10-element Vector{Float64}:
 1.0
 0.4530489648028254
 0.7915015228101486
 0.5249905434671536
 0.011043884362292533
 0.8092927336663542
 0.2807079717859139
 0.5462812200563412
 0.7293837731721518
 0.8515677666121682
alex-s-gardner commented 3 days ago

So just for my own edification, why did deepcopy not prevent this issue?

asinghvi17 commented 3 days ago

It could be that GDAL overwrote the memory before / during the deepcopy, so what deepcopy saw was already incorrect.