apache / arrow-julia

Official Julia implementation of Apache Arrow
https://arrow.apache.org/julia/

Reading multiple files corrupts values and is also order dependent #534

Closed Moelf closed 3 days ago

Moelf commented 1 week ago

using Arrow

# Two disjoint groups of files from the same directory
flist1 = filter(contains("physics_TLA"), readdir("/data/jiling/TLA/julia_arrows/"; join=true))
flist2 = filter(contains("mRp20"), readdir("/data/jiling/TLA/julia_arrows/"; join=true))

Arrow.Table(flist1).proc |> unique
# 1-element Vector{Bool}:
# 0

Arrow.Table(flist2).proc |> unique
# 1-element Vector{Bool}:
# 1

length(Arrow.Table([flist1;]).proc), length(Arrow.Table([flist2;]).proc)
# (3000521, 10077)

length(Arrow.Table([flist1; flist2]).proc)
# 3010598

Arrow.Table([flist1; flist2]).proc |> unique
# 1-element Vector{Bool}:
# 1

Arrow.Table([flist2; flist1]).proc |> unique
# 1-element Vector{Bool}:
# 0

quinnj commented 1 week ago

Can you describe the issue in more detail, or provide a simpler, clearer way to reproduce it? It seems to me that something goes wrong when you read just a single file as an "array of files". Is that right?

Moelf commented 1 week ago

Yes, it looks like the later file overrides the Boolean values from earlier files. I suspect some kind of sentinel-value merge is responsible.

I will provide two sample files today.

Moelf commented 1 week ago

Here's the file to reproduce: https://drive.proton.me/urls/1BP55Z7XDR#GSBCZ9hlUijm

julia> Arrow.Table(["d2.feather", "d1.feather"]).proc |> unique
1-element Vector{Bool}:
 0

julia> Arrow.Table(["d2.feather", "d1.feather"]).proc |> unique
2-element Vector{Bool}:
 1
 0

The result is nondeterministic: the same call intermittently yields the wrong result.
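
For readers without the sample files, the same check can be sketched on synthetic data. This is only a sketch: it assumes that a constant Bool column is enough to trigger the behavior, which this thread does not confirm, and the file names and row counts are hypothetical. Because the failure is intermittent, it may need to be run several times.

```julia
using Arrow

# Hypothetical file names and row counts, chosen only for illustration.
dir = mktempdir()
f1 = joinpath(dir, "all_false.feather")
f2 = joinpath(dir, "all_true.feather")

# Each file holds a single Bool column `proc` with one constant value.
Arrow.write(f1, (proc = fill(false, 1_000),))
Arrow.write(f2, (proc = fill(true, 10),))

# Read each file on its own, then both files together in either order.
single1 = collect(Arrow.Table(f1).proc)
single2 = collect(Arrow.Table(f2).proc)
both12 = collect(Arrow.Table([f1, f2]).proc)
both21 = collect(Arrow.Table([f2, f1]).proc)

# A multi-file read is expected to equal the per-file columns concatenated
# in the order given; `false` here would indicate the problem described above.
println("f1 then f2 matches concatenation: ", both12 == vcat(single1, single2))
println("f2 then f1 matches concatenation: ", both21 == vcat(single2, single1))
```

Comparing against `vcat` of the single-file reads checks every row, not just the set of unique values, so it would also catch a case where only part of a column is overwritten.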

Moelf commented 4 days ago

bump

quinnj commented 4 days ago

Ok, I believe https://github.com/apache/arrow-julia/pull/535 should fix this.