JuliaIO / Parquet.jl

Julia implementation of Parquet columnar file format reader

Inconsistent Results when Querying Rows #111

Closed Deduction42 closed 3 years ago

Deduction42 commented 3 years ago

I'm trying to read a Parquet file and select a range of rows. In the first case, I read rows 1 through 40000:

cursor = BatchedColumnsCursor(parFile, rows=1:40000, reusebuffer=false, use_threads=false)
DataFrame.( collect(cursor) )[1].Date_Time
40000-element Array{Union{Missing, DateTime},1}:
 2020-10-29T01:16:34
 2020-10-29T01:16:35
 2020-10-29T01:16:36
 2020-10-29T01:16:37
 2020-10-29T01:16:38
 2020-10-29T01:16:39
 2020-10-29T01:16:40
 2020-10-29T01:16:41
 ⋮
 2020-10-29T12:23:06
 2020-10-29T12:23:07
 2020-10-29T12:23:08
 2020-10-29T12:23:09
 2020-10-29T12:23:10
 2020-10-29T12:23:11
 2020-10-29T12:23:12
 2020-10-29T12:23:13

In the second case, I start at row 20000 and read through row 40000:

cursor = BatchedColumnsCursor(parFile, rows=20000:40000, reusebuffer=false, use_threads=false)
DataFrame.( collect(cursor) )[1].Date_Time
20001-element Array{Union{Missing, DateTime},1}:
 2020-10-29T01:16:35
 2020-10-29T01:16:36
 2020-10-29T01:16:37
 2020-10-29T01:16:38
 2020-10-29T01:16:39
 2020-10-29T01:16:40
 2020-10-29T01:16:41
 2020-10-29T01:16:42
 ⋮
 2020-10-29T06:49:48
 2020-10-29T06:49:49
 2020-10-29T06:49:50
 2020-10-29T06:49:51
 2020-10-29T06:49:52
 2020-10-29T06:49:53
 2020-10-29T06:49:54
 2020-10-29T06:49:55

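The original file has timestamps in ascending order. The 20000:40000 read starts at almost the same place as the 1:40000 read (2020-10-29T01:16:35 versus 2020-10-29T01:16:34), yet the two reads end at entirely different timestamps (06:49:55 versus 12:23:13). If the rows are one second apart throughout, as the visible values suggest, the second result looks like rows 2 through 20002 rather than rows 20000 through 40000.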
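
One way to make the mismatch explicit is to compare the range read against the matching slice of the full read. This is only a sketch: `read_dates` and `ranges_agree` are hypothetical helper names, and it assumes `parFile` is the already-opened file used in the snippets above and that the first batch holds all requested rows, as it does in the output shown.

```julia
using Parquet, DataFrames

# Hypothetical helper: read the Date_Time column of the first batch
# for a given row range, using the same cursor options as above.
function read_dates(parFile, rows)
    cursor = BatchedColumnsCursor(parFile, rows=rows, reusebuffer=false, use_threads=false)
    return DataFrame.(collect(cursor))[1].Date_Time
end

# Hypothetical check: the 20000:40000 read should equal the same
# slice taken from the 1:40000 read if row selection is consistent.
function ranges_agree(parFile)
    full = read_dates(parFile, 1:40000)
    part = read_dates(parFile, 20000:40000)
    # isequal treats missing values as equal to each other
    return isequal(part, full[20000:40000])
end
```

I would expect `ranges_agree(parFile)` to return `true`; with the output above it would not.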