JuliaIO / Parquet.jl

Julia implementation of Parquet columnar file format reader

Timestamp columns are not supported. #49

Closed. blairn closed this issue 4 years ago

blairn commented 4 years ago

Hi, it is us again.

So, we are on to the next data type, and this time it is timestamps. Yeah, I know these will be a pain; the type already appears in the same test files we used for the boolean test.

I'm sorry for bugging you again this quickly; we VERY much appreciate the help you have already given. It has made a big difference to us.

blairn commented 4 years ago

The INT96 binary encoding is split into 2 parts: the first 8 bytes are nanoseconds since midnight, and the last 4 bytes are the Julian day.

To get Unix seconds, I think it is (julian_day - 2440588) * 86400 + nanoseconds / (1000 * 1000 * 1000)

It looks like there are some ENDIAN fun and games as well.

The Java code people have used to read it looks like this:

long nanos = Longs.fromBytes(data[7], data[6], data[5], data[4], data[3], data[2], data[1], data[0]);
long julianday = Ints.fromBytes(data[11], data[10], data[9], data[8]);
long epochmillis = DateTimeUtils.fromJulianDay(julianday) + TimeUnit.NANOSECONDS.toMillis(nanos);

I don't even know how someone gets to a format like this.
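
For what it's worth, here is a minimal Julia sketch of the same decoding, following the formula above and assuming the 12 bytes arrive little-endian (as the byte reversal in the Java snippet suggests); the function and argument names are made up for illustration:

function int96_to_unix_seconds(raw::Vector{UInt8})
    # first 8 bytes: nanoseconds since midnight (native byte order assumed little-endian)
    nanos = reinterpret(Int64, raw[1:8])[1]
    # last 4 bytes: Julian day number
    julian_day = reinterpret(Int32, raw[9:12])[1]
    # 2440588 is the Julian day of the Unix epoch (1970-01-01)
    (julian_day - 2440588) * 86400 + nanos / (1000 * 1000 * 1000)
end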

gvdr commented 4 years ago

This is one for @tanmaykm and @ViralBShah :-)

ViralBShah commented 4 years ago

@tanmaykm should see it soon (during daytime in India). Please bump again if you don't see activity in a day or so.

tanmaykm commented 4 years ago

I'll be on it today

tanmaykm commented 4 years ago

I have a PR at #50 that provides a method to interpret timestamps.

I tested this on this dataset, but the results do not match the corresponding CSV data.

I used some external tools to view the parquet file (e.g. http://parquet-viewer-online.com/) and they seem to match what the Julia code reads though. Could it be that the CSV file needs a correction?

ViralBShah commented 4 years ago

@blairn Bump. Fully understand you guys are busy, but do let us know if this works out. We are happy to stay on top of this over the weekend, and/or get on a video call if that helps iterate faster.

blairn commented 4 years ago

Ok, I've grabbed the decoder and put it into our code, and it works, except there is a timezone thing going on. I don't know where it records the timezone, or how it adjusts for it. We just compared the results we were getting from AWS to the results from the decoder.

So we put in the adjustment by hand (as a bit of an ugly hack); our country has a single timezone, so we can get away with that.

So we are good on our end. Don't feel like you have to hurry it along because of us.

Sorry for going quiet; I've been wrapped up in getting the automation pipeline finished. Thanks again for all of your help.

I'll see if I can dig up where it stores timezone, because looking at the spec it shouldn't be able to do that.

blairn commented 4 years ago

So our code looks like this:

using Dates, DataFrames
using FileIO   # assuming load/save come via FileIO (e.g. ParquetFiles.jl / CSVFiles.jl)

# Decode a 12-byte INT96 timestamp: first 8 bytes are nanoseconds since
# midnight, last 4 bytes are the Julian day.
function logical_timestamp(barr)
    nanos = read(IOBuffer(barr[1:8]), Int64)
    julian_days = read(IOBuffer(barr[9:12]), Int32)
    Dates.julian2datetime(julian_days) + Dates.Nanosecond(nanos) - Dates.Hour(12)
end

# Same decoding for a value that arrives packed in an Int128.
function logical_timestamp(i128::Int128)
    iob = IOBuffer()
    write(iob, i128)
    logical_timestamp(take!(iob))
end

function unpack(fname)
    println("unpacking $fname")
    local df = DataFrame(load(fname))
    df[!, :starttime] = map(x -> logical_timestamp(x), df[:, :starttime])
    save("$(fname).csv", df)
end

Just so you can see the hack we used. I can confirm THAT works, at least for us in UTC-12. But where it is storing the fact that it IS GMT-12, I have no idea. The EC2 instance's timezone is set to UTC, so it isn't that.

They ran into the same issue with fastparquet: https://github.com/dask/fastparquet/issues/257. Apparently there is metadata on columns in parquet files (I am learning more about parquet files every day), and they store timezone data there.

I bet you that is what is going on here. Either way, we are fine because of the amazing work you have done. (Again, it is huge for us to be able to read these files at all, and since they are all coming from the same source for us, we can get away with some pretty ugly hacks.)

tanmaykm commented 4 years ago

There is one parquet metadata format I saw here: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

But that doesn't seem to be the case in this test dataset; there doesn't seem to be any LogicalType in the SchemaElement for that column. Also, this format talks about timestamps being 64 bits, whereas we have 96. And from what it says about timezones, it still doesn't seem to give a way to encode a timezone; it just gives a way to annotate whether the timestamp is UTC or local. What "local" means is left to the decoder.

The INT96 format seems to be deprecated, but it looks like it was used until recently by popular tools. I found some interesting discussions in this PR: https://github.com/apache/parquet-format/pull/49. There is no mention of timezone support there. It also refers to the time as epoch time, which probably means it is to be interpreted as UTC?

So one possibility in our test dataset is that the encoder could have converted timestamps to UTC before writing to the file (and it did have the TZ information while doing that). In that case, decoding should do the TZ adjustment while reading back if we need results in local time. That can explain the - Dates.Hour(12) adjustment you had to do.

Does this look like a possibility? If yes, then maybe accepting an optional timezone parameter in logical_timestamp would be the right way to handle this?

blairn commented 4 years ago

The issue there is that the parquet files handed to Spark to unpack get the timezone right.

So it obviously has something IN the parquet itself letting it know which timezone it is.

This is what is confusing me.

And yeah, it is a totally deprecated data type, but we can't choose which data types we receive.


tanmaykm commented 4 years ago

Hmm... I haven't looked into the key-value metadata fields in ColumnMetaData or FileMetaData. Maybe Spark keeps something there.

blairn commented 4 years ago

I'm guessing it is in ColumnMetaData.

This is a mess, because it was never really meant to be a real format I think. This is what people used before timestamp_ms existed.

FastParquet, the Python library, looks there for the format.

And they started looking there, because that is where Arrow looks.

tanmaykm commented 4 years ago

This seems to be the way Spark deals with it: https://github.com/apache/spark/pull/19769. Looks like it is a configuration on Spark, with the TZ being picked up from the Spark session (and not the host machine's TZ).

Also see https://issues.apache.org/jira/browse/SPARK-12297 for a discussion around it.

blairn commented 4 years ago

This is the unpacking code (with the S3 bucket and object name changed):

import com.amazonaws.services.glue.{DynamicRecord, GlueContext}
import org.apache.spark.SparkContext
import com.amazonaws.services.glue.util.JsonOptions

object Unpacker {
  def main(args: Array[String]): Unit = {
    val sc: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(sc)

    var df = glueContext.getCatalogSource(database = "incoming", tableName = "daily").getDynamicFrame()
    glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions(Map("path" -> "s3://ourdata/out/daily", "compression" -> "gzip")), format = "csv").writeDynamicFrame(df)
  }
}

The job doesn't have any timezones set.

I don't know where it could be picking it up if it isn't from the parquet.

Time to break out the hex editor.

blairn commented 4 years ago

Either way, I guess handing it a timezone or an offset is a good plan.

Even if it could be detected from the parquet (and it likely can't, I guess), supplying an offset or post-processing with the offset (like we do) is likely better.

tanmaykm commented 4 years ago

Updated #50 to add an optional offset keyword parameter to logical_timestamp
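
Roughly, the idea is something along these lines: the decoder from earlier in the thread, with the hand-coded Hour(12) adjustment replaced by a caller-supplied keyword (a sketch only; see #50 for the actual implementation):

using Dates

function logical_timestamp(barr; offset::Dates.Period = Dates.Hour(0))
    nanos = read(IOBuffer(barr[1:8]), Int64)
    julian_days = read(IOBuffer(barr[9:12]), Int32)
    # the caller supplies whatever timezone adjustment they need
    Dates.julian2datetime(julian_days) + Dates.Nanosecond(nanos) + offset
end

For example, logical_timestamp(barr; offset = -Dates.Hour(12)) would reproduce the manual adjustment used above.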