CGnal / spark-opentsdb


Returned data inconsistent with provided "opentsdb.interval" option #4

Open asavartsov opened 7 years ago

asavartsov commented 7 years ago

I'm trying to read a specific range of data with spark-opentsdb by providing the "opentsdb.interval" option, but the returned data doesn't match the requested interval.

The code I use to insert data

import java.sql.Timestamp
import com.cgnal.spark.opentsdb._

OpenTSDBContext.autoCreateMetrics = true
OpenTSDBContext.saltWidth = 1
OpenTSDBContext.saltBuckets = 4

val csv = spark.sqlContext.read.option("header", "true").option("inferSchema", "true").csv("hdfs:///adr.csv")

val data = csv.map(row => DataPoint("test", row.getAs[Timestamp]("Time").getTime, row.getAs[Double]("Value"), Map("tag" -> "value")))

data.rdd.toDF(spark).write.mode("append").opentsdb

adr.csv contains data from 2017-02-02T09:20:00.000Z (12:20:00 in my timezone, GMT+3) to 2017-02-02T10:20:00.000Z (13:20:00 in my timezone)

The code I use to read data

import org.apache.spark.sql.functions._

import java.sql.Timestamp
import com.cgnal.spark.opentsdb._

import spark.sqlContext.implicits._

OpenTSDBContext.saltWidth = 1
OpenTSDBContext.saltBuckets = 4

val readFrom = new Timestamp(1486026162527L)
val readTo = new Timestamp(1486030255027L)

val interval = s"${readFrom.getTime / 1000}:${readTo.getTime / 1000}"

val adr = spark.sqlContext
  .read
  .options(Map("opentsdb.metric" -> "test", "opentsdb.interval" -> interval))
  .opentsdb
  .orderBy($"timestamp".asc)

z.show(adr)

Results are

[Two screenshots of the query results]

I'm trying to read data starting from around 12:02 local time, but the results start at 13:00. If I change the from/to values I get different ranges, but they seem more or less random. Omitting the interval option returns all the data.
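To double-check which wall-clock times the interval string actually covers, the bounds can be converted back from epoch seconds. This is only a small sanity check, assuming the GMT+3 offset mentioned above; it uses plain `java.time` and does not touch spark-opentsdb at all.

```scala
import java.time.{Instant, ZoneOffset}

// Interval bounds from the read snippet, in epoch milliseconds.
val fromMillis = 1486026162527L
val toMillis = 1486030255027L

// "opentsdb.interval" is built from epoch seconds, so the milliseconds are dropped.
val fromSeconds = fromMillis / 1000
val toSeconds = toMillis / 1000

// Assumption: a fixed GMT+3 offset, as stated in the issue.
val local = ZoneOffset.ofHours(3)

val fromLocal = Instant.ofEpochSecond(fromSeconds).atOffset(local).toLocalTime
val toLocal = Instant.ofEpochSecond(toSeconds).atOffset(local).toLocalTime

// Prints: 1486026162:1486030255 -> 12:02:42 to 13:10:55
println(s"$fromSeconds:$toSeconds -> $fromLocal to $toLocal")
```

So the requested interval really does start at 12:02:42 local time, which rules out a mistake in how the interval string is built.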

I run the code on Spark 2.1.0, Hadoop 2.6.0-cdh5.10.0, HBase 1.2.0-cdh5.10.0 and OpenTSDB 2.3.0, in yarn-client mode in a Zeppelin notebook. Running in local mode in the Spark shell gives the same results.

asavartsov commented 7 years ago

adr.zip

My source data file. All values in millisecond resolution.

dgreco commented 7 years ago

Hi Alexey, I'll take a look. In the test cases I tried to check exactly this kind of scenario. Sometimes handling the timezone can be tricky. BTW, adding additional tests definitely helps. I'll let you know. David

asavartsov commented 7 years ago

I've discovered a pattern: the returned data starts at the 'from' boundary rounded up to the next hour, except when 'from' is set exactly to the start of an hour (00 minutes, 00 seconds, 000 ms). For example, if I set 'from' to 12:00:00, everything is OK; if I set it to 12:00:01, I get values starting at the 13:00:00 timestamp. The 'to' boundary is handled correctly. Similarly, on my sample dataset I can request data from 13:00:00 to 13:00:01, but not from 13:00:01 to 13:00:02.

dgreco commented 7 years ago

In fact, googling around the OpenTSDB documentation, I understood that the row key is generated from the hour, so all the data points for that hour are in the same row; in other words, the data is organised per hour. So it shouldn't be a defect in my implementation, but we should understand how to formulate the right query. You could try to run a similar query through the OpenTSDB daemon. If you run the daemon on the same HBase instance, don't forget to configure it with the right number of salting buckets. David
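The hourly row layout suggests one way such a query could be formulated: start the scan at the hour-aligned base of 'from' rather than at 'from' itself, then drop the points outside the requested interval. The following is a minimal sketch of that arithmetic only, using the bounds from the original report; `rowBase` and `inInterval` are hypothetical helper names, not part of spark-opentsdb, and this is not necessarily how the library handles it.

```scala
// OpenTSDB keeps one row per series per hour; the row key carries the
// hour-aligned base timestamp, in epoch seconds.
val SecondsPerRow = 3600L

// Hypothetical helper: base timestamp of the row that contains `epochSeconds`.
def rowBase(epochSeconds: Long): Long =
  epochSeconds - (epochSeconds % SecondsPerRow)

// Requested interval, in epoch seconds (09:02:42Z to 10:10:55Z).
val from = 1486026162L
val to = 1486030255L

// Scan whole rows: begin at the row containing `from`, end after the row containing `to`.
val scanStart = rowBase(from)             // 1486026000, i.e. 09:00:00Z
val scanStop = rowBase(to) + SecondsPerRow // exclusive upper bound

// Hypothetical client-side filter that restores the exact requested bounds.
def inInterval(timestamp: Long): Boolean =
  timestamp >= from && timestamp <= to

println(s"scan rows [$scanStart, $scanStop), then keep points in [$from, $to]")
```

With an unaligned 'from' such as 09:02:42Z, a scan that begins at the raw value would skip the 09:00:00Z row entirely, which is consistent with results starting only at the next full hour.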

asavartsov commented 7 years ago

Thanks for pointing out the cause of the problem. It turns out the issue can be fixed quite easily on the library side. See the referenced pull request.

dgreco commented 7 years ago

I merged your pull request into the spark-2.x branch.