apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.39k stars 1.26k forks source link

Getting list of distinct non singleFieldValue field using select / distinct #6307

Open HyunjoonCho opened 3 years ago

HyunjoonCho commented 3 years ago

Hello, I am trying to find which values are stored for a specific field. Works okay with singleField ones, but does not with multiples.

With below schema

{
  "schemaName": "metric",
  "dimensionFieldSpecs": [
    {
      "name": "name",
      "dataType": "STRING"
    },
    {
      "name": "tags",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": "null"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "value",
      "dataType": "LONG",
      "defaultNullValue": -1
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestampInEpoch",
      "dataType": "LONG",
      "format" : "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
select name from metric group by name

gives me list of name field in the table

However,

select tags from metric group by tags

returns errorCode 200

[
  {
    "errorCode": 200,
    "message": "QueryExecutionError:\njava.lang.RuntimeException: Caught exception while building data table.\n\tat
org.apache.pinot.core.operator.blocks.InstanceResponseBlock.<init>(InstanceResponseBlock.java:46)\n\tat
org.apache.pinot.core.operator.InstanceResponseOperator.getNextBlock(InstanceResponseOperator.java:37)\n\tat 
org.apache.pinot.core.operator.InstanceResponseOperator.getNextBlock(InstanceResponseOperator.java:26)\n\tat 
org.apache.pinot.core.operator.BaseOperator.nextBlock(BaseOperator.java:49)\n\tat 
org.apache.pinot.core.plan.GlobalPlanImplV0.execute(GlobalPlanImplV0.java:48)\n\tat 
org.apache.pinot.core.query.executor.ServerQueryExecutorV1Impl.processQuery(ServerQueryExecutorV1Impl.java:294)\n\tat 
org.apache.pinot.core.query.executor.ServerQueryExecutorV1Impl.processQuery(ServerQueryExecutorV1Impl.java:215)\n\tat 
org.apache.pinot.core.query.executor.QueryExecutor.processQuery(QueryExecutor.java:60)\n\tat 
org.apache.pinot.core.query.scheduler.QueryScheduler.processQueryAndSerialize(QueryScheduler.java:155)\n\tat 
org.apache.pinot.core.query.scheduler.QueryScheduler.lambda$createQueryFutureTask$0(QueryScheduler.java:139)\n\tat 
java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat 
shaded.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)\n\tat 
shaded.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)"
  }
]
select COUNT(tags) from metric group by tags

gives out count of individual tag - I guess

Is there any workaround for this? Thank you

kishoreg commented 3 years ago

select tags from metric group by tags

This is not valid query - try this

select tags from metric

HyunjoonCho commented 3 years ago

Thanks for the comment,

select tags from metric

Yeah, this lists tags but what I'm trying to do is filtering out duplicate ones.

To illustrate, if there are tags like below:

["tag1", "tag2", "tag3"], ["tag1", "tag2", "tag3"], ["tag1", "tag2"], ["tag1", "tag2"], ["tag1", "tag2", "tag3"], ["tag4"] 

When I do query with above one, it lists all the tags as they are. I want to print out each tags list only once like this:

["tag1", "tag2", "tag3"], ["tag1", "tag2"], ["tag4"] 

Is there a proper query for this? Thank you.

xiangfu0 commented 3 years ago

hmm I don't think the group by behavior is per array based.

One solution we can think of is to add a function like array_concat function(not supported yet) to concat all the values in an array to a string then group by it.(Then ordering in the array will result in different groups ) E.g.

SELECT count(*), array_concat(tags) FROM myTable GROUP BY array_concat(tags)

@npawar can we use groovy to achieve this for now?