SELECT distinct --- Repeat error

NLPchina / elasticsearch-sql

Use SQL to query Elasticsearch

Apache License 2.0

6.99k stars 1.54k forks source link

SELECT distinct --- Repeat error #266

Closed flyingandrunning closed 8 years ago

flyingandrunning commented 8 years ago

SELECT distinct(userId) FROM logstash-api-access-2016.08.17 where ip = '112.65.191.239'

some data repeat,

出现部分的数据重复

shi-yuan commented 8 years ago

You can eliminate duplicates using aggregations

With terms aggregation the results will be grouped by one field

try SELECT userId FROM logstash-api-access-2016.08.17 where ip = '112.65.191.239' group by userId

wanglifengwf commented 8 years ago

@shi-yuan

sql
SELECT distinct(name) FROM user
explan

{
    "from": 0,
    "size": 200,
    "_source": {
        "includes": [
            "name"
        ],
        "excludes": []
    }
}

Distict don't has physical execution plan

ansjsun commented 8 years ago

hi @wanglifengwf i am not sure it support distinct

i am sure it support count(distinct(userId)) ;

you can try it SELECT userId FROM logstash-api-access-2016.08.17 where ip = '112.65.191.239' group by userId

eliranmoyal commented 8 years ago

Yeah distinct is not supported (there is no such feature in elasticsearch) Count distinct is supported(cardinality feature on elasticsearch) I'll make an error when using only distinct on the next version

wanglifengwf commented 8 years ago

@eliranmoyal @ansjsun I'm sorry,I forgot es does not support distinct.thanks!

shi-yuan commented 8 years ago

es support Finding Distinct Counts

Find distinct values, use Terms Aggregation. And you need to pay attention to how that field you want to get distinct values on is analyzed, meaning you need to make sure you're not tokenizing it while indexing, otherwise every entry in the aggregation will be a different term that is part of the field content.

allwefantasy commented 8 years ago

@eliranmoyal @ansjsun group by and distinct sometimes may have the same effect . If distinct used and group by block not found in sql ,we can just automatically add it for user .

SELECT distinct(userId) FROM logstash-api-access-2016.08.17 where ip = '112.65.191.239'

will be translate to :

SELECT userId FROM logstash-api-access-2016.08.17 where ip = '112.65.191.239' group by userId

ansjsun commented 8 years ago

@allwefantasy

or this

select topHits('size'=1) FROM logstash-api-access-2016.08.17 where ip = '112.65.191.239' group by userId