Stratio / cassandra-lucene-index

Lucene based secondary indexes for Cassandra
Apache License 2.0
600 stars 171 forks source link

The composite primary key is sorted incorrectly #323

Closed evildecay closed 7 years ago

evildecay commented 7 years ago

Cluster information

Cassandra version  : 3.10
Number of clusters : 5

Test sql data

CREATE KEYSPACE IF NOT EXISTS demo WITH REPLICATION = {'class':'NetworkTopologyStrategy', 'dc1':3};

use demo;

CREATE TABLE test (
    id    text,
    name  text,
    day   date,
    ctime timestamp,
    PRIMARY KEY ((id, day), ctime)
);

CREATE CUSTOM INDEX IF NOT EXISTS test_lucene ON test ()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            name : {type:"string"},
            day  : {type:"date", pattern:"yyyy-MM-dd"},
            ctime: {type:"date", pattern:"yyyy-MM-dd HH:mm:ss.SSS"}
        }
    }'
};

insert into test (id, name, day, ctime) values ('1', 'Tom1', '2017-05-30', 1496138801000);
insert into test (id, name, day, ctime) values ('1', 'Tom2', '2017-05-30', 1496138802000);
insert into test (id, name, day, ctime) values ('1', 'Tom3', '2017-05-30', 1496138803000);
insert into test (id, name, day, ctime) values ('1', 'Tom4', '2017-05-30', 1496138804000);
insert into test (id, name, day, ctime) values ('1', 'Tom5', '2017-05-30', 1496138805000);
insert into test (id, name, day, ctime) values ('1', 'Tom6', '2017-05-31', 1496225201000);
insert into test (id, name, day, ctime) values ('1', 'Tom7', '2017-05-31', 1496225202000);
insert into test (id, name, day, ctime) values ('1', 'Tom8', '2017-05-31', 1496225203000);
insert into test (id, name, day, ctime) values ('1', 'Tom9', '2017-05-31', 1496225204000);

select * from test where id = '1' and day in ('2017-05-30', '2017-05-31') and
    expr(test_lucene, '{
        filter:[
            {type: "prefix", field: "name", value: "Tom"}
        ],
        sort:[{field:"ctime", reverse:true}]
    }') limit 10;
id |day        |ctime               |name |
---|-----------|--------------------|-----|
1  |2017-05-30 |2017-05-30 18:06:45 |Tom5 |
1  |2017-05-30 |2017-05-30 18:06:44 |Tom4 |
1  |2017-05-30 |2017-05-30 18:06:43 |Tom3 |
1  |2017-05-30 |2017-05-30 18:06:42 |Tom2 |
1  |2017-05-30 |2017-05-30 18:06:41 |Tom1 |
1  |2017-05-31 |2017-05-31 18:06:44 |Tom9 |
1  |2017-05-31 |2017-05-31 18:06:43 |Tom8 |
1  |2017-05-31 |2017-05-31 18:06:42 |Tom7 |
1  |2017-05-31 |2017-05-31 18:06:41 |Tom6 |

All results should be sorted by query, but it is sorted by day.

ealonsodb commented 7 years ago

Hi @evildecay: Which version of cassandra-lucene-index are you using?

evildecay commented 7 years ago

Hi @ealonsodb: cassandra-lucene-index-3.10.0

ealonsodb commented 7 years ago

Hi @evildecay:

This is not a bug, this is the expected behaviour. Indeed your are executing two different SinglePartitionReadCommand over the same table.

Indeed the sortField is sorting only one-by-one single-partition results. This is a basic test that confirms this:

If you execute:

SELECT * FROM test WHERE id = '1' and day in ('2017-05-30', '2017-05-31') and
    expr(test_lucene, '{
        filter:[
            {type: "prefix", field: "name", value: "Tom"}
        ],
        sort:[{field:"ctime", reverse:true}]
    }') limit 10;

you'll get:

 id | day        | ctime                    | name
----+------------+--------------------------+------
  1 | 2017-05-30 | 2017-05-30 10:06:45+0000 | Tom5
  1 | 2017-05-30 | 2017-05-30 10:06:44+0000 | Tom4
  1 | 2017-05-30 | 2017-05-30 10:06:43+0000 | Tom3
  1 | 2017-05-30 | 2017-05-30 10:06:42+0000 | Tom2
  1 | 2017-05-30 | 2017-05-30 10:06:41+0000 | Tom1
  1 | 2017-05-31 | 2017-05-31 10:06:44+0000 | Tom9
  1 | 2017-05-31 | 2017-05-31 10:06:43+0000 | Tom8
  1 | 2017-05-31 | 2017-05-31 10:06:42+0000 | Tom7
  1 | 2017-05-31 | 2017-05-31 10:06:41+0000 | Tom6

And if you execute it reversed

select * from test where id = '1' and day in ('2017-05-30', '2017-05-31') and
    expr(test_lucene, '{
        filter:[
            {type: "prefix", field: "name", value: "Tom"}
        ],
        sort:[{field:"ctime", reverse:false}]
    }') limit 10;

you'll get:

 id | day        | ctime                    | name
----+------------+--------------------------+------
  1 | 2017-05-30 | 2017-05-30 10:06:41+0000 | Tom1
  1 | 2017-05-30 | 2017-05-30 10:06:42+0000 | Tom2
  1 | 2017-05-30 | 2017-05-30 10:06:43+0000 | Tom3
  1 | 2017-05-30 | 2017-05-30 10:06:44+0000 | Tom4
  1 | 2017-05-30 | 2017-05-30 10:06:45+0000 | Tom5
  1 | 2017-05-31 | 2017-05-31 10:06:41+0000 | Tom6
  1 | 2017-05-31 | 2017-05-31 10:06:42+0000 | Tom7
  1 | 2017-05-31 | 2017-05-31 10:06:43+0000 | Tom8
  1 | 2017-05-31 | 2017-05-31 10:06:44+0000 | Tom9

With this test we can realize that sorting is happening just inside each partition.

I think you can achieve what you want using 'ORDER BY ctime' but it is not posible to combine this with the limit clause:

SELECT * FROM test WHERE id = '1' AND day IN ('2017-05-30', '2017-05-31') 
AND expr(test_lucene, '{
        filter:[
            {type: "prefix", field: "name", value: "Tom"}
        ],
        sort:[{field:"ctime", reverse:false}]
    }') ORDER BY ctime ASC;

 id | day        | ctime                    | name
----+------------+--------------------------+------
  1 | 2017-05-30 | 2017-05-30 10:06:41+0000 | Tom1
  1 | 2017-05-30 | 2017-05-30 10:06:42+0000 | Tom2
  1 | 2017-05-30 | 2017-05-30 10:06:43+0000 | Tom3
  1 | 2017-05-30 | 2017-05-30 10:06:44+0000 | Tom4
  1 | 2017-05-30 | 2017-05-30 10:06:45+0000 | Tom5
  1 | 2017-05-31 | 2017-05-31 10:06:41+0000 | Tom6
  1 | 2017-05-31 | 2017-05-31 10:06:42+0000 | Tom7
  1 | 2017-05-31 | 2017-05-31 10:06:43+0000 | Tom8
  1 | 2017-05-31 | 2017-05-31 10:06:44+0000 | Tom9

OR

SELECT * FROM test WHERE id = '1' AND day IN ('2017-05-30', '2017-05-31') 
AND expr(test_lucene, '{
        filter:[
            {type: "prefix", field: "name", value: "Tom"}
        ],
        sort:[{field:"ctime", reverse:false}]
    }') ORDER BY ctime DESC;

 id | day        | ctime                    | name
----+------------+--------------------------+------
  1 | 2017-05-31 | 2017-05-31 10:06:44+0000 | Tom9
  1 | 2017-05-31 | 2017-05-31 10:06:43+0000 | Tom8
  1 | 2017-05-31 | 2017-05-31 10:06:42+0000 | Tom7
  1 | 2017-05-31 | 2017-05-31 10:06:41+0000 | Tom6
  1 | 2017-05-30 | 2017-05-30 10:06:45+0000 | Tom5
  1 | 2017-05-30 | 2017-05-30 10:06:44+0000 | Tom4
  1 | 2017-05-30 | 2017-05-30 10:06:43+0000 | Tom3
  1 | 2017-05-30 | 2017-05-30 10:06:42+0000 | Tom2
  1 | 2017-05-30 | 2017-05-30 10:06:41+0000 | Tom1

Hope this helps

evildecay commented 7 years ago

Hi @ealonsodb: Thank you for your analytical solution. It seems that cassandra itself is the mechanism of the decision.