Stratio / cassandra-lucene-index

Lucene based secondary indexes for Cassandra
Apache License 2.0

Lucene with Spark-Cassandra-Connector returns empty results #377

Open junaidnasir opened 6 years ago

junaidnasir commented 6 years ago

We have been really excited by the potential of the Lucene indexing provided by Stratio. We have an IoT platform that has been ingesting time-series data into a C* cluster for some time (the 3-node cluster now holds around 1 TB of data).

Initially we had some latency issues when querying our sensor DB for windowed operations. Even though we were able to generate time-based keys (one per day), query performance over Spark SQL did not turn out to be independent of the total amount of data stored.

We now have Lucene indexing enabled on top of it, and direct (CQL-based) queries to the DB, of the type we are expecting, are extremely fast.

We use the DataStax connector, which requires the hack of adding an empty dummy column. However, we see that while the query is appropriately filtered by the Lucene index, Spark receives the data, apparently disregards it, and returns null. This is the same issue as #79. That thread said it was fixed in 1.6.0, but apparently it still exists in 2.1.0. Any help resolving it would be highly appreciated.

Using: cassandra-lucene-index 3.11.0.0, Cassandra 3.11.0, datastax:spark-cassandra-connector:2.0.3-s_2.11, Spark 2.1.1
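For reference, this is roughly the query path that fails (a sketch, not the exact job; the connection host is an assumption, and the JSON filter matches the plan shown further down):

```scala
// Sketch of the failing DataFrame query path, assuming
// spark-cassandra-connector 2.0.x against the alldev.temp table below.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.cassandra.connection.host", "127.0.0.1") // assumed host
  .getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "alldev", "table" -> "temp"))
  .load()

// The equality predicate on the dummy column is pushed down, but Catalyst
// also adds an IsNotNull(lucene) filter on the Spark side; since the dummy
// column is never written, every row that comes back is discarded.
df.filter("""lucene = '{filter: {type: "range", field: "value", lower: "0"}}'""")
  .show()
```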

CREATE TABLE alldev.temp (
    devid text,
    day date,
    datetime timestamp,
    lucene text,
    value text,
    PRIMARY KEY ((devid, day), datetime)
);
CREATE CUSTOM INDEX idx ON alldev.temp (lucene) 
  USING 'com.stratio.cassandra.lucene.Index'
  WITH OPTIONS = {
    'refresh_seconds': '1',
    'schema': '{
       fields: {
         devid: {type: "string"},
         day:{type: "date", pattern: "yyyy-MM-dd"},
         datetime:{type: "date"},
         value:{type:"integer"}
        }
    }'
  };

  cqlsh> select * from alldev.temp  where lucene =  '{filter: {type: "range", field: "value", lower: "0"}}' ;

The Spark plan for the same query is shown below; I think the problem is the filter isnotnull(lucene#3):

18/01/05 11:30:05 INFO CassandraSourceRelation: Input Predicates: [IsNotNull(lucene), EqualTo(lucene,{filter: {type: "range", field: "value",lower: "0"}})]
== Physical Plan ==
*Filter isnotnull(lucene#3)
+- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@2c9bc921 [devid#0,day#1,datetime#2,lucene#3,value#4] PushedFilters: [IsNotNull(lucene), *EqualTo(lucene,{filter: {type: "range", field: "value",lower: "0"}})], ReadSchema: struct<devid:string,day:date,datetime:timestamp,lucene:string,value:string>
phambryan commented 6 years ago

It's a bad query

SELECT * FROM alldev.temp WHERE expr(idx, '{ filter: { type: "range", field: "value", lower: 0, include_lower: true } }');

xiaohan815 commented 6 years ago

I also hit this problem. If my lucene column is not empty, then Spark returns results, so the only option I have is to put some unused value into this column. Are there any other solutions?
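The workaround described above (writing a placeholder into the dummy column so the IsNotNull(lucene) filter passes) could be done at ingest time roughly like this; a sketch only, with an arbitrary placeholder string and example values:

```scala
// Sketch: populate the otherwise-unused `lucene` column with a dummy value
// so Spark's IsNotNull(lucene) filter no longer discards every row.
import com.datastax.spark.connector.cql.CassandraConnector

CassandraConnector(sc.getConf).withSessionDo { session =>
  session.execute(
    """INSERT INTO alldev.temp (devid, day, datetime, lucene, value)
      |VALUES ('dev-1', '2018-01-05', toTimestamp(now()), 'x', '42')""".stripMargin)
}
```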