Stratio / cassandra-lucene-index

Lucene based secondary indexes for Cassandra
Apache License 2.0

Problem inserting data with JSON, followed by lucene query #335

Open nlacey opened 7 years ago

nlacey commented 7 years ago

I don't know if this is related to #107.

I'm running Cassandra cassandra22-2.2.8-1, and I pulled cassandra-lucene-index branch-2.2.8.

CREATE TABLE search.search_b (
    time_bucket text,
    id uuid,
    application text,
    lucene text,
    PRIMARY KEY (time_bucket, id)
) WITH CLUSTERING ORDER BY (id DESC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

CREATE CUSTOM INDEX searchb_search_index ON search.search_b (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = {'refresh_seconds': '1', 'schema': '{
fields : {
application: {type : "string", case_sensitive: false}
}
}'};

If you insert data using a normal cqlsh INSERT statement, everything works. But if I use INSERT JSON, I get a failure when running a Lucene search:

insert into search.search_b JSON '{"application":"test","id":"b668f5af-5cc8-11e7-a8a2-005056a63ab8","time_bucket":"2017-06"}';

select * from search_b where lucene = '{"query":{"type":"contains","field":"application","values":["test"]}}' limit 1;

ServerError: com.stratio.cassandra.lucene.IndexException: org.apache.cassandra.serializers.MarshalException: Invalid UTF-8 bytes 59553d98

Thanks for any help!
Running on CentOS 6.7.

ealonsodb commented 7 years ago

Hi @nlacey: I think I know why this is happening. It is an old issue.

When you execute a top-k query that is not directed to a single partition, each result row has a relevance score (how well it fits the query). We used to place that value in the dummy lucene column cell in order to send the score from the nodes involved to the coordinator node.

At some point, Cassandra (3.0.5?) started to modify those values: the value you write on one node is not the same as the one the coordinator reads. This forced us to perform a big refactor (in the 3.0.x and 3.x branches).

After that refactor, a top-k query is executed on every node against n FSIndexes. The partial results are returned to the coordinator node and written to a RAMIndex there, and the query is run again on that RAMIndex to sort them correctly. Now there is no need to share relevance scores between nodes.
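To illustrate the idea only, here is a toy sketch in Python. It is not the plugin's code: `term_score`, `node_top_k` and `coordinator_top_k` are hypothetical names, plain lists stand in for the FSIndexes and the RAMIndex, and the scoring function is a made-up term count rather than Lucene's relevance score. The point it shows is that nodes return rows without scores and the coordinator scores the pooled rows again.

```python
import heapq
from typing import Callable, Dict, List

Row = Dict[str, str]


def term_score(row: Row, term: str = "test") -> int:
    """Toy relevance (assumption): how many times the term occurs in 'application'."""
    return row["application"].split().count(term)


def node_top_k(rows: List[Row], score: Callable[[Row], int], k: int) -> List[Row]:
    """Each node searches its local data and returns its best k rows, without scores."""
    return heapq.nlargest(k, rows, key=score)


def coordinator_top_k(partials: List[List[Row]], score: Callable[[Row], int], k: int) -> List[Row]:
    """The coordinator pools the partial results and runs the query again on them."""
    pooled = [row for partial in partials for row in partial]
    return heapq.nlargest(k, pooled, key=score)


if __name__ == "__main__":
    node_a = [
        {"id": "a1", "application": "test"},
        {"id": "a2", "application": "test test"},
        {"id": "a3", "application": "other"},
    ]
    node_b = [
        {"id": "b1", "application": "test test test"},
        {"id": "b2", "application": "prod"},
    ]
    partials = [node_top_k(node_a, term_score, 2), node_top_k(node_b, term_score, 2)]
    print([row["id"] for row in coordinator_top_k(partials, term_score, 2)])
```

Because the coordinator computes the scores itself, nothing has to travel in the dummy lucene column, so a value changed in transit can no longer break the ordering.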

This refactor started here and involves a lot of changes. I think we must perform this refactor very slowly and cautiously.

You say that inserting without JSON works. I think I will add a warning to the 2.2.x docs and, in the future, with time, we will fix this.

What do you think, @nlacey ?

nlacey commented 7 years ago

The warning sounds good; it isn't a fun problem to debug. We'll be looking at moving to 3.10, where it seems to work.