Support paging in SCLI - Githubissues

zengzh commented 7 years ago

Hi @ealonsodb

ElasticSearch accepts “from” and “size” parameters so that users can retrieve a certain number of results starting from a particular position. https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html

Does SCLI have this feature? For example, can I issue a query as follows:

SELECT * FROM tweets WHERE expr(tweets_index, '{
query: {type: "match", field: "body", value: "FIFA"},
limit: {offset: "100", pagesize: "100"}
}');

Would this retrieve the tweets about FIFA in pages of 100 tweets, skipping the first 100 tweets? If not, do the Stratio folks have plans to support this? Thanks.

ealonsodb commented 7 years ago

Hi @zengzh:

As stated in the docs, SCLI supports CQL paging.

In your use case, the match query acts as a 'boolean' relevance query (a row either matches or it does not), so it does not make sense to sort the results by relevance. The searching documentation may help you understand this.

Hope this helps

zengzh commented 7 years ago

Thanks @ealonsodb for the quick reply.

Sorry for the inappropriate example. Maybe a better one is the following:

SELECT * FROM tweets WHERE expr(tweets_index, '{
query: {type: "phrase", field: "body", value: "big data gives organizations"},
limit: {offset: "20", pagesize: "80"}
}');

According to the CQL paging documentation, PAGING ON displays query results in 100-line chunks followed by the more prompt. This functionality is limited in two respects:

The page size is fixed (the example asks for pagesize: "80")
There is no way to specify the number of results to skip (offset: "20" in the example)

Is there any way around these limitations?

ealonsodb commented 7 years ago

Execute PAGING 50 in cqlsh and see what happens! Indeed, the page size of 100 is a cqlsh.py variable that you can change.

The query you are executing is a relevance query, so results from different Cassandra nodes must be sorted in the coordinator node. What I mean is that, even when an offset is provided, there is no way to know the starting point in each node's data subset, so it is still necessary to execute the first-page query and discard those results.
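To illustrate the point above, here is a minimal Python sketch of a coordinator merging relevance-sorted hits from several nodes. It is not SCLI code: `top_page`, the `(score, key)` tuples and the two sample nodes are hypothetical. It only shows why every node has to produce its first `offset + page_size` hits before the coordinator can discard the first `offset` of the merged list.

```python
import heapq
from itertools import islice

def top_page(node_results, offset, page_size):
    """Merge per-node hits (each sorted by descending score) and return one page."""
    # A node cannot know where the global offset falls inside its own subset,
    # so each node must contribute its first offset + page_size hits.
    needed = offset + page_size
    per_node = [hits[:needed] for hits in node_results]
    # heapq.merge expects ascending order, so sort on the negated score.
    merged = heapq.merge(*per_node, key=lambda hit: -hit[0])
    # The coordinator discards the first `offset` merged hits.
    return list(islice(merged, offset, needed))

# Hypothetical (score, row key) hits from two nodes.
node_a = [(0.9, "a1"), (0.5, "a2"), (0.1, "a3")]
node_b = [(0.8, "b1"), (0.7, "b2"), (0.2, "b3")]

page = top_page([node_a, node_b], offset=2, page_size=2)
print(page)
```

In this toy data the two best hits come from different nodes, so neither node could have skipped two rows locally without dropping a row that belongs on the requested page.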

Paging functionality is covered by CQL paging, and you can very easily skip whatever results you want in the client.
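A rough sketch of what skipping in the client could look like, in Python. `fetch_pages` is a hypothetical stand-in for a driver result set that pulls rows from the server one page at a time, and `skip_and_take` is an illustrative helper, not part of any driver. With a real driver the iterable would be the result set returned by the query.

```python
from itertools import islice

def fetch_pages(rows, fetch_size):
    """Stand-in for a paged result set: yields rows one fetched page at a time."""
    for start in range(0, len(rows), fetch_size):
        yield from rows[start:start + fetch_size]

def skip_and_take(result_iter, offset, page_size):
    """Discard the first `offset` rows on the client, then keep `page_size` rows."""
    return list(islice(result_iter, offset, offset + page_size))

# Hypothetical data: 200 matching rows, fetched 50 at a time.
rows = [f"tweet-{i}" for i in range(200)]
window = skip_and_take(fetch_pages(rows, fetch_size=50), offset=20, page_size=80)
print(window[0], window[-1], len(window))
```

Because the iterator is lazy, pages beyond the requested window are never fetched, but the skipped rows are still transferred to the client before being dropped.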

Hope this helps

zengzh commented 7 years ago

Thanks @ealonsodb. It surprises me that the official documentation does not mention that the page size can be customized.

Cassandra supports paging but does not encourage offset queries.

I understand that, even with an offset in SCLI, it still needs to compute the first page and discard those results (keys). But this avoids retrieving the whole set of tuples from Cassandra and then discarding them. From this point of view, computing and discarding results in SCLI, instead of computing and discarding tuples from Cassandra, is helpful, right?

ealonsodb commented 7 years ago

Hi @zengzh: You are totally right. Thank you for changing our mind about this feature. We have coded it in #342. Could you please take a look?

zengzh commented 7 years ago

Thanks @ealonsodb. I see that you mentioned skip "is not compatible with paging or top-K queries". Can you explain why that is? Did you add any validation check? If so, what is it?

ealonsodb commented 7 years ago

Hi @zengzh: The main problem with paging and top-K queries is that Cassandra resolves inconsistencies between the data of different nodes in the coordinator, after any 2i-related functions have run. If the 2i skips some rows, deterministic behaviour (seeing the same results in the same order across different executions of the same query) may be lost.

Hope this helps

zengzh commented 7 years ago

Hi @ealonsodb: Sorry, I do not fully understand. What are 2i-related functions? Can you give an example of paging or top-K queries that return non-deterministic results because of skip? If I specify the sorting field, will the results still be non-deterministic? Thanks very much!

ealonsodb commented 7 years ago

Hi @zengzh:

2i is the acronym for secondary index in Cassandra. It is the only contact point between Cassandra and our product, which is our implementation of the Index interface. We are locked to that Cassandra Index implementation.

When querying our product you can use query or filter.

When you use query, you are asking for the best-fitting rows that match your query.
If you use filter, it just means "give me any rows that match the query", not sorted.
Filter plus sort acts exactly the same as a query.

There is plenty of information on the internet if you search for "lucene query versus filter".

The main problem is that data consistency is enforced after the 2i-related sorting postProcess step. The second problem is that this case is unusual and does not happen in a stable cluster. What I mean here is that skip would work well if every node is up and the data is consistent between nodes, but it will start to fail if there are data inconsistencies.
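A toy Python model of the failure mode described above, under heavy simplifying assumptions: replicas are plain sets of row keys, reconciliation is a plain union, and sorting is by key. None of this is Cassandra or SCLI code, and all names are hypothetical. It only shows how applying skip inside a replica, before reconciliation, can return different rows depending on which replica answers.

```python
def skip_on_replica(rows, skip, limit):
    """Apply skip/limit inside one replica, before any reconciliation."""
    return sorted(rows)[skip:skip + limit]

def skip_after_reconcile(replicas, skip, limit):
    """Reconcile first (modelled here as a plain union), then apply skip/limit."""
    reconciled = sorted(set().union(*replicas))
    return reconciled[skip:skip + limit]

# Two replicas of the same data; replica_2 missed the write of "r2".
replica_1 = {"r1", "r2", "r3", "r4"}
replica_2 = {"r1", "r3", "r4"}

from_replica_1 = skip_on_replica(replica_1, skip=1, limit=2)
from_replica_2 = skip_on_replica(replica_2, skip=1, limit=2)
reconciled = skip_after_reconcile([replica_1, replica_2], skip=1, limit=2)

print(from_replica_1, from_replica_2, reconciled)
```

When both replicas hold the same rows the three results coincide, which matches the remark that skip behaves well in a stable, consistent cluster.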

Hope this helps

zengzh commented 7 years ago

Thanks @ealonsodb. So it is better to resolve inconsistencies before using skip.

ealonsodb commented 7 years ago

Hi @zengzh: Wow, I had never thought about it that way. Give me some time to test it thoroughly and maybe, with a big experimental warning attached, I will merge it.

Thank you for changing my mind.

zengzh commented 7 years ago

Thanks @ealonsodb. May I know when this skip feature will be merged into the release version?

zengzh commented 7 years ago

Hi @ealonsodb @adelapena ,

It has been a while since this feature was developed, but it remains unreleased. May I know the latest status and when it will be available?

Look forward to your reply. Many thanks.

feruud-sr commented 6 years ago

Hi @ealonsodb @adelapena, do you have plans to merge this feature soon? We are excited and impatient about this. Thank you a lot! 😬

Stratio / cassandra-lucene-index

Support paging in SCLI #327