JanusGraph / janusgraph

JanusGraph: an open-source, distributed graph database
https://janusgraph.org
Other
5.27k stars 1.16k forks source link

Fix CQLPagingIterator for Amazon Keyspace #3390

Open li-boxuan opened 1 year ago

li-boxuan commented 1 year ago

Unfortunately, Amazon Keyspace diverges from Apache Cassandra when it comes to paging.

Due to the way Amazon Keyspace works, even if a page iterator returns empty results, there still can be a valid next page with data. This happens when ALLOW FILTERING is used. In JanusGraph, this means when a full scan is triggered. We should fix CQLPagingIterator in CQLKeyColumnValueStore.java.

At the moment, I am not 100% sure whether this would also require a fix from the DataStax java driver which implements paging internally as well.

C.C. @porunov since you did quite a bit of the CQLPagingIterator implementation.

References: https://stackoverflow.com/questions/69940435/amazon-keyspaces-allow-filtering-without-all-keys-in-node-js https://docs.aws.amazon.com/keyspaces/latest/devguide/working-with-queries.html#paginating-results https://github.com/JanusGraph/janusgraph/discussions/3378#discussioncomment-4405659

porunov commented 1 year ago

In this case I don't know how to determine that we finished reading data from Amazon Keyspace (at least I'm not sure how to do that with Datastax Java Driver). I would imagine that we can improve CQLPagingIterator with providing a function which builds a new statement with the last page state (i.e. currentResultSet.getExecutionInfo().getPagingState()). That said, if all those statements return nothing then it's not clear that we finished or not. Thus, it will lead to infinite loop. I guess as a workaround we could store the last used page state and the current page state. If they are equal then it means we reached the end (probably). That means we will always have a single redundant call at the end to verify that we really reached the end. Not sure how good this solution is. Also, I guess it could really be implemented on DataStax driver side.

hexa00 commented 1 year ago

@li-boxuan will that be a problem also for a SparkComputer iterating over all edges for example ? I'm not sure if it's doing iteration differently?

li-boxuan commented 1 year ago

@hexa00

That won't affect SparkGraphComputer. Spark traversal uses CqlRecordReader to load the data.

xentripetal commented 1 year ago

Some context on the datastax side, the issue is here

https://github.com/datastax/java-driver/blob/39bf5e363e79d0184b5897d65197202391967bae/core/src/main/java/com/datastax/oss/driver/internal/core/cql/MultiPageResultSet.java#L92

    @Override
    protected Row computeNext() {
      maybeMoveToNextPage();
      return currentRows.hasNext() ? currentRows.next() : endOfData();
    }

where it advances the page then ends the iterator if the next page has no data, regardless of if theres another page.

I'm surprised no one at AWS ran into this issue as this is their officially recommended java driver.

porunov commented 1 year ago

The issue could potentially be fixed in https://github.com/JanusGraph/janusgraph/pull/3760 because the pagination approach is changed in that PR and we don’t use sync pagination anymore. I assume after that PR is merged the issue could be potentially fixed, but I personally didn’t use Amazon Keyspaces, so can’t tell for sure. Anyone who is interested in Amazon Keyspaces is free to try it afterwards.