ekzhu / datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
https://ekzhu.github.io/datasketch
MIT License
2.51k stars 294 forks

Connecting AWS keyspaces cassandra #176

Open Priyabrata409 opened 2 years ago

Priyabrata409 commented 2 years ago

How do I connect to AWS Keyspaces (Cassandra)? It asks for an SSL certificate and the service's user name and password. How can I pass these to MinHashLSH's constructor? This is how to connect to AWS Cassandra from Python:

```python
from cassandra.cluster import Cluster
from ssl import SSLContext, PROTOCOL_TLSv1_2, CERT_REQUIRED
from cassandra.auth import PlainTextAuthProvider

ssl_context = SSLContext(PROTOCOL_TLSv1_2)
ssl_context.load_verify_locations('path_to_file/sf-class2-root.crt')
ssl_context.verify_mode = CERT_REQUIRED
auth_provider = PlainTextAuthProvider(username='ServiceUserName', password='ServicePassword')
cluster = Cluster(['cassandra.us-east-2.amazonaws.com'], ssl_context=ssl_context, auth_provider=auth_provider, port=9142)
session = cluster.connect()
r = session.execute('select * from system_schema.keyspaces')
print(r.current_rows)
```

ekzhu commented 2 years ago

Have you tried passing these as part of the connection configs in cassandra? http://ekzhu.com/datasketch/lsh.html#connecting-to-existing-minhash-lsh

Priyabrata409 commented 2 years ago

Yes, I have tried, but it doesn't accept the parameters required to connect to AWS Cassandra. You could have a look at the parameters. I am currently able to connect to a local Cassandra, but connecting to AWS Keyspaces fails.

ekzhu commented 2 years ago

I haven't used AWS Cassandra. @ostefano do you have experience with this?

ostefano commented 2 years ago

No experience with Cassandra AWS either unfortunately

alexalbracht-firstparty commented 7 months ago

It is possible to connect to AWS Keyspaces by slightly tweaking the kwargs and the get_session() method in CassandraSharedSession. However, AWS Keyspaces does not yet support the SELECT DISTINCT query needed for QUERY_GET_KEYS. I have provided code below to demonstrate. Perhaps there is a way to rewrite the query to get around this constraint.


Call algorithm with AWS Keyspaces

```python
from datasketch import MinHashLSH

# ssl_context and auth_provider are built as in the first comment of this thread.
lsh = MinHashLSH(
    threshold=0.5, num_perm=128, storage_config={
        'type': 'cassandra',
        'basename': b'testing',
        'cassandra': {
            'seeds': ['cassandra.us-west-2.amazonaws.com'],
            'keyspace': 'tutorialkeyspace',
            'ssl_context': ssl_context,
            'auth_provider': auth_provider,
            'port': 9142,  # TLS port used earlier in this thread
            'replication': {
                'class': 'SimpleStrategy',
                'replication_factor': '3',
            },
            'drop_keyspace': False,
            'drop_tables': False,
        }
    }
)
```
Adjust Cluster instantiation for AWS kwargs

```python
    # Method of CassandraSharedSession, modified to accept the AWS kwargs.
    @classmethod
    def get_session(cls, seeds, **kwargs):
        keyspace = kwargs["keyspace"]
        replication = kwargs["replication"]

        if cls.__session is None:
            # Allow dependency injection
            session = kwargs.get("session")
            if session is None:
                ssl_context = kwargs.get("ssl_context")
                if ssl_context is None:
                    # Plain connection, e.g. a local Cassandra
                    cluster = c_cluster.Cluster(seeds)
                else:
                    # TLS connection with credentials, as AWS Keyspaces requires
                    cluster = c_cluster.Cluster(
                        seeds,
                        ssl_context=ssl_context,
                        auth_provider=kwargs.get("auth_provider"),
                        port=kwargs.get("port", 9142),
                    )
                session = cluster.connect()
            cls.__session = session

        if cls.__session.keyspace != keyspace:
            if kwargs.get("drop_keyspace", False):
                cls.__session.execute(cls.QUERY_DROP_KEYSPACE.format(keyspace))
            cls.__session.execute(cls.QUERY_CREATE_KEYSPACE.format(
                keyspace=keyspace,
                replication=str(replication),
            ))
            cls.__session.set_keyspace(keyspace)
        return cls.__session
```

ekzhu commented 7 months ago

@alexalbracht-firstparty thanks! Would you like to submit a PR to address this?

ostefano commented 7 months ago

That specific query is only used once, and only to get all keys using the TOKEN function (special case).
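
If the service cannot run DISTINCT at all, one possible fallback is to drop it from the query and collapse duplicate keys on the client. The helper below is only a sketch of that idea and is not part of datasketch: the name `distinct_keys` is hypothetical, and it assumes each returned row exposes the partition key as its first column (as the tuple-like rows of cassandra-driver do).

```python
def distinct_keys(rows):
    """Yield each partition key once, in first-seen order.

    `rows` is any iterable of tuple-like rows whose first column is the
    partition key (an assumption of this sketch).
    """
    seen = set()
    for row in rows:
        key = row[0]
        if key not in seen:
            seen.add(key)
            yield key


# Example with plain tuples standing in for driver rows.
rows = [(b"a",), (b"a",), (b"b",), (b"a",)]
print(list(distinct_keys(rows)))
```

The `seen` set grows with the number of distinct keys, so this trades server-side work for client memory.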

Now, whether you can add a switch and handle AWS differently boils down to the following test: 1) create a table with a PK containing a CK, 2) insert 4 records so that each PK is used twice, 3) run the query you want to run without DISTINCT and see if you get 2 or 4 records.

If you get back 2, then you can safely go ahead and remove DISTINCT from the query when using AWS.
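
The test described above could be scripted roughly as follows. This is a sketch under several assumptions: the table name, column names, and both function names are made up for illustration; the table is assumed to already exist (created with `PROBE_DDL`, filling in your own table name) and to be ready for writes; and `session` is assumed to be a connected cassandra-driver session already configured with whatever consistency level the service requires. The final SELECT is a stand-in for the real key query.

```python
# Hypothetical probe table: pk is the partition key, ck the clustering key.
PROBE_DDL = "CREATE TABLE IF NOT EXISTS {table} (pk text, ck text, PRIMARY KEY (pk, ck))"


def probe_row_count(session, table):
    """Insert 4 rows over 2 partition keys, then count rows from a non-DISTINCT query."""
    for pk, ck in [("a", "1"), ("a", "2"), ("b", "1"), ("b", "2")]:
        # %s placeholders are the cassandra-driver syntax for simple statements.
        session.execute(f"INSERT INTO {table} (pk, ck) VALUES (%s, %s)", (pk, ck))
    return len(list(session.execute(f"SELECT pk FROM {table}")))


def safe_to_drop_distinct(row_count):
    """2 rows means one row per partition key, so DISTINCT would be redundant."""
    return row_count == 2
```

A result of 4 means the service returns one row per clustering row, in which case duplicates would still need to be removed some other way.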