Open Priyabrata409 opened 2 years ago
Have you tried passing these as part of the connection configs in `cassandra`?
http://ekzhu.com/datasketch/lsh.html#connecting-to-existing-minhash-lsh
Yes, I have tried, but it doesn't accept the parameters required to connect to Cassandra. You could have a look at the parameters. I can currently connect to a local Cassandra instance, but connecting to an AWS keyspace fails.
I haven't used AWS Cassandra. @ostefano do you have experience with this?
No experience with Cassandra AWS either unfortunately
It is possible to connect to AWS Keyspaces by slightly tweaking the kwargs and the `get_session()` method in `CassandraSharedSession`. However, AWS Keyspaces does not yet support the `SELECT DISTINCT` query needed for `QUERY_GET_KEYS`. I have provided code below to demonstrate. Perhaps there is a way to rewrite the query to get around this constraint.
Call algorithm with AWS keyspaces

```python
# ssl_context, auth_provider and port are placeholders for your own
# SSLContext object, auth provider object and port number.
lsh = MinHashLSH(
    threshold=0.5, num_perm=128, storage_config={
        'type': 'cassandra',
        'basename': b'testing',
        'cassandra': {
            'seeds': ['cassandra.us-west-2.amazonaws.com'],
            'keyspace': 'tutorialkeyspace',
            'ssl_context': ssl_context,
            'auth_provider': auth_provider,
            'port': port,
            'replication': {
                'class': 'SimpleStrategy',
                'replication_factor': '3',
            },
            'drop_keyspace': False,
            'drop_tables': False,
        }
    }
)
```
Adjust Cluster instantiation for AWS kwargs

```python
# Method of CassandraSharedSession; c_cluster is the cassandra.cluster module.
def get_session(cls, seeds, **kwargs):
    _ = kwargs
    keyspace = kwargs["keyspace"]
    replication = kwargs["replication"]
    # .get() avoids a KeyError when no ssl_context is configured.
    ssl_context = kwargs.get("ssl_context")
    if cls.__session is None and ssl_context is None:
        # Plain connection (e.g. local Cassandra). Allow dependency injection.
        session = kwargs.get("session")
        if session is None:
            cluster = c_cluster.Cluster(seeds)
            session = cluster.connect()
        cls.__session = session
    if cls.__session is None and ssl_context is not None:
        # TLS connection (AWS Keyspaces). Allow dependency injection.
        session = kwargs.get("session")
        if session is None:
            cluster = c_cluster.Cluster(
                seeds,
                ssl_context=ssl_context,
                auth_provider=kwargs["auth_provider"],
                port=9142,
            )
            session = cluster.connect()
        cls.__session = session
    if cls.__session.keyspace != keyspace:
        if kwargs.get("drop_keyspace", False):
            cls.__session.execute(cls.QUERY_DROP_KEYSPACE.format(keyspace))
        cls.__session.execute(cls.QUERY_CREATE_KEYSPACE.format(
            keyspace=keyspace,
            replication=str(replication),
        ))
        cls.__session.set_keyspace(keyspace)
    return cls.__session
```
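The two nearly identical branches above could be merged by computing the extra `Cluster` keyword arguments in one place. The following is only a sketch: `cluster_kwargs` is a hypothetical helper that does not exist in datasketch, and it assumes the `ssl_context`, `auth_provider`, and `port` entries of the `cassandra` config are forwarded to `get_session()` as `kwargs`. The default of 9142 is the TLS port used in the examples in this thread.

```python
def cluster_kwargs(config, default_tls_port=9142):
    """Return extra keyword arguments for cassandra.cluster.Cluster.

    Hypothetical helper, not part of datasketch. With no 'ssl_context'
    in the config it returns {} so that Cluster(seeds) behaves as before.
    """
    ssl_context = config.get("ssl_context")
    if ssl_context is None:
        return {}
    return {
        "ssl_context": ssl_context,
        "auth_provider": config.get("auth_provider"),
        # Honour an explicit 'port' entry instead of hard-coding it.
        "port": config.get("port", default_tls_port),
    }
```

With this in place, a single branch in `get_session()` could call `c_cluster.Cluster(seeds, **cluster_kwargs(kwargs))` for both the local and the AWS case.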
@alexalbracht-firstparty thanks! Would you like to submit a PR to address this?
That specific query is only used once, and only to get all keys using the TOKEN function (special case).
Now, whether you can add a switch and handle AWS differently boils down to the following test: 1) create a table with a PK containing a CK, 2) insert 4 records so that the same PK is used twice, 3) run the query you want to run without DISTINCT and see if you get 2 or 4 records.
If you get back 2, then you can safely go ahead and remove DISTINCT from the query when using AWS.
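The three-step test described above could be scripted roughly as follows. This is a sketch under assumptions: the table name `distinct_probe`, its columns, and the exact `SELECT` are illustrative and are not copied from datasketch's `QUERY_GET_KEYS`. The `session` argument is expected to be a connected cassandra-driver session with a keyspace already set; Amazon Keyspaces may need additional settings that are not shown here.

```python
def count_rows_without_distinct(session, table="distinct_probe"):
    """Run the create/insert/select test and return the number of rows."""
    # 1) Primary key = partition key `key` + clustering key `value`.
    session.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(key blob, value blob, PRIMARY KEY (key, value))"
    )
    # 2) Four records, each partition key used twice.
    records = [(b"a", b"1"), (b"a", b"2"), (b"b", b"1"), (b"b", b"2")]
    for key, value in records:
        session.execute(
            f"INSERT INTO {table} (key, value) VALUES (%s, %s)", (key, value)
        )
    # 3) Token-based key query, deliberately without DISTINCT.
    rows = session.execute(f"SELECT key, TOKEN(key) AS f_token FROM {table}")
    return len(list(rows))


def distinct_can_be_dropped(session, table="distinct_probe"):
    """True if the query already returns one row per partition key."""
    return count_rows_without_distinct(session, table) == 2
```

A result of 4 rows means the query returns one row per record, so DISTINCT cannot simply be removed.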
How can I connect to AWS Keyspaces (Cassandra), given that it asks for an SSL certificate and the service's user name and password? How do I pass these to the `MinHashLSH` constructor? The way to connect to AWS Cassandra using Python is:

```python
from cassandra.cluster import Cluster
from ssl import SSLContext, PROTOCOL_TLSv1_2, CERT_REQUIRED
from cassandra.auth import PlainTextAuthProvider

ssl_context = SSLContext(PROTOCOL_TLSv1_2)
ssl_context.load_verify_locations('path_to_file/sf-class2-root.crt')
ssl_context.verify_mode = CERT_REQUIRED
auth_provider = PlainTextAuthProvider(username='ServiceUserName', password='ServicePassword')
cluster = Cluster(['cassandra.us-east-2.amazonaws.com'], ssl_context=ssl_context, auth_provider=auth_provider, port=9142)
session = cluster.connect()
r = session.execute('select * from system_schema.keyspaces')
print(r.current_rows)
```