researcher2 opened this issue 4 years ago
Couldn't see anything obvious; I'll see if I can get it to freeze with a debugger attached.
> So currently we just run it twice and it works, but we can't get multiprocessing working at all; multiple unpickles of the same MinHashLSH object in different processes also freeze. From the docs it looks like we could just create a fresh object each time with the same basename?
Yes, you can create a fresh object in a newly forked process with the same basename.
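Something along these lines should work in each worker (a minimal sketch only; the Cassandra settings are placeholders and `chunks` stands in for your real data):

```python
from multiprocessing import Process
from datasketch import MinHashLSH

def worker(minhashes):
    # Each forked process builds its own MinHashLSH handle; the shared
    # basename makes every process point at the same Cassandra tables.
    lsh = MinHashLSH(
        threshold=0.5, num_perm=10, storage_config={
            'type': 'cassandra',
            'basename': b'minhashlsh',
            'cassandra': {
                'seeds': ['127.0.0.1'],
                'keyspace': 'cc_dedupe',
                'replication': {'class': 'SimpleStrategy', 'replication_factor': '1'},
                'drop_keyspace': False,
                'drop_tables': False,
            },
        })
    for key, mh in minhashes:
        lsh.insert(key, mh)

if __name__ == '__main__':
    chunks = []  # fill with lists of (key, MinHash) pairs
    procs = [Process(target=worker, args=(chunk,)) for chunk in chunks]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```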
> We are using Linux, so I don't think there are any file-locking issues? Our Cassandra install seems to work; I'm not getting connection problems or errors with other queries.
It would be great if you could put this in a unit test that replicates the bug -- Travis CI supports Cassandra instances and we already use it in our tests.
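A minimal regression test might look something like this (just a sketch: it assumes a local Cassandra and a `get_minhash_lsh_cassandra()` helper like the one posted below):

```python
import multiprocessing as mp
import pickle

def test_unpickle_in_child_process():
    # Reported bug: unpickling a Cassandra-backed MinHashLSH in another
    # process hangs forever, so guard the child with a timeout.
    lsh = get_minhash_lsh_cassandra()
    payload = pickle.dumps(lsh)
    child = mp.Process(target=pickle.loads, args=(payload,))
    child.start()
    child.join(timeout=30)
    frozen = child.is_alive()
    if frozen:
        child.terminate()
    assert not frozen, "unpickling the MinHashLSH froze in the child process"
```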
Uh oh. We moved over to creating a new LSH object in each process instance with a fixed basename. It worked well for about 8 gigabytes of data, but now we're getting this error dump:
```
  lsh = MinHashLSH(
  File "/home/researcher2/miniconda3/envs/the_pile/lib/python3.8/site-packages/datasketch/lsh.py", line 124, in __init__
    self.hashtables = [
  File "/home/researcher2/miniconda3/envs/the_pile/lib/python3.8/site-packages/datasketch/lsh.py", line 125, in <listcomp>
    unordered_storage(storage_config, name=b''.join([basename, b'_bucket_', struct.pack('>H', i)]))
  File "/home/researcher2/miniconda3/envs/the_pile/lib/python3.8/site-packages/datasketch/storage.py", line 99, in unordered_storage
    return CassandraSetStorage(config, name=name)
  File "/home/researcher2/miniconda3/envs/the_pile/lib/python3.8/site-packages/datasketch/storage.py", line 673, in __init__
    self._client = CassandraClient(cassandra_param, name, self._buffer_size)
  File "/home/researcher2/miniconda3/envs/the_pile/lib/python3.8/site-packages/datasketch/storage.py", line 381, in __init__
    self._session.execute(self.QUERY_CREATE_TABLE.format(table_name))
  File "cassandra/cluster.py", line 2618, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 4877, in cassandra.cluster.ResponseFuture.result
cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {})
```
We create the MinHashLSH object as follows:
```python
from datasketch import MinHashLSH

def get_minhash_lsh_cassandra():
    lsh = MinHashLSH(
        threshold=0.5, num_perm=10, storage_config={
            'type': 'cassandra',
            'basename': b'minhashlsh',
            'cassandra': {
                'seeds': ['127.0.0.1'],
                'keyspace': 'cc_dedupe',
                'replication': {
                    'class': 'SimpleStrategy',
                    'replication_factor': '1',
                },
                'drop_keyspace': False,
                'drop_tables': False,
            }
        }
    )
    return lsh
```
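Per document we then do roughly the following (a simplified sketch; the function name and tokenisation are placeholders, matching the generate/query/insert steps profiled below):

```python
from datasketch import MinHash

def is_duplicate(lsh, key, text, num_perm=10):
    # Generate the MinHash for this document.
    mh = MinHash(num_perm=num_perm)
    for token in text.split():
        mh.update(token.encode('utf8'))
    # Query for any candidate above the LSH threshold.
    if lsh.query(mh):
        return True
    # New document: add it to the index.
    lsh.insert(key, mh)
    return False
```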
We can still create other keyspaces and access Cassandra with cqlsh.
Unfortunately we are a bit time-constrained and are considering moving over to the new async Mongo code - would this be a decent alternative? Once we get something up and running for deduplication, we can hopefully create a test case that demonstrates the initial multiprocessing problem.
EDIT: Profiling of my operation shows:

```
Generate minhash took 0.01064 seconds.
Get mongo connection took 0.01501 seconds.
Query took 0.01521 seconds.
Insert took 0.66256 seconds.
Close took 0.00403 seconds.
Full document took 0.70834 seconds.
```
From the looks of it, Mongo inserts are unfortunately too slow: the insert alone accounts for 0.66 of the 0.71 seconds per document.
EDIT2: We are using in-memory LSH now, and I'm trying to create a test case for our original Cassandra issue for you.
@ostefano could you kindly help us out?
@ekzhu @researcher2 CREATE TABLE operations are quite I/O intensive in Cassandra because all nodes need to migrate to the new schema (and they do it in a blocking fashion, if I recall correctly). What happens if the operation of creating the MinHashLSH is simply repeated? The operation should be idempotent, IIRC.
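Something like this hypothetical retry wrapper around the `get_minhash_lsh_cassandra()` helper above could test that:

```python
import time
from cassandra.cluster import NoHostAvailable

def get_lsh_with_retries(retries=3, delay=5.0):
    # If table creation is idempotent, retrying after NoHostAvailable
    # should succeed once the schema migration has settled.
    for attempt in range(retries):
        try:
            return get_minhash_lsh_cassandra()
        except NoHostAvailable:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```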
@researcher2 is there any update on this issue so far? In-memory LSH isn't a bad idea -- faster update and query times, and you can just use pickle to save it to disk/S3. Have you tried the Redis storage layer?
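The pickle round trip is just this (a minimal sketch; the file name and toy MinHash are illustrative):

```python
import pickle
from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(threshold=0.5, num_perm=10)  # in-memory storage by default
mh = MinHash(num_perm=10)
for token in [b'a', b'b', b'c']:
    mh.update(token)
lsh.insert('doc1', mh)

# Save to disk (or upload the bytes to S3).
with open('lsh.pickle', 'wb') as f:
    pickle.dump(lsh, f)

# Load it back in a later run.
with open('lsh.pickle', 'rb') as f:
    lsh = pickle.load(f)
```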
In-memory LSH did the job for us once we got more RAM. Cassandra was OK performance-wise when it worked, but a bit too flaky. We never tried Redis, as the persistence of the in-memory LSH was handled via pickle, as you say.
Hello!
Firstly, thanks for the library - we're very happy with it overall. We are currently using MinHashLSH with the Cassandra backend to dedupe a large number of documents, and we're running into a niggling issue with unpickling the MinHashLSH object.
Our workflow: on the first run we build the MinHashLSH object and save (pickle) it; on subsequent runs we are just loading the pickled MinHashLSH object.

On the first run we get the freeze (the unpickling never completes). This also occurs if we try to pass the MinHashLSH object directly through Python multiprocessing, as the same pickling and unpickling happens there. On the second run the unpickling by itself works.
So currently we just run it twice and it works, but we can't get multiprocessing working at all; multiple unpickles of the same MinHashLSH object in different processes also freeze. From the docs it looks like we could just create a fresh object each time with the same basename?
We are using Linux, so I don't think there are any file-locking issues? Our Cassandra install seems to work; I'm not getting connection problems or errors with other queries.
I am about to have a look through the datasketch code to see if I can find anything obvious, or any indication that I'm doing something wrong.
Thank you!