ekzhu / datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
https://ekzhu.github.io/datasketch
MIT License

Pickling MinHashLSH #135

Open researcher2 opened 4 years ago

researcher2 commented 4 years ago

Hello!

Firstly, thanks for the library - very happy with it overall. We are currently using MinHashLSH with the Cassandra backend to dedupe a large number of documents, and we're running into a niggling issue with unpickling the MinHashLSH object.

Our workflow:

  1. Fresh Cassandra instance
  2. Create the MinHashLSH as follows:

     lsh = MinHashLSH(
         threshold=0.5, num_perm=10, storage_config={
             'type': 'cassandra',
             'cassandra': {
                 'seeds': ['127.0.0.1'],
                 'keyspace': 'minhash_lsh_keyspace',
                 'replication': {
                     'class': 'SimpleStrategy',
                     'replication_factor': '1',
                 },
                 'drop_keyspace': False,
                 'drop_tables': False,
             }
         }
     )
  3. Pickle the MinHashLSH object
  4. Load the MinHashLSH object in another process.

On subsequent runs we are just loading the MinHashLSH object.
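Roughly, the pickle/load part of the workflow looks like this (a minimal sketch; the file name is just a placeholder):

import pickle

from datasketch import MinHashLSH

storage_config = {
    'type': 'cassandra',
    'cassandra': {
        'seeds': ['127.0.0.1'],
        'keyspace': 'minhash_lsh_keyspace',
        'replication': {'class': 'SimpleStrategy', 'replication_factor': '1'},
        'drop_keyspace': False,
        'drop_tables': False,
    }
}

# First run: build the Cassandra-backed LSH and pickle it to disk.
lsh = MinHashLSH(threshold=0.5, num_perm=10, storage_config=storage_config)
with open('lsh.pickle', 'wb') as f:
    pickle.dump(lsh, f)

# Later run / other process: load the pickled object.
# This is the step that never completes on the first run.
with open('lsh.pickle', 'rb') as f:
    lsh = pickle.load(f)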

On the first run, the unpickling freezes and never completes. The same thing happens if we pass the MinHashLSH object directly through Python multiprocessing, since the same pickling and unpickling occurs.

On the second run, the unpickling by itself works.

So currently we just run it twice and it works, but we can't get multiprocessing working at all; unpickling the same MinHashLSH object in multiple processes also freezes. From the docs it looks like we could just create a fresh object each time with the same basename?

We are using Linux, so I don't think there are any file locking issues. Our Cassandra install seems to work; I'm not getting any connection or other errors with other queries, etc.

I am about to have a look through the datasketch code to see if I can find anything obvious or an indicator that I'm doing something wrong.

Thank You

researcher2 commented 4 years ago

Couldn't see anything obvious. I'll see if I can get it to freeze with a debugger attached.

ekzhu commented 4 years ago

> So currently we just run it twice and it works, but we can't get multiprocessing working at all; unpickling the same MinHashLSH object in multiple processes also freezes. From the docs it looks like we could just create a fresh object each time with the same basename?

Yes, you can create a fresh object in a newly forked process with the same basename.
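For illustration, a minimal sketch of that pattern with Python multiprocessing (the basename, keyspace, and helper function names here are placeholders, not something prescribed by datasketch); each worker builds its own MinHashLSH pointing at the same Cassandra tables instead of receiving a pickled copy:

from multiprocessing import Pool

from datasketch import MinHash, MinHashLSH

def make_lsh():
    # Each process builds its own MinHashLSH; the shared basename makes
    # every instance point at the same Cassandra tables.
    return MinHashLSH(
        threshold=0.5, num_perm=10, storage_config={
            'type': 'cassandra',
            'basename': b'minhashlsh',
            'cassandra': {
                'seeds': ['127.0.0.1'],
                'keyspace': 'minhash_lsh_keyspace',
                'replication': {'class': 'SimpleStrategy', 'replication_factor': '1'},
                'drop_keyspace': False,
                'drop_tables': False,
            }
        }
    )

def index_document(doc):
    key, tokens = doc
    lsh = make_lsh()                 # fresh object per call, no pickling involved
    m = MinHash(num_perm=10)
    for token in tokens:
        m.update(token.encode('utf8'))
    lsh.insert(key, m)
    return key

if __name__ == '__main__':
    docs = [('doc1', ['a', 'b', 'c']), ('doc2', ['a', 'b', 'd'])]
    with Pool(processes=2) as pool:
        pool.map(index_document, docs)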

> We are using Linux, so I don't think there are any file locking issues. Our Cassandra install seems to work; I'm not getting any connection or other errors with other queries, etc.

It would be great if you could put this in a unit test that replicates the bug -- travis.org supports a Cassandra instance and we are using it in our tests.
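Such a test might look roughly like this (a sketch only -- it assumes a local Cassandra at 127.0.0.1 as in the Travis setup; the keyspace name and timeout are arbitrary):

import multiprocessing
import pickle
import unittest

from datasketch import MinHashLSH

STORAGE_CONFIG = {
    'type': 'cassandra',
    'cassandra': {
        'seeds': ['127.0.0.1'],
        'keyspace': 'lsh_pickle_test',
        'replication': {'class': 'SimpleStrategy', 'replication_factor': '1'},
        'drop_keyspace': True,
        'drop_tables': True,
    }
}

def _unpickle(data):
    # Runs in a child process; if unpickling hangs, the parent's join() times out.
    pickle.loads(data)

class TestPickleCassandraLSH(unittest.TestCase):
    def test_unpickle_in_fresh_process(self):
        lsh = MinHashLSH(threshold=0.5, num_perm=10, storage_config=STORAGE_CONFIG)
        data = pickle.dumps(lsh)
        p = multiprocessing.Process(target=_unpickle, args=(data,))
        p.start()
        p.join(timeout=60)
        still_running = p.is_alive()
        if still_running:
            p.terminate()
        self.assertFalse(still_running, "unpickling MinHashLSH froze in the child process")

if __name__ == '__main__':
    unittest.main()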

researcher2 commented 4 years ago

Uh oh. We moved over to creating a new lsh object in each process, each with the same fixed basename. It worked well for about 8 gigabytes of data, and now we're getting this error:

lsh = MinHashLSH(
  File "/home/researcher2/miniconda3/envs/the_pile/lib/python3.8/site-packages/datasketch/lsh.py", line 124, in __init__
    self.hashtables = [
  File "/home/researcher2/miniconda3/envs/the_pile/lib/python3.8/site-packages/datasketch/lsh.py", line 125, in <listcomp>
    unordered_storage(storage_config, name=b''.join([basename, b'_bucket_', struct.pack('>H', i)]))
  File "/home/researcher2/miniconda3/envs/the_pile/lib/python3.8/site-packages/datasketch/storage.py", line 99, in unordered_storage
    return CassandraSetStorage(config, name=name)
  File "/home/researcher2/miniconda3/envs/the_pile/lib/python3.8/site-packages/datasketch/storage.py", line 673, in __init__
    self._client = CassandraClient(cassandra_param, name, self._buffer_size)
  File "/home/researcher2/miniconda3/envs/the_pile/lib/python3.8/site-packages/datasketch/storage.py", line 381, in __init__
    self._session.execute(self.QUERY_CREATE_TABLE.format(table_name))
  File "cassandra/cluster.py", line 2618, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 4877, in cassandra.cluster.ResponseFuture.result
cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {})

We create the MinHashLSH object as follows:

from datasketch import MinHashLSH

def get_minhash_lsh_cassandra():
    # Called once per process; the shared basename means every process uses
    # the same Cassandra tables for the index.
    lsh = MinHashLSH(
        threshold=0.5, num_perm=10, storage_config={
            'type': 'cassandra',
            'basename': b'minhashlsh',
            'cassandra': {
                'seeds': ['127.0.0.1'],
                'keyspace': 'cc_dedupe',
                'replication': {
                    'class': 'SimpleStrategy',
                    'replication_factor': '1',
                },
                'drop_keyspace': False,
                'drop_tables': False,
            }
        }
    )
    return lsh

We can still create other keyspaces and access Cassandra with cqlsh.

We are a bit time constrained, unfortunately, and are considering moving over to new async Mongo code - would this be a decent alternative? Once we get something up and running for deduplication we can hopefully create a test case to demonstrate the initial problem with multiprocessing.

EDIT: Profiling of my operation shows:

Generate minhash took 0.01064 seconds.
Get mongo connection took 0.01501 seconds.
Query took 0.01521 seconds.
Insert took 0.66256 seconds.
Close took 0.00403 seconds.
Full document took 0.70834 seconds.

Unfortunately mongo inserts are too slow from the looks of it.

EDIT2: We are using the in-memory LSH now, and I'm trying to create a test case for our original Cassandra issue for you.

ekzhu commented 4 years ago

@ostefano could you kindly help us out?

ostefano commented 4 years ago

@ekzhu @researcher2 CREATE TABLE operations are quite I/O intensive in Cassandra because all nodes need to migrate to the new schema (and they do it in a blocking fashion, if I recall correctly). What happens if the operation of creating the MinHashLSH is simply repeated? The operation should be idempotent, IIRC.
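If so, something as simple as retrying the construction might get past the transient NoHostAvailable (a sketch only, reusing the get_minhash_lsh_cassandra() from the earlier comment; the retry count and delay are arbitrary):

import time

from cassandra.cluster import NoHostAvailable

def get_minhash_lsh_with_retry(retries=5, delay=5.0):
    # Retry the (idempotent) construction if Cassandra temporarily cannot
    # serve the CREATE TABLE statements issued by the storage layer.
    for attempt in range(retries):
        try:
            return get_minhash_lsh_cassandra()
        except NoHostAvailable:
            if attempt == retries - 1:
                raise
            time.sleep(delay)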

ekzhu commented 3 years ago

@researcher2 is there any update on this issue so far? In-memory LSH isn't a bad idea -- faster update and query times, and you can just use pickle to save it to disk/S3. Have you tried the Redis storage layer?
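For reference, minimal sketches of both options (the file path, basename, and Redis connection details are placeholders):

import pickle

from datasketch import MinHashLSH

# In-memory LSH, persisted to disk with pickle.
lsh = MinHashLSH(threshold=0.5, num_perm=10)
# ... minhash and insert documents here ...
with open('lsh.pickle', 'wb') as f:
    pickle.dump(lsh, f)

# Or the Redis storage layer instead of Cassandra.
lsh_redis = MinHashLSH(
    threshold=0.5, num_perm=10, storage_config={
        'type': 'redis',
        'basename': b'minhashlsh',
        'redis': {'host': 'localhost', 'port': 6379},
    }
)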

researcher2 commented 3 years ago

In-memory LSH did the job for us once we got more RAM. Cassandra was OK performance-wise when it worked, but it was a bit too flaky. We never tried Redis, since the persistence of the in-memory LSH was handled via pickle, as you say.