benfred / implicit

Fast Python Collaborative Filtering for Implicit Feedback Datasets
https://benfred.github.io/implicit/
MIT License

Segfault using recommend_all when batch_size is default #273

Closed otosky closed 2 years ago

otosky commented 4 years ago

I'm running a script using Implicit (ALS) on GCP with 8 cores / 64 GB RAM on Ubuntu 18.04 and get segmentation faults when calling recommend_all. This is for ~18,000 users and roughly 5 million items, retrieving the top 2000 recommendations.

Setting MKL_NUM_THREADS to 1 didn't seem to help, and surprisingly the script doesn't present any issues when I run it locally on OSX. I've gotten around the issue on Ubuntu by lowering the batch_size, but I'm curious as to why the issue isn't present on my Mac?
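
The workaround on the 0.4.0 API looked roughly like this (a sketch, not the exact script; the toy matrices, factors, and the batch_size value are illustrative):

import numpy as np
from scipy.sparse import csr_matrix
from implicit.als import AlternatingLeastSquares

# toy item_users matrix (items x users), standing in for the real training data
item_users = csr_matrix(np.random.binomial(1, 0.01, size=(100, 50)).astype(np.float32))
user_items = item_users.T.tocsr()

model = AlternatingLeastSquares(factors=32)
model.fit(item_users)

# with the default batch_size the call segfaulted on Ubuntu;
# passing a smaller explicit batch_size worked around the crash
recommendations = model.recommend_all(user_items, N=10, batch_size=1000)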

OSX details: Cython.__version__ '0.29.6' numpy.__version__ '1.16.2' scipy.__version__ '1.2.1' implicit.__version__ '0.4.0'

Linux details: Cython.__version__ '0.29.12' numpy.__version__ '1.16.4' scipy.__version__ '1.3.0' implicit.__version__ '0.4.0'

Both running Python 3.7.3 from Anaconda (64-bit)

[New Thread 0x7fffe077b700 (LWP 10380)]
  0%|                                                 | 0/19173 [00:00<?, ?it/s]

Thread 7 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd7062980 (LWP 10341)]
0x00007fffe7b200df in fargsort_c(float*, int, int, int, int, int*) ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/implicit/recommender_base.cpython-37m-x86_64-linux-gnu.so

#0  0x00007fffe7b200df in fargsort_c(float*, int, int, int, int, int*) ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/implicit/recommender_base.cpython-37m-x86_64-linux-gnu.so
#1  0x00007fffe7af10ac in __pyx_pf_8implicit_16recommender_base_23MatrixFactorizationBase_4recommend_all(_object*, _object*, _object*, int, _object*, _object*, _object*, int, _object*, int) [clone ._omp_fn.0] ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/implicit/recommender_base.cpython-37m-x86_64-linux-gnu.so
#2  0x00007ffff5d1269b in __kmp_GOMP_microtask_wrapper ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/numpy/../../../libiomp5.so
#3  0x00007ffff5d63ed3 in __kmp_invoke_microtask ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/numpy/../../../libiomp5.so
#4  0x00007ffff5d26726 in __kmp_invoke_task_func ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/numpy/../../../libiomp5.so
#5  0x00007ffff5d2571c in __kmp_launch_thread ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/numpy/../../../libiomp5.so
#6  0x00007ffff5d6430b in _INTERNAL_26_______src_z_Linux_util_cpp_20354e55::__kmp_launch_worker(void*) ()
   from /home/ubuntu/anaconda3/lib/python3.7/site-packages/numpy/../../../libiom
#7  0x00007ffff7bbd6db in start_thread (arg=0x7fffd7062980)
    at pthread_create.c:463
#8  0x00007ffff78e688f in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
benfred commented 2 years ago

Can you try this on the latest main branch? The batch computation code has been substantially rewritten, and the function where the segfault occurs no longer exists.

otosky commented 2 years ago

I no longer have the Linux environment that generated this issue handy, but I tried out the new recommend API with batching from d86c4420bb219ad5621658840eff8eaa9dc53cf6 in a Docker container with fairly starved resources and it worked great 👍. Feel free to close this.

import numpy as np
from implicit.als import AlternatingLeastSquares
from implicit.datasets.lastfm import get_lastfm

def batch_generator(users, batch_size):
    # yield contiguous chunks of user indices, batch_size at a time
    user_idx_array = np.arange(len(users))

    for start_idx in range(0, len(user_idx_array), batch_size):
        batch = user_idx_array[start_idx: start_idx + batch_size]
        yield batch

if __name__ == "__main__":
    artists, users, plays = get_lastfm()

    model = AlternatingLeastSquares()
    # plays is indexed (artists, users); transpose so rows are users before fitting
    plays = plays.tocsr()
    user_plays = plays.T.tocsr()
    print(f"\nShape of user_items: {user_plays.shape}")

    model.fit(user_plays)

    for batch in batch_generator(users, batch_size=2000):
        print(f"Generating recommendations for users {batch.min()}-{batch.max()}")
        # slice user_plays down to this batch so its row count matches len(batch)
        ids, _ = model.recommend(batch, user_plays[batch], N=1000)
        # do extra work & write out batches to an external sink

    print("Finished generating recommendations")

It wasn't entirely clear to me from the lastfm example or the docs that I needed to pass a user_items matrix to .recommend whose number of rows matches the number of users being recommended for, until I started hitting this error:

https://github.com/benfred/implicit/blob/d86c4420bb219ad5621658840eff8eaa9dc53cf6/implicit/cpu/matrix_factorization_base.py#L44-L45
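
Concretely, with the variables from the script above, this is the mismatch that check catches (a sketch):

# user_plays has one row per user in the whole dataset, but `batch` only names a
# subset of users, so the row count and len(batch) don't line up:
#     ids, _ = model.recommend(batch, user_plays, N=1000)   # rejected by the check
# slicing down to the batched users makes the shapes agree:
ids, _ = model.recommend(batch, user_plays[batch], N=1000)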

I think line 154 here needs updating: https://github.com/benfred/implicit/blob/d86c4420bb219ad5621658840eff8eaa9dc53cf6/examples/lastfm.py#L150-L154


Thanks for building such a great lib!

benfred commented 2 years ago

Thanks for looking into this!

It wasn't entirely clear to me from the lastfm example or the docs that I needed to pass a user_items matrix to .recommend whose number of rows matches the number of users being recommended for, until I started hitting this error:

I've made some breaking changes to the API recently; #481 has an overview.

One of the API changes you hit is https://github.com/benfred/implicit/pull/526 . The idea there is that we don't need empty rows in the user_items matrix for users that aren't being recommended for, so invocations need to be updated to something like model.recommend(userids, user_items[userids]) instead of model.recommend(userids, user_items). This means that if you're recommending items for a single userid=10000000, you don't need to pass in a sparse matrix with 10,000,001 entries in its indptr member, and the call will be slightly faster in that case.
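
A before/after sketch of that change (the userid value and matrix names are illustrative):

import numpy as np

userids = np.array([10_000_000])

# before #526: the full user_items matrix was passed, so recommending for one
# high userid meant carrying a sparse matrix with ~10,000,001 indptr entries
#     ids, scores = model.recommend(userids, user_items)
# after #526: only the rows for the users being recommended for are needed
ids, scores = model.recommend(userids, user_items[userids])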

I'm trying to batch up all the breaking API changes into the next release (0.5.0).

I think line 154 here needs updating:

Good catch! I've fixed it here: https://github.com/benfred/implicit/pull/532

The new batch code is quite a bit more efficient than the old code, but it required some of these API changes to support it: I'm benchmarking the batch calls at about 2x faster on the CPU, and we now have GPU batch support too, which is over 60x faster than the CPU version on my system. The GPU support is fast enough that the lastfm.py example you're running is bottlenecked on writing the results out to disk rather than on actually calculating them.
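
For example, switching the lastfm-style script above to the GPU path is mostly a constructor flag, assuming implicit was built with CUDA support (a sketch; factors and N are illustrative):

from implicit.als import AlternatingLeastSquares

# use_gpu selects the CUDA implementation when a compatible GPU build is available
model = AlternatingLeastSquares(factors=64, use_gpu=True)
model.fit(user_plays)

# the batched recommend call looks the same as the CPU version
ids, scores = model.recommend(batch, user_plays[batch], N=1000)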