bmschmidt / pySRP

Python Module implementing SRP
MIT License

Dot product performance and CPU usage #12

Open organisciak opened 4 years ago

organisciak commented 4 years ago

I've been trying to debug our vectorization.py code for converting EF files to SRP and GloVe. It has trouble parallelizing on my system, because all the cores quickly get used up.

It seems to be an issue with the dot product, which is used in this repo as well as my GloVe code for taking a weighted sum of word vectors. https://github.com/bmschmidt/pySRP/blob/3d873fabdf05269ed904638657b27cb12c3b8177/SRP/SRP.py#L213
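
The operation there is essentially a weighted sum of the word vectors, i.e. something like this (names illustrative, not necessarily the ones in SRP.py; counts is a length-n_words vector of word counts and scores is an n_words x dims matrix of word vectors):

values = np.dot(counts, scores)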

After some comparison, it seems that a weighted average with np.average runs faster, even with one process (and without clogging up all the CPUs). If you want a sum, you can multiply the result by the sum of the weights; for cosine distance, or when using normed vectors, it shouldn't matter either way.

Here's the fix that works faster for me:

values = np.average(scores, weights=counts, axis=0)

Or, to make the vectors identical to the dot product:

values = np.average(scores, weights=counts, axis=0) * sum(counts)
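
For reference, a quick sanity check that the two formulations agree (a minimal sketch; the shapes are the ones assumed above):

import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 640))                  # n_words x dims word vectors
counts = rng.integers(1, 50, size=1000).astype(float)  # per-word counts

via_dot = np.dot(counts, scores)                       # weighted sum of rows
via_avg = np.average(scores, weights=counts, axis=0) * counts.sum()
assert np.allclose(via_dot, via_avg)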

This may be specific to my system; I know linear algebra libraries are hard to set up, and there may be some BLAS magic that NumPy is attempting and failing at. Perhaps you can test on another system? If the speedup from the fix isn't universal, can I request an argument to stable_transform that switches to it when parallelizing?
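
In the meantime, here's a workaround sketch for the CPU clogging, assuming the cause is BLAS spawning a full thread pool inside every worker process (threadpoolctl is a third-party package, so this is a suggestion rather than anything the repo currently does):

import numpy as np
from threadpoolctl import threadpool_limits

# Cap the BLAS thread pool so five worker processes don't each
# try to grab every core for the same np.dot call.
with threadpool_limits(limits=1, user_api="blas"):
    values = np.dot(counts, scores)

Setting OMP_NUM_THREADS=1 in the environment before importing numpy should have a similar effect.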


1 PROCESS
Average
125 books done on thread 0: 0.68 books per second
Dot
125 books done on thread 0: 0.49 books per second

5 PARALLEL PROCESSES
Average
125 books done on thread 0: 0.52 books per second
125 books done on thread 1: 0.57 books per second
125 books done on thread 2: 0.67 books per second
125 books done on thread 3: 0.69 books per second
125 books done on thread 4: 0.52 books per second

Dot
125 books done on thread 0: 0.30 books per second
125 books done on thread 1: 0.26 books per second
125 books done on thread 2: 0.29 books per second
125 books done on thread 3: 0.24 books per second
125 books done on thread 4: 0.24 books per second
bmschmidt commented 4 years ago

Bump

bmschmidt commented 4 years ago

Looks like I did fix this locally a few months ago and just pushed it to the main branch. I'm using the version with the product to keep the results identical.

bmschmidt commented 3 years ago

@organisciak Reopening because this is at least co-culpable in the error at https://github.com/htrc/htrc-feature-reader/issues/42 (out-of-memory problems on HPC). Rolling back to the old version doesn't raise OOM errors. Reading the np.average code, it looks like it broadcasts the weights out to a big array and then swaps the axes.
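
For what it's worth, my reading of the internals (an assumption from skimming the NumPy source, not something I've profiled): the broadcast and axis swap are views, but the multiply that follows materializes a temporary the size of the whole input, which would explain the OOM on long texts.

import numpy as np

scores = np.ones((500_000, 640), dtype=np.float32)   # n_words x dims
counts = np.ones(500_000, dtype=np.float32)          # per-word weights

# Roughly the np.average path:
w = np.broadcast_to(counts[:, None], scores.shape)   # still a view, no copy
tmp = scores * w                                     # full-size temporary, ~1.2 GB here
avg = tmp.sum(axis=0) / counts.sum()

# The dot product never materializes that temporary:
values = np.dot(counts, scores)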

I'm going to try a different approach for the time being, which is to call np.matmul instead of np.dot. It appears that's the recommendation for 2d x 2d arrays anyway, so perhaps it will compile better.
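
Concretely, something like this (a sketch; reshaping counts to a row vector makes the call a plain 2-D x 2-D matrix product):

# counts: (n_words,), scores: (n_words, dims)
values = np.matmul(counts.reshape(1, -1), scores).ravel()

np.matmul also accepts the 1-D counts directly, np.matmul(counts, scores), which returns the same values without the reshape.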

If not, it would be possible to use the NumPy average code without the final division, which might provide a marginal speedup anyway while keeping the results essentially the same as the old dot-product code.
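
That variant would look roughly like this (a sketch of np.average's multiply-and-sum with the division dropped; it should match the old np.dot(counts, scores) output up to floating-point error):

values = np.multiply(scores, counts[:, None]).sum(axis=0)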