DiffusionMapsAcademics / pyDiffMap

Library for diffusion maps
MIT License
47 stars 14 forks source link

Sparse Efficiency #18

Closed davisidarta closed 4 years ago

davisidarta commented 5 years ago

Hi! I'm glad to see such a nice module being developed and I'm currently using it for new biological big-data visualization methods.

When trying to use a sparse matrix (coo_matrix) as input, however, I get the following warning:

/usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py:2591: SparseEfficiencyWarning: Comparing sparse matrices using == is inefficient, try using != instead.

I don't know how hard-coded this is inside the module code. It could be interesting to check on that so that sparse inputs are resolved faster.

EDIT: The code works in the sense that there is a resulting object matching the ones generated with dense matrices. However, something seems to be happening to slow this process down when dealing with sparse matrices.

ehthiede commented 5 years ago

Interesting... it's hard to tell from the error message alone what the problem is. Could I trouble you for a minimal example that throws the warning so I can see where it is coming from?

davisidarta commented 5 years ago

Sorry for the lack of further details in the first post. Here's my code:

import numpy as np
import pandas as pd
import os.path
import fcsparser
from scipy.sparse import csc_matrix
from scipy.io import mmread
import tables
import pydiffmap as pdm

# Read in mtx file
count_matrix = mmread(source= "matrix.mtx")
mydmap = pdm.diffusion_map.DiffusionMap.from_sklearn(n_evecs = 100, epsilon = 'bgh', alpha = 1, k = 300)
dmap = mydmap.fit_transform(count_matrix)

From which I get the error:

/usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py:2591: SparseEfficiencyWarning: Comparing sparse matrices using == is inefficient, try using != instead.
  return bool(asarray(a1 == a2).all())

Is there a specific mechanic I should be looking into?

ehthiede commented 5 years ago

Not sure yet: I'll look into this and get back to you.

ehthiede commented 4 years ago

Sorry for the delay in getting to this!

I fixed the issue. The problem was that when one passes data to the diffusion map to transform, it checked the data getting passed was the same as the old data. If so, it would not recompute the diffusion map for efficiency. However, we were testing if the new and old data were the same using numpy.array_equal, which would automatically typecast the sparse matrix to a dense one. I've added code that first checks if the matrices are sparse before attempting to cast to a numpy array.

The code is pushed the dev branch.. I should mention we need to update the Continuous Integration parts of the code, so Travis lists the code as "broken." The tests did complete successfully on my machine.

Let me know if this fixes your problem: if so, I'll close the issue.

ehthiede commented 4 years ago

Update: the changes have been pushed to master. Closing the issue.