Closed HannesHolste closed 8 years ago
I believe the first issue is due to Python's random.sample
not working with numpy array
s (note, error difference likely due to py2.7 vs 3.5):
>>> import numpy as np
>>> import random
>>> a = np.arange(100).reshape(10, 10)
>>> random.sample(a, 2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/daniel/miniconda3/envs/py35-base/lib/python3.5/random.py", line 311, in sample
raise TypeError("Population must be a sequence or set. For dicts, use list(d).")
TypeError: Population must be a sequence or set. For dicts, use list(d).
The second is due to these lines as it expects large tables and the tests is pretty small. Also, as @wasade pointed out, those are parameters that need some exploration ...
@wasade Thank you, that's indeed problematic and I'll fix it by adapting numpy random.choice
to work with it.
The actual error, I realized, is because the distance matrix is 10x10, and the default target dimensionality when not otherwise specified is 10 – obviously there's no use in PCoAing from 10 -> 10.
I wrote an error check to deal with such invalid parameters:
if distance_matrix.shape[0] <= num_dimensions_out:
raise ValueError('Cannot run PCoA on distance_matrix with shape %s and num_dimensions_out=%d: '
'nothing to reduce!' % (repr(distance_matrix.shape), num_dimensions_out))
But there is still a problem with the given code:
Let's assume distance_matrix.shape[0] = 10 and num_dimensions_out = 9
Then n_seeds = int(log(distance_matrix.shape[0], 2)) ** 2
gives us n_seeds = 9
.
n_seeds = int(log(distance_matrix.shape[0], 2)) ** 2
if n_seeds <= num_dimensions_out: # 9 <= 9
n_seeds = num_dimensions_out + 1 # gives us n_seeds = 10
elif n_seeds >= distance_matrix.shape[0]:
n_seeds = distance_matrix.shape[0] - 1
Now the error occurs because sample_distmtx = np.array(random.sample(distance_matrix, n_seeds))
cannot sample a matrix of size 10x10 from a matrix of size 10x10.
I skimmed the Nystrom paper but couldn't find this logic in there. Would it be safe to change the above code to this? I don't know where to look to find the original author's reasoning behind calculating n_seeds.
n_seeds = int(log(distance_matrix.shape[0], 2)) ** 2
if n_seeds <= num_dimensions_out:
n_seeds = num_dimensions_out # remove the + 1
elif n_seeds >= distance_matrix.shape[0]:
n_seeds = distance_matrix.shape[0] - 1
@antgonza is n_seeds
what you mean by 'parameters that need exploration'? Could you elaborate kind of exploration you're thinking of?
Nop, the exploration is in scmds where we are not sure on the best window and overlap sizes. Obviously, a larger overlap and window size, the better the results are (closer to real, this is an approximation vs. an exact approximation) but also the closer to the actual time to perform the regular method.
Now, I would leave nystrom as is so we can tests the original code as developed .... BTW it has had different rounds of optimization as the code was originally in C and it was converted to python.
Ok, thanks Antonio.
@wasade : it appears that the way random.sample
works changed from py 2.7 to 3.x:
$ python
Python 2.7.11 |Continuum Analytics, Inc.| (default, Dec 6 2015, 18:57:58)
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> import numpy as np
>>> import random
>>> a = np.arange(100).reshape(10, 10)
>>> random.sample(a, 2)
[array([80, 81, 82, 83, 84, 85, 86, 87, 88, 89]), array([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])]
I'm going to leave it untouched for now as mdsa runs on py 2.7, unless you think it's a good idea to make this future-proof.
@antgonza: using your unweighted_unifrac_dm (10x10) as input, I get these errors. However, with the attached full_sym_matrix.txt (100 x 100), they work.
Seems to be a size issue..? Any idea why?
nystrom
scmds
see full code here
unweighted_unifrac_dm.txt unweighted_unifrac_pc.txt