biocore / mds-approximations

Multidimensional scaling algorithms for microbiology-ecology datasets.
6 stars 7 forks source link

Algorithms fail with 10x10 test distance_matrix #16

Closed HannesHolste closed 8 years ago

HannesHolste commented 8 years ago

@antgonza: using your unweighted_unifrac_dm (10x10) as input, I get these errors. However, with the attached full_sym_matrix.txt (100 x 100), they work.

Seems to be a size issue..? Any idea why?

nystrom

$ python mdsa.py run ./data/unweighted_unifrac_dm.txt --algorithm nystrom
Multidimensional scaling approximations. For command line options, re-run with --help flag.

Running these algorithms on given matrix (from `./data/unweighted_unifrac_dm.txt` and outputting to `./out`): nystrom

> Running algorithm nystrom on distance matrix from input file: ./data/unweighted_unifrac_dm.txt
Traceback (most recent call last):
  File "mdsa.py", line 115, in <module>
    main()
  File "/Users/hannes/miniconda2/envs/mdsapprox/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/Users/hannes/miniconda2/envs/mdsapprox/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/Users/hannes/miniconda2/envs/mdsapprox/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/hannes/miniconda2/envs/mdsapprox/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/hannes/miniconda2/envs/mdsapprox/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "mdsa.py", line 63, in run
    eigenvectors, eigenvalues, percentages = algorithm.run(input_matrix)
  File "/Users/hannes/bio/knightlab/mds-approximations/mdsa/algorithms/nystrom.py", line 96, in run
    sample_distmtx = array(sample(distance_matrix, n_seeds))
  File "/Users/hannes/miniconda2/envs/mdsapprox/lib/python2.7/random.py", line 323, in sample
    raise ValueError("sample larger than population")
ValueError: sample larger than population

scmds

$ python mdsa.py run ./data/unweighted_unifrac_dm.txt --algorithm scmds
Multidimensional scaling approximations. For command line options, re-run with --help flag.

Running these algorithms on given matrix (from `./data/unweighted_unifrac_dm.txt` and outputting to `./out`): scmds

> Running algorithm scmds on distance matrix from input file: ./data/unweighted_unifrac_dm.txt
Traceback (most recent call last):
  File "mdsa.py", line 115, in <module>
    main()
  File "/Users/hannes/miniconda2/envs/mdsapprox/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/Users/hannes/miniconda2/envs/mdsapprox/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/Users/hannes/miniconda2/envs/mdsapprox/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/hannes/miniconda2/envs/mdsapprox/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/hannes/miniconda2/envs/mdsapprox/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "mdsa.py", line 63, in run
    eigenvectors, eigenvalues, percentages = algorithm.run(input_matrix)
  File "/Users/hannes/bio/knightlab/mds-approximations/mdsa/algorithms/scmds.py", line 54, in run
    self.combine_mds.add(tile_eigvecs, tile_overlap)
  File "/Users/hannes/bio/knightlab/mds-approximations/mdsa/algorithms/scmds.py", line 150, in add
    raise ValueError("not enough items for overlap in reference mds")
ValueError: not enough items for overlap in reference mds

see full code here

unweighted_unifrac_dm.txt unweighted_unifrac_pc.txt

wasade commented 8 years ago

I believe the first issue is due to Python's random.sample not working with numpy arrays (note, error difference likely due to py2.7 vs 3.5):

>>> import numpy as np
>>> import random
>>> a = np.arange(100).reshape(10, 10)
>>> random.sample(a, 2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/daniel/miniconda3/envs/py35-base/lib/python3.5/random.py", line 311, in sample
    raise TypeError("Population must be a sequence or set.  For dicts, use list(d).")
TypeError: Population must be a sequence or set.  For dicts, use list(d).
antgonza commented 8 years ago

The second is due to these lines as it expects large tables and the tests is pretty small. Also, as @wasade pointed out, those are parameters that need some exploration ...

HannesHolste commented 8 years ago

@wasade Thank you, that's indeed problematic and I'll fix it by adapting numpy random.choice to work with it.

The actual error, I realized, is because the distance matrix is 10x10, and the default target dimensionality when not otherwise specified is 10 – obviously there's no use in PCoAing from 10 -> 10.

I wrote an error check to deal with such invalid parameters:

if distance_matrix.shape[0] <= num_dimensions_out:
            raise ValueError('Cannot run PCoA on distance_matrix with shape %s and num_dimensions_out=%d: '
                             'nothing to reduce!' % (repr(distance_matrix.shape), num_dimensions_out))

But there is still a problem with the given code:

Let's assume distance_matrix.shape[0] = 10 and num_dimensions_out = 9 Then n_seeds = int(log(distance_matrix.shape[0], 2)) ** 2 gives us n_seeds = 9.

n_seeds = int(log(distance_matrix.shape[0], 2)) ** 2
        if n_seeds <= num_dimensions_out: # 9 <= 9
            n_seeds = num_dimensions_out + 1 # gives us n_seeds = 10
        elif n_seeds >= distance_matrix.shape[0]:
            n_seeds = distance_matrix.shape[0] - 1

Now the error occurs because sample_distmtx = np.array(random.sample(distance_matrix, n_seeds)) cannot sample a matrix of size 10x10 from a matrix of size 10x10.

I skimmed the Nystrom paper but couldn't find this logic in there. Would it be safe to change the above code to this? I don't know where to look to find the original author's reasoning behind calculating n_seeds.

n_seeds = int(log(distance_matrix.shape[0], 2)) ** 2
        if n_seeds <= num_dimensions_out:
            n_seeds = num_dimensions_out # remove the + 1
        elif n_seeds >= distance_matrix.shape[0]:
            n_seeds = distance_matrix.shape[0] - 1

@antgonza is n_seeds what you mean by 'parameters that need exploration'? Could you elaborate kind of exploration you're thinking of?

antgonza commented 8 years ago

Nop, the exploration is in scmds where we are not sure on the best window and overlap sizes. Obviously, a larger overlap and window size, the better the results are (closer to real, this is an approximation vs. an exact approximation) but also the closer to the actual time to perform the regular method.

Now, I would leave nystrom as is so we can tests the original code as developed .... BTW it has had different rounds of optimization as the code was originally in C and it was converted to python.

HannesHolste commented 8 years ago

Ok, thanks Antonio.

@wasade : it appears that the way random.sample works changed from py 2.7 to 3.x:

$ python
Python 2.7.11 |Continuum Analytics, Inc.| (default, Dec  6 2015, 18:57:58)
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> import numpy as np
>>> import random
>>> a = np.arange(100).reshape(10, 10)
>>> random.sample(a, 2)
[array([80, 81, 82, 83, 84, 85, 86, 87, 88, 89]), array([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])]

I'm going to leave it untouched for now as mdsa runs on py 2.7, unless you think it's a good idea to make this future-proof.