Psy-Fer / SquiggleKit

SquiggleKit: A toolkit for manipulating nanopore signal data
MIT License
120 stars 23 forks source link

Cannot import dtw_subsequence from mlpy #16

Closed shaniAmare closed 5 years ago

shaniAmare commented 5 years ago

Hi James,

running MotifSeq.py gives me this error from mlpy import dtw_subsequence ImportError: cannot import name dtw_subsequence

Were you planning to use another dynamic time warping module for this? Looks like dtw_subsequence is not a function of mlpy anymore. Please help!

Psy-Fer commented 5 years ago

Could you please did a pip freeze and dump the results?

Is this python 2 or 3?

I have some dtw varieties, but it would mean I'd have to drop it into a pip wheel

shaniAmare commented 5 years ago

I'm assuming whether the pierre-rouanet/dtw work? Adding the following to the script seems to produce something, but I'm not exactly confident whether I'm doing the right thing. If you can comment, if you be great!

Import the modules first... from dtw import dtw from scipy.spatial.distance import euclidean Then replace mlpy.dtw_subsequence with dtw as below dist, cost, acc_cost, path = dtw(adapter[name], sig_search, dist=euclidean)

shaniAmare commented 5 years ago

Could you please did a pip freeze and dump the results?

pip_freeze.txt

Is this python 2 or 3?

It is python 2 bash-4.1$ which python /usr/local/bioinfsoftware/python/python-2.7/bin/python

Psy-Fer commented 5 years ago

Ahh yea, there your mlpy version is 0.1.0

The pip wheel is horrendously out of date. Please follow the links to instructions for installing mlpy on the docs page. The bottom of this page https://psy-fer.github.io/SquiggleKitDocs/install/

Specifically this package https://sourceforge.net/projects/mlpy/files/mlpy%203.5.0/ for whichever OS you are using.

The particular version is 3.5.0

Hope that helps.

shaniAmare commented 5 years ago

Hi James,

Here is the new pip freeze after updating mlpy to 3.5.0: pip_freeze_new.txt

Still, I'm getting the Traceback (most recent call last): File "SourceFiles/SquiggleKit/MotifSeq.py", line 10, in <module> from mlpy import dtw_subsequence ImportError: cannot import name dtw_subsequence

I really appreciate it if you could help me figure out what I'm doing wrong.

Cheers, Shani.

Psy-Fer commented 5 years ago

Hello,

the dtw_subsequence method is definately still there with that version. I downloaded the package again and double checked. image

My guess would be there are some residual leftovers from the previous mlpy install from pip.

Try this maybe?

python -c "import mlpy; print(mlpy.__version__)"

James.

shaniAmare commented 5 years ago

Hey, thanks James.

You are correct! The version showed as 0.1.0 when I do python -c "import mlpy; print(mlpy.__version__)".

Am I using the wrong gsl version? According to this error, if I understand correctly, I'm using a different Python interpreter for running mlpy than the one used to compile gsl.

Did you come across this issue before, how did you fix it?

Cheers, Shani.

Psy-Fer commented 5 years ago

ooooh yea I vaguely remember running into this issue on a machine. trying to remember what I did....

Psy-Fer commented 5 years ago

I think I compiled it from source? got lunch and some meetings. I'll come back to this later.

shaniAmare commented 5 years ago

Cool - thanks!

Hmm... however though, I'm wondering whether the same result for MotifSeq.py can be achieved by the module I mentioned above (i.e. dtw). If I assign the distance computing method as cityblock, which is equal to Manhattan that is used in dtw_subsequence in mlpy, will give the same result? Since you know your code better, I would really like to know what you think...

Thanks a lot, Shani.

Psy-Fer commented 5 years ago
void
subsequence(double *x, double *y, int n, int m, double *cost)
{
  int i, j;

  cost[0] = fabs(x[0]-y[0]);

  for (i=1; i<n; i++)
    cost[i*m] = fabs(x[i]-y[0]) + cost[(i-1)*m];

  for (j=1; j<m; j++)
    cost[j] = fabs(x[0]-y[j]); // subsequence variation: D(0,j) := c(x0, yj)

  for (i=1; i<n; i++)
    for (j=1; j<m; j++)
      cost[i*m+j] = fabs(x[i]-y[j]) +
    min3(cost[(i-1)*m+j], cost[(i-1)*m+(j-1)], cost[i*m+(j-1)]);

}

this is the subsequence algorithm (in C). If it's the same, then I don't see why not. Output may vary depending on how it's implemented, as will speed.

I'll have to have a look at the pierre-rouanet/dtw method and get back to you.

shaniAmare commented 5 years ago

That would be great! Thanks heaps!

There's a "faster" option as well as I saw - 'https://pypi.org/project/fastdtw/`

Cheers, Shani.

Psy-Fer commented 5 years ago

lol. yea, i'll tell you now, that stuff doesn't work for nanopore. We have internally tested this, and the ROC curves were basically random.

Abstraction CAN work, and there are people who are doing this for nanopore, so stay tuned, But it's a tricky beast to wrestle with.

shaniAmare commented 5 years ago

When you say "that stuff doesn't work for nanopore", are you referring to this fastdtw or both fastdtw and pierre-rouanet/dtw? And may I know, from your experience, why that is?

Below is the pierre-rouanet/dtw.dtw code btw:

def dtw(x, y, dist, warp=1, w=inf, s=1.0):
    """
    Computes Dynamic Time Warping (DTW) of two sequences.
    :param array x: N1*M array
    :param array y: N2*M array
    :param func dist: distance used as cost measure
    :param int warp: how many shifts are computed.
    :param int w: window size limiting the maximal distance between indices of matched entries |i,j|.
    :param float s: weight applied on off-diagonal moves of the path. As s gets larger, the warping path is increasingly biased towards the diagonal
    Returns the minimum distance, the cost matrix, the accumulated cost matrix, and the wrap path.
    """
    assert len(x)
    assert len(y)
    assert isinf(w) or (w >= abs(len(x) - len(y)))
    assert s > 0
    r, c = len(x), len(y)
    if not isinf(w):
        D0 = full((r + 1, c + 1), inf)
        for i in range(1, r + 1):
            D0[i, max(1, i - w):min(c + 1, i + w + 1)] = 0
        D0[0, 0] = 0
    else:
        D0 = zeros((r + 1, c + 1))
        D0[0, 1:] = inf
        D0[1:, 0] = inf
    D1 = D0[1:, 1:]  # view
    for i in range(r):
        for j in range(c):
            if (isinf(w) or (max(0, i - w) <= j <= min(c, i + w))):
                D1[i, j] = dist(x[i], y[j])
    C = D1.copy()
    jrange = range(c)
    for i in range(r):
        if not isinf(w):
            jrange = range(max(0, i - w), min(c, i + w + 1))
        for j in jrange:
            min_list = [D0[i, j]]
            for k in range(1, warp + 1):
                i_k = min(i + k, r)
                j_k = min(j + k, c)
                min_list += [D0[i_k, j] * s, D0[i, j_k] * s]
            D1[i, j] += min(min_list)
    if len(x) == 1:
        path = zeros(len(y)), range(len(y))
    elif len(y) == 1:
        path = range(len(x)), zeros(len(x))
    else:
        path = _traceback(D0)
    return D1[-1, -1] / sum(D1.shape), C, D1, path
Psy-Fer commented 5 years ago

yea that pierre-rouanet/dtw is a direct subsequence method and should work. Seems a little convoluted, but has a good amount of tunability in it. Windowing as well. leaving it as inf should wield similar results to mlpy, but this would need to be tested.

fastdtw uses subsampling to do a kind of banded alignment, and can very often lead to entirely spurious hits. In every example of that kind of abstraction technique for nanopore data, it just has not worked. I was first alerted to this by Matt Loose when he was investigating dtw for nanopore, and he had similar observations. We test this in another method we have been developing, where we tested a number of dtw approaches, and mlpy, or a method derived from it, is far superior.

Psy-Fer commented 5 years ago

Hey.

So in mlpy there is an install doc. It has the following

Installing on \*nix from source
-------------------------------

On GNU/Linux, OSX and FreeBSD you need the following requirements:

   * GCC
   * Python >= 2.6 or 3.X
   * NumPy >= 1.3.0 (with header files)
   * SciPy >= 0.7.0
   * GSL >= 1.11 (with header files)

From a terminal run:

.. code-block:: sh

   $ python setup.py install

If you don't have root access, installing mlpy in a directory by
specifying the ``--prefix`` argument. Then you need to set ``PYTHONPATH``:

.. code-block:: sh

   $ python setup.py install --prefix=/path/to/modules
   $ export PYTHONPATH=$PYTHONPATH:/path/to/modules/lib/python{version}/site-packages

If the GSL header files or shared library are in non-standard
locations on your system, use the ``--include-dirs`` and 
``--rpath`` options to ``build_ext``:

.. code-block:: sh

    $ python setup.py build_ext --include-dirs=/path/to/header --rpath=/path/to/lib
    $ python setup.py install

Not sure the gsl you installed is the GNU scientific library used in the C compilers.

the init.py file does a gsl import, but I'm pretty sure that's importing the local folder gsl in the mlpy directory. So that may have something to do with your path variables?

shaniAmare commented 5 years ago

Yes... using the build_ext option before installing worked! Knowing that was very helpful! Wondering whether this section also can be added to the instructions to install mlpy as well to help people like me! But for the moment, I'll have a step-by-step guide here for anyone in need (I hope not! :) ). I'm working on a Linux platform btw.

  1. load python 2.7 (for me it was a module on our cluster, so module load python)
  2. download mlpy 3.5.0 from "https://sourceforge.net/projects/mlpy/files/mlpy%203.5.0/"
  3. unzip the files tar -xvf mlpy-3.5.0.tar
  4. go into the extracted folder cd mlpy-3.5.0
  5. python setup.py build_ext --include-dirs=/path/to/header --rpath=/path/to/lib (for me the GSL headers were in /usr/include/gsl, and respective libraries were in /usr/lib.
  6. then install python setup.py install (I installed it for myself, so python setup.py install --user)
  7. optional: add the executables to PYTHONPATH (if using --prefix option export PYTHONPATH='${PYTHONPATH}:location/to/site-packages)
  8. Next is to see whether you can run MotifSeq.py without an issue :) which worked for me.

I hope I covered the summary from this thread here...

Also, you might already know this but, using mlpy/dtw is way faster than the other option I have mentioned above.

Thanks for helping throughout James. Shani.

Psy-Fer commented 5 years ago

Yay! I'm glad that worked out.

Thanks for the breakdown. I'll add it to the docs.

Also yes, mlpy/dtw is bullet fast (though can be faster, but that's another story).

Thanks for the back and forth. This is helpful for understanding the different environments of others, as well as what is, and isn't explained properly. That can be really hard sometimes. So thankyou!

Also, Heads up, new MotifSeq dropping soon. I brought scrappie into motifseq so it can convert the motif's from a fasta input, and changed up some of the visuals to help people inspect their hits better. Bit of benchmarking coming too in the docs.

Cheers, James

Psy-Fer commented 4 years ago

Hey @shaniAmare

You might like the latest update on the README about mlpy and python3

As well as some of the changes in the output and interface.

Regards, James