Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
493 stars 162 forks source link

Delay in execution in multiple calls #157

Closed Leandroglez39 closed 3 years ago

Leandroglez39 commented 3 years ago

Describe the bug When executing multiple times there is a delay in execution, which in large cases becomes impossible to compute.

To Reproduce `

include`

include

include

include

include

include "edlib.h"

using namespace std;

int main() {

string s1 = "AGTCAAAAGTCCCCCCAGTAGCT";
string s2 = "AGGGGGTTTTCCCCAAAAAA";

    for (size_t j = 0; j < 5000; i++)
    {

        clock_t start = clock();

        EdlibAlignResult result = edlibAlign(s1.c_str(), s1.size(), s2.c_str(), s2.size(),
            edlibNewAlignConfig(-1, EDLIB_MODE_NW, EDLIB_TASK_PATH, nullptr, 0));

        edlibFreeAlignResult(result);

        std::cout << static_cast<double>(clock() - start) / static_cast<double>(CLOCKS_PER_SEC) << " seconds." << endl;
    }

}

Expected behavior The execution time of this operation should remain constant, but it is not. time .

Environment (please complete the following information):

Martinsos commented 3 years ago

Hi @Leandroglez39 !

What do you mean by saying "in large cases becomes impossible to compute"? Impossible because the delay is too small? Or too big? Or smth else? Pls provide more details on this!

With these short times I see from the image you attached, I would bet on jitter coming from other things, and not from Edlib.

However, this period of 7 runs is unusual. Only thing I can think of is if there is memory leak in edlib that accumulates and somehow gets cleaned up at that point.

I can't easily run this stuff in Windows, so I would suggest following:

  1. Could you report time in micro or nano seconds so we see more precise times?
  2. Could you somehow see memory usage through time?
  3. I am interested in what happens when you run calculation with longer sequences.

Thanks! Martin

Leandroglez39 commented 3 years ago

I'll be a little more specific, previously I told you that I work in multiple sequence alignment (MSA). It has been shown that this is an NP-Complete problem, even with heuristics it has a minimum order O (n ^ 2). An average analysis is a volume of 4000 sequences. Thus:

time nano

4000 4000 0.001 = 16,000 sec -> 4.4 h

However, this period of 7 runs is unusual. Only thing I can think of is if there is memory leak in edlib that accumulates and somehow gets cleaned up at that point.

In response to this, it is not a pattern maintained over time, it is an almost obvious memory leak therefore.

It may be a windows or visual studio problem so I offer the following proposal:

Run on linux follow it:

string s1 = "AGTCAAAAGTCCCCCCAGTAGCT"; string s2 = "AGGGGGTTTTCCCCAAAAAA";

for (size_t j = 0; j < 4000; j++) for (size_t i = 0; i < 4000; i++) {

EdlibAlignResult result = edlibAlign(s1.c_str(), s1.size(), s2.c_str(), s2.size(), edlibNewAlignConfig(-1, EDLIB_MODE_NW, EDLIB_TASK_PATH, nullptr, 0));

edlibFreeAlignResult(result); }

So check the problem. Thank you

El sáb., 8 ago. 2020 a las 10:54, Martin Šošić (notifications@github.com) escribió:

Hi @Leandroglez39 https://github.com/Leandroglez39 !

What do you mean by saying "in large cases becomes impossible to compute"? Impossible because the delay is too small? Or too big? Or smth else? Pls provide more details on this!

With these short times I see from the image you attached, I would bet on jitter coming from other things, and not from Edlib.

However, this period of 7 runs is unusual. Only thing I can think of is if there is memory leak in edlib that accumulates and somehow gets cleaned up at that point.

I can't easily run this stuff in Windows, so I would suggest following:

  1. Could you report time in micro or nano seconds so we see more precise times?
  2. Could you somehow see memory usage through time?
  3. I am interested in what happens when you run calculation with longer sequences.

Thanks! Martin

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Martinsos/edlib/issues/157#issuecomment-670938129, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKRREOGJEMKNSINTHWXSRFDR7VRKZANCNFSM4PX4O2AA .

Martinsos commented 3 years ago

Thank you for sharing more information :)! Ok hm, I would certainly like to identify the bug if there is one, and as you said, there has been this other report https://github.com/Martinsos/edlib/issues/155 .

Main thing is, both in this issue and in this other one, that it is not clear where is problem coming from, what is causing it, or if there even is a problem hm (although it does seem like there is one). Problem is that I don't have Windows environment set up to run Edlib and test it, and I was not able to reproduce the problem so far on Linux. I could eventually set up Win env to do this, but I did not yet find the time to do it.

I suggest following:

  1. I will try running this on Linux and seeing if there is same problem happening. However, I am currently on vacation, so it will probably take me about a week to get to it.
  2. See if you can get more information on the error, for example how is memory behaving and possibly even where is memory leak coming from.
  3. I could set up Win env and run this in it, but I can't promise when I will get to that since it requires significantly more time from my side.

p.s. I did not quite understand, is the anomaly/delay behaving the same way if you use longer sequences?

Btw, this problem you are solving -> 4000 x 4000 -> there is part of calculation, building the Peq table, that could be reused between the calculations that share the same query and probably significantly reduce the time of calculation, however there is no API currently for that. So that might be interesting feature request, that would help with the use cases like yours.

Leandroglez39 commented 3 years ago

As for your question about whether it occurs with longer sequences, the answer is YES ... What I tried to refer to was that there is no pattern, for example:

1- Not every number of fixed interactions occurs. 2- The delay time is not constant.

On the other hand, it would be interesting to talk about the Small table ...

Enjoy your vacation, then we continue the exchange.

Martinsos commented 3 years ago

@Leandroglez39 I tried running this on Linux, and this is what I got:

I tried sequences with sizes from 20 up to 10000 and times were fairly consistent -> sure, there are some oscillations, but nothing major that I would consider to be problematic or suspicious. However, times were bigger than the ones you have, they were about 0.000x seconds, not 0.00000x seconds as in your example -> I am guessing my machine is slower than yours.

I then tried running edlib with very small sequences, one of size 5, another of size 3, and I got the same pattern as you have: 0.000075 for some time, then suddenly 0.000120 a few times, and then back to 0.000075 seconds.

I run valgrind --tool=memcheck on all of these, and there are no memory leaks.

My conclusion from this is: there are no memory leaks on Linux, but I managed to reproduce the thing when times are very small. But I am pretty sure in this is case it is not something we should worry about in Edlib, times are just so short that this happens, I don't see it is a problem, and we identified it is not caused by a memory leak. It could be jitter in the clock itself, or maybe it took a little bit longer to allocate some memory, or CPU was busy with something else, who knows hm, times are just too short to measure and expect same number every time.

So, I would say problem does not exist, BUT, you said you are also having this problem with longer sequences! Could you tell me how big were they, and again display the times? If this is happening for you for bigger sequences, that have longer times, then we should certainly continue investigating it. I however could not reproduce the problem for bigger sequences.

You can see the code I used here: https://github.com/Martinsos/edlib/blob/issue-157-test/apps/hello-world/helloWorld.c .

Martinsos commented 3 years ago

Btw I opened this issue as feature request for 1 vs N and N vs N use case, and I also added reference to this issue there now: https://github.com/Martinsos/edlib/issues/118 .

Leandroglez39 commented 3 years ago

Thanks anyway, it seems that for the moment we will not be able to detect what is happening to me.

What you propose feature request for 1 vs N and N vs N would be very helpful for the type of work I am trying to do.

I will continue investigating, if I find any other clue about why it is delayed in those cases I will let you know. Thank you

Martinsos commented 3 years ago

@Leandroglez39 What about those longer sequences you mentioned? Could you create longer sequences so that you have longer times, so we can see then if this is still happening? This should be easy to test and I believe it will answer our question. Because these measurements so far, I am pretty sure they are just normal jitter due to measuring such short times -> that is why normally you never measure such short stuff directly, instead you measure a large amount of executions/calls and then take the average, when single execution time is so short.

I will be closing this one for now then, but if you do measurements with longer sequences, pls post the results here!

Leandroglez39 commented 3 years ago

Greetings in advance.

In my case the sequences are not very long, anyway I do the experiments and send the results.

Thank you

El dom., 16 ago. 2020 a las 7:36, Martin Šošić (notifications@github.com) escribió:

Closed #157 https://github.com/Martinsos/edlib/issues/157.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Martinsos/edlib/issues/157#event-3659245356, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKRREOG4HVGMR323IRLEAWTSA7ACXANCNFSM4PX4O2AA .