data-forge / data-forge-ts

The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
http://www.data-forge-js.com/
MIT License
1.34k stars 77 forks source link

fillGaps followed by rollingWindow can perform poorly #11

Closed cberthiaume closed 6 years ago

cberthiaume commented 6 years ago

Hi, I recently upgraded from data-forge-js to data-forge-ts and I'm noticing the following issue (not sure if it was the same in data-forge-js). I have a DataFrame that I want to call fillGaps on. However, the performance was surprisingly slow. After much debugging having believed it was due to something else in our application I believe I narrowed it down to data-forge-ts. Here is an example that should demonstrate the problem and my current workaround:

    const data = [
      { date: moment().toDate(), value: 1 },
      { date: moment().add(1, 'days').toDate(), value: 3 },
      { date: moment().add(5, 'days').toDate(), value: 2 },
      { date: moment().add(6, 'days').toDate(), value: 6 },
      { date: moment().add(7, 'days').toDate(), value: 5 },
      { date: moment().add(12, 'days').toDate(), value: 2 },
      { date: moment().add(15, 'days').toDate(), value: 9 },
    ];
    const df = new dataForge.DataFrame(data).setIndex('date');

    const gapExists = (pairA, pairB) => {
      // Return true if there is a gap longer than a day.
      // Log when we enter this function
      console.log('gapExists [pairA, pairB]', [pairA, pairB]);
      const startDate = pairA[1].date;
      const endDate = pairB[1].date;
      const gapSize = moment(endDate).startOf('day').diff(moment(startDate).startOf('day'), 'days');
      return gapSize > 1;
    };

    const gapFiller = (pairA, pairB) => {
      const startDate = pairA[1].date;
      const endDate = pairB[1].date;
      const gapSize = moment(endDate).startOf('day').diff(moment(startDate).startOf('day'), 'days');
      const numEntries = gapSize - 1;

      const newEntries = [];

      for (let entryIndex = 0; entryIndex < numEntries; entryIndex += 1) {
        const newValue = { date: pairA[1].date, value: pairA[1].value };
        newValue.date = moment(pairA[1].date).add(entryIndex + 1, 'days').toDate();

        newEntries.push([
          moment(pairA[0]).add(entryIndex + 1, 'days').toDate(), // New index
          newValue, // New value
        ]);
      }

      return newEntries;
    };

    // In this case, the final value of dfWithoutGaps is
    // what you'd expect and the number of entries into
    // gapExists is about what I'd expect.
    //
    // I say "about" what I'd expect because there are
    // two seemingly identical comparisons made between the
    // first two items of the data set for a total
    // execution count of 7 when I'd expect 6.
    const dfWithoutGaps = df.fillGaps(gapExists, gapFiller);
    console.log('dfWithoutGaps', dfWithoutGaps.toArray());

    // With the following line uncommented the rollingWindow
    // call below will trigger a huge number of calls to
    // gapExists (I'm not sure how many but enough to max out the
    // buffer in my browser's debug console). It appears as if
    // fillGaps is being run for each rolling window.
    //
    // The speed of generating the rolling window in this
    // scenario is very slow.
    // const mySeries = dfWithoutGaps.getSeries('value');

    // If we force the lazy evaluation via the line below and then
    // run rollingWindow we'll get the expected number of calls (6) to
    // gapExists.
    //
    // The speed of generating the rolling window in this scenario
    // is very fast.
    // https://github.com/data-forge/data-forge-ts/blob/master/docs/guide.md#lazy-evaluation-through-iterators
    const mySeries = new dataForge.Series(dfWithoutGaps.getSeries('value').toArray());

    const smaPeriod = 3;
    const smaSeries = mySeries
      .rollingWindow(smaPeriod)
      .select(window => window.sum() / smaPeriod);

    console.log('smaSeries', smaSeries.toArray());

It's late here so I could be off the rails somewhere. Does this look like an issue with data-forge-ts? If so, is there a better workaround than what I have here?

Thanks a lot in advance. This is a very helpful tool and I appreciate you creating it!

ashleydavis commented 6 years ago

Thanks so much logging the issue with detailed example! I'll investigate and let you know what I come up with.

ashleydavis commented 6 years ago

Hi there, I have your code up and running in a separate github repo:

https://github.com/data-forge/data-forge-performance-test-issue-11

Could you please do me a favor and run the code, then tell me how long it takes to execute. The elapsed milli-seconds will be printed at the end.

cberthiaume commented 6 years ago

Hi, sorry for the slow response. I was out of town. I'm getting this as the time output:

Time: 876.683117

After running it several times it stabilized around 790.

Switching the execution to the "slow" path by uncommenting line 69 and commenting line 79 I get:

Time: 1102.3936440000002

Thanks for looking into this!

cberthiaume commented 6 years ago

I noticed that the timing starts at the very beginning of the script. As an experiment to try and isolate how much time is spend in fillGaps and rollingWindow I moved the stopwatch.start(); call down just below the gapFiller function definition to line 49. Running the script now gives a time of 19 the fast way and 355 the slow way.

This is more indicative of the speed difference that I saw in practice for larger data sets.

ashleydavis commented 6 years ago

What sort of PC are you running it on?

I get around between 400-500ms per run.

ashleydavis commented 6 years ago

Can you please do another experiment and report back to me.

Please update your code then run.

node A.js

Then

node B.js

Report back to me the output from the script. I want to find out how long these script take to run compared to those that contain the Data-Forge code.

Thanks!

cberthiaume commented 6 years ago

I am using a Lenovo T530. Intel i7-3820QM 2.7GHz CPU. 12 GB RAM. Windows 10.

$ node A.js
Time [A]: 940.735355

$ node B.js
Hello
Time [B]: 2.3574950000000285
cberthiaume commented 6 years ago

After running node A.js a few times the time comes down to 790 or so.

ashleydavis commented 6 years ago

This is interesting. It means most of the time that is being spent in the original example is simply coming from the require statements.

If you have a look at A.js all it does is require data-forge and moment and yet it takes almost the same amount of time as the original example.

What this tells me is that I possibly need to pack (maybe something like browserify) the data-forge release so that it is super-fast to load from disk.

What do you think? Does this make sense?

cberthiaume commented 6 years ago

What do you think of this comment where I reported times that did not include the require statements? I agree that most of the time in all the examples is spent in the require statements but there does seem to be a meaningful difference after data-forge is required. Seems like maybe there are two issues. One is the rollingWindow/fillGaps issue and another is (maybe) the time spent requiring data-forge. The former is the one that is a problem in our use case though we can work around it as described.

Or am I misunderstanding something?

ashleydavis commented 6 years ago

I've tried what you said. I moved the start of the timer to after the require statements.

This reduces the time from about 440ms down to 39ms. So a huge difference and most of the time is in the require statements.

So maybe there's two problems, but I'm only seeing one with the current example code.

BTW what version of Node.js are you on. I'm on v8.9.4

cberthiaume commented 6 years ago

Node 8.11.3.

I guess I'm confused. To me it looks like there is a huge difference and it's all due to the performance difference between the documented way to do this vs my work around. The require statements happen on lines 5 and 6 and I do not start timing until line 49. Are you saying the require itself is lazy and doesn't happen until something uses what was required?

What about the different outputs showing, unless I am mistaken, how many more times fillGaps is called via the documented way vs my workaround?

I apologize, I feel like I must be misunderstanding something basic.

ashleydavis commented 6 years ago

I don't really understand what you are talking about with the documented way vs your workaround. Can you elaborate on that?

First though, to clear up any confusion I've made an updated to the performance test repo and I'd like you to run an experiment. Please get the latest code changes then run the following commands:

node index-1.js

Then

node index-2.js

Then tell me what timing you get for each.

Thanks for your patience!

cberthiaume commented 6 years ago

First, the timing for the new test:

node index-1.js
Time [1]: 1084.7819960000002

node index-2.js
Time [2]: 39.433487000000014

I've attached a diff containing two new files to add to this repo.

  1. method-1.js is the slow version that is based on my reading of the documentation for how to implement this. The essential part is line 58.
  2. method-2.js is my workaround to get the performance I'd expect. Again, the essential part is line 58.

Actually, the only difference between the two files is below. This is the difference I am trying to describe.

$ diff method-1.js method-2.js
20c20
<   ];
---
> ];
58c58
< const mySeries = dfWithoutGaps.getSeries('value');
---
> const mySeries = new dataForge.Series(dfWithoutGaps.getSeries('value').toArray());

The times I'm seeing (deleting the smaSeries output):

>node method-1.js
Time to require: 790.7083089999999
Time for everything else: 127.8764040000001

>node method-2.js
Time to require: 773.376429
Time for everything else: 20.6164839999999

As you can see, the require time is the same between the two. The time to compute the rolling window with filled gaps is significantly different.

diff.txt

The above example is for a data object with 7 elements. If I increase this to 27 items I get these times (deleting the smaSeries output):

node method-1.js
Time to require: 777.0008579999999
Time for everything else: 570.8348799999999

node method-2.js
Time to require: 765.3061339999999
Time for everything else: 31.385618999999906

Again we see a similar time to require but the time compute the rolling window and fill gaps is even more stark.

In our application we were working with a data object that had roughly 300 elements. The performance difference for the fast way is on the order of 70ms. The slow way is several minutes.

Here is the larger data object:

const data = [
    { date: moment().toDate(), value: 1 },
    { date: moment().add(1, 'days').toDate(), value: 3 },
    { date: moment().add(5, 'days').toDate(), value: 2 },
    { date: moment().add(6, 'days').toDate(), value: 6 },
    { date: moment().add(7, 'days').toDate(), value: 5 },
    { date: moment().add(12, 'days').toDate(), value: 2 },
    { date: moment().add(15, 'days').toDate(), value: 9 },
    { date: moment().add(16, 'days').toDate(), value: 9 },
    { date: moment().add(17, 'days').toDate(), value: 9 },
    { date: moment().add(18, 'days').toDate(), value: 9 },
    { date: moment().add(19, 'days').toDate(), value: 9 },
    { date: moment().add(30, 'days').toDate(), value: 9 },
    { date: moment().add(31, 'days').toDate(), value: 9 },
    { date: moment().add(32, 'days').toDate(), value: 9 },
    { date: moment().add(33, 'days').toDate(), value: 9 },
    { date: moment().add(34, 'days').toDate(), value: 9 },
    { date: moment().add(35, 'days').toDate(), value: 9 },
    { date: moment().add(36, 'days').toDate(), value: 9 },
    { date: moment().add(37, 'days').toDate(), value: 9 },
    { date: moment().add(38, 'days').toDate(), value: 9 },
    { date: moment().add(39, 'days').toDate(), value: 9 },
    { date: moment().add(40, 'days').toDate(), value: 9 },
    { date: moment().add(41, 'days').toDate(), value: 9 },
    { date: moment().add(42, 'days').toDate(), value: 9 },
    { date: moment().add(50, 'days').toDate(), value: 9 },
    { date: moment().add(51, 'days').toDate(), value: 9 },
    { date: moment().add(52, 'days').toDate(), value: 9 },
];
cberthiaume commented 6 years ago

Well, I don't know what this means but a colleague was unable to reproduce my findings. I repeated my tests and found my method-1.js to be slow. This is the one using const mySeries = dfWithoutGaps.getSeries('value');. I found my workaround to be fast. Same results as before. I ran it multiple times.

Next I did rm -rf node_modules followed by npm install and now my output looks like this:

node method-1.js
Time to require: 867.0699209999999
Time for everything else: 9.503340999999864

node method-2.js
Time to require: 771.9794939999999
Time for everything else: 15.113509000000022

That is, the documented way is faster than the workaround (as I'd expect if the documented way worked properly). I am using data-forge 1.2.3 and node version 8.11.3 and npm version 5.6.0. I have tried this both on my windows machine and in a linux VM and these new results hold. All I did was remove node_modules and reinstall. I don't know what to make of this.

Previously, another colleague and I both saw the performance issues on different machines running different operating systems so it wasn't a one off thing on my machine. ¯_(ツ)_/¯

Any ideas?

ashleydavis commented 6 years ago

Hey I was just going to have a look at your method-1 and method-2 files, but not sure where to find them.

I've added you as a collaborator to my test project: https://github.com/data-forge/data-forge-performance-test-issue-11

Can you please add your files to that repo and push so that I can run them for myself and then I check if I get similar results to you.

ashleydavis commented 6 years ago

@cberthiaume Can you please add your method-1 and method-2 files to the test repo?

cberthiaume commented 6 years ago

@ashleydavis, sorry for the slow response. I have tried pushing to that repo as well as trying to branch from it and push the branch and I get:

ERROR: Permission to data-forge/data-forge-performance-test-issue-11.git denied to cberthiaume.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

However, if you look up in the comment I posted about these files you'll see that I posted a diff that should let you recreate these files.

I've also attached the files below (with .txt appended because github won't allow uploading of a js file).

method-1.js.txt method-2.js.txt

ashleydavis commented 6 years ago

Are you using the right repo?

Make sure you are commiting changes to this one:

https://github.com/data-forge/data-forge-performance-test-issue-11

I invited you to be a collaborator on that repo and I can see in the settings that you haven't accepted the invite. If you accept the invite you should have write access to the repo.

cberthiaume commented 6 years ago

I missed the email with the invitation. Files are there now.

ashleydavis commented 6 years ago

Thanks!

ashleydavis commented 6 years ago

Ok these are the results I get:

node method-1.js Time to require: 5814.571552 Time for everything else: 5.444887999999992

node method-1.js Time to require: 425.20090700000003 Time for everything else: 5.6962610000000495

node method-2.js Time to require: 396.2507800000001 Time for everything else: 11.43570299999999

node method-2.js Time to require: 396.92059500000005 Time for everything else: 12.676636999999914

It seems to warm the cache so that subsequent requires are cheaper.

This is still telling me that I need to bundle Data-Forge so that it loads quickly!

I'm not seeing much difference in the performance, but method 2 is slower for everything else but that would make sense considering you appear to be extracting the series, baking it to a JavaScript array then creating a new series out of it.

I'm not sure I understand what your issue is. Method 2 is bound to be slower than method 1.

cberthiaume commented 6 years ago

Multiple users on multiple systems were seeing dramatically slower performance with method 1. Tens of milliseconds vs several minutes in our use case. That is what led to us creating method 2. I agree method 2 ought to be slower if method 1 is working correctly. I feel like we're somehow not understanding each other on this point.

I presented data above demonstrating this here and here.

Then, I reported here that after removing the node_modules directory and running npm install again the issue went away. I don't have an explanation for why anything changed other than I'm now not using the same data-forge code (or something it depends upon) despite using the same package.json (which means there is some configuration out there where this is a problem).

Ultimately, performance is as expected now so I'm fine to close this.

If I were the only person (or mine the only machine) that experience the issue I would say it was me being dumb. That we had multiple people on different machines and operating systems experience it suggests there's more to it than that. Regardless, the problem is no longer presenting itself.

Thanks for looking into this.

ashleydavis commented 6 years ago

Thanks for persisting and giving me so much information on this issue. I really wish I could have achieved something for you.

I'll close the issue now, but please reopen if the problem happens again or open a new issue for a different performance problem.

ashleydavis commented 6 years ago

Also feel free to dive in and contribute. If you see a problem fix it and submit a pull request!

cberthiaume commented 5 years ago

Hi @ashleydavis, I don't believe I have permission to reopen this issue but I'm am seeing the same behavior again. I just pushed up a change to our test repo. The changes are:

  1. Update package.json to use the most recent version of data-forge
  2. Slightly change the reported timings to make it more clear where the performance issue happens. Specifically, the slow down looks like it's coming out of the call to toPairs().

The tests I'm looking at are method-1.js and method-2.js. The only difference between them is:

$ diff method-1.js method-2.js
76c76
< const mySeries = dfWithoutGaps.getSeries('value');
---
> const mySeries = new dataForge.Series(dfWithoutGaps.getSeries('value').toArray());

Output from running the tests:

cberthiaume@slow-lane:~/data-forge-performance-test-issue-11$ node method-1.js
Time to require: 980.9740000000002
Time to create DataFrame and getSeries: 8.874000000000024
Time for rolling window: 0.08599999999978536
Time for toPairs: 2067.8320000000003
cberthiaume@slow-lane:~/data-forge-performance-test-issue-11$ node method-2.js
Time to require: 975.4
Time to create DataFrame and getSeries: 63.06100000000015
Time for rolling window: 0.10500000000001819
Time for toPairs: 17.16599999999994
cberthiaume@slow-lane:~/data-forge-performance-test-issue-11$

The key difference is the huge difference in time to call toPairs(). Our use case requires us to call toPairs() and the only way to get acceptable performance when doing that is to recreate the series as you see in the diff above. However, our needs have changed that the slow down required to implement this workaround is becoming a bottleneck. Is there a a better way to get good performance from toPairs() without using this workaround? Should I open a separate ticket to track this?

Thanks again for all your help.