JCMais / node-libcurl

libcurl bindings for Node.js
https://npmjs.org/package/node-libcurl
MIT License
666 stars 118 forks source link

`Curl` call wrapped within a Promise makes mocha not finishing test cases. #177

Closed sznowicki closed 4 years ago

sznowicki commented 5 years ago

Sorry if this is not a right place but I'm looping with a debugger for hours already and I think I am out of ideas of what can be wrong.

tldr; testing with mocha is highly unreliable, usually times out

Longer story

I'm using this package for a long time and first I'd like to thank all the contributors for amazing work you do. Thank you.

Today I decided to update my small library by bumping node-libcurl to latest version.

I have small regression/integration test and I found most of them (sometimes all) fail after upgrade with no good reason.

I spent hours trying to detect the issue in my code but I'm pretty sure there is some bug in this code.

I've made a smallest as possible test to prove that: https://github.com/sznowicki/uptime-check/blob/node-libcurl-2/test/node-libcurl-issue-test.js

In this file you'll find three test cases: curl pure implementation, curl wrapped in a promise and called in traditional way, same wrapped curl called in async/await way.

While pure implementation always succeeds (also tried placing it last), Promise.then always times out, async/await usually times out but sometimes is surprisingly green (slow though).

To be honest, I don't have a good idea why this package would cause this behaviour from mocha, but I did try many ways of testing that to exclude other possibilities.

It just feels like something either overrides mocha's "done/await" checks.

On way of testing was to resolve early, before Curl is involved and this made the tests finishing again (also included in the test cases I posted). And I think this is a very good reason to post this issue here and not in the mocha repo.

If you have any questions, or concerns I promise to be active in this issue.

If I missed something stupid and obvious, I'd like to apologise in advance.

PS: maybe it matters: node v10.2.1

EDIT:

JCMais commented 5 years ago

Hi, thanks for reporting this, I will take a look into that this weekend.

There were some version bumps on dependencies and the addon itself is now compiled statically. So there were many changes that could add odd behavior.

If possible, post the output of calling this:

console.log(Curl.getVersionInfoString())
console.log(Curl.VERSION_NUM)
sznowicki commented 5 years ago

Different machine so I post all details: macOS 10.14.4 node v10.10.0 console logs:

Protocols: dict, file, ftp, ftps, gopher, http, https, imap, imaps, ldap, ldaps, pop3, pop3s, rtsp, scp, sftp, smb, smbs, smtp, smtps, telnet, tftp
Features: AsynchDNS, IDN, IPv6, Largefile, NTLM, NTLM_WB, SSL, libz, brotli, TLS-SRP, HTTP2, UnixSockets, HTTPS-proxy
tnolet commented 5 years ago

I can confirm that I see the exact same behaviour running Jest tests.

JCMais commented 5 years ago

@sznowicki if possible also run Curl.getVersion(), I missed this one. It will give the version of libcurl and their dependencies.

sznowicki commented 5 years ago

Sure, sorry for a delay.


Protocols: dict, file, ftp, ftps, gopher, http, https, imap, imaps, ldap, ldaps, pop3, pop3s, rtsp, scp, sftp, smb, smbs, smtp, smtps, telnet, tftp
Features: AsynchDNS, IDN, IPv6, Largefile, NTLM, NTLM_WB, SSL, libz, brotli, TLS-SRP, HTTP2, UnixSockets, HTTPS-proxy
475137
libcurl/7.64.1 OpenSSL/1.1.1b zlib/1.2.11 brotli/1.0.7 libidn2/2.1.1 libssh2/1.8.2 nghttp2/1.34.0```
JCMais commented 5 years ago

There is something weird going on for sure, I just tested the file you posted and it completes after 8s if you increase mocha's timeout.

The strange is that the on end / error callback is called really fast, the delay is between the resolve() / reject() and the then / catch being triggered.

Interesting enough, this only affects the tests, there is no slowdown running the code directly.

I will try to find some time to profile it this week. The workaround, for now, is to increase the timeout for that specific test. You can do that by calling

this.timeout(10000)

Inside the describe.

sznowicki commented 5 years ago

thanks @JCMais for taking this issue seriously.

I'd say increasing timeout might work but since those kind of issues are sometimes symptoms of some serious stuff like overriding global variables I'll hold the upgrade on my end.

I've already checked out the code on my machine and when I have some more time I'd also try to look on this. Maybe I'll find something out.

as-com commented 5 years ago

I noticed this too! And this is happening outside of Mocha, on Windows 10, Node 12.5.0 within ts-node. My code is very similar to the test case posted above (wrapping libcurl with a promise), and the "end" event will be emitted, and the promise would not resolve until a second or two later.

Curiously, adding an event listener for data that console.logs the chunk "fixes" the issue:

curl.on("data", chunk => {
    console.log(chunk);
});

Doing anything other than console.logging the chunk causes the "fix" to not work, including looping over the buffer and printing the sum of its bytes, console.log(chunk.toString()), or even console.log(util.inspect(chunk)). It must be console.log(chunk), console.error(chunk), or some function that eventually passes the chunk directly into console.log/console.error. πŸ€”

as-com commented 5 years ago

I spoke too soon - the console.log "fix" only works when it is running in the WebStorm debugger. I suspect there is a really subtle issue with the way node-libcurl is interacting with the event loop.

Update:

This is turning out to be very messy to debug - placing breakpoints within the end handler causes the issue to go away.

Update:

In the same program, there is a ReconnectingWebSocket. Commenting out the instantiation of ReconnectingWebSocket fixes the issue.

With the ReconnectingWebSocket instantiated with a URL that does not have a server running, it looks like there is something blocking the event loop:

curl.on("end", (statusCode, body) => {
    console.log("END", Date.now());

    curl.close();
    console.log("CLOSE", Date.now());
    process.nextTick(() => {
        console.log("TICK", Date.now());
        resolve({
            statusCode,
            body: body as string,
            // headers
        });
    })
})
END 1561930626028
CLOSE 1561930626030
TICK 1561930628715
RESOLVE 1561930628715

Update:

Replacing the ReconnectingWebSocket with a normal ws WebSocket:

END 1561930740436
CLOSE 1561930740437
TICK 1561930741973
Error: connect ECONNREFUSED 127.0.0.1:3000
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1056:14)

Update:

With a real WebSocket server running:

Connected to WebSocket server
END 1561930916999
CLOSE 1561930917000
TICK 1561930920992
RESOLVE 1561930920992
1561930920992
END 1561930921603
CLOSE 1561930921603
TICK 1561930922824
RESOLVE 1561930922824
1561930922824

Update:

When I drop setInterval(() => {}, 10); into the file, the code works as expected:

Connected to WebSocket server
END 1561931032224
CLOSE 1561931032224
TICK 1561931032230
RESOLVE 1561931032230
1561931032230
END 1561931032534
CLOSE 1561931032534
TICK 1561931032540
RESOLVE 1561931032540
1561931032540

And with an interval of 500 ms:

Connected to WebSocket server
END 1561931103926
CLOSE 1561931103927
TICK 1561931104332
RESOLVE 1561931104332
1561931104332
END 1561931104490
CLOSE 1561931104490
TICK 1561931104832
RESOLVE 1561931104832
1561931104832

So the issue is that the event loop simply isn't ticking properly under certain conditions.

Update:

Wrapping the resolve in setImmediate actually fixes the issue:

        curl.on("end", (statusCode, body) => {
            console.log("END", Date.now());

            curl.close();
            console.log("CLOSE", Date.now());
            setImmediate(() => {
                console.log("TICK", Date.now());
                resolve({
                    statusCode,
                    body: body as string,
                    // headers
                });
            })
        });
END 1561931280096
CLOSE 1561931280097
TICK 1561931280097
RESOLVE 1561931280098
1561931280098
END 1561931280271
CLOSE 1561931280271
TICK 1561931280272
RESOLVE 1561931280272
1561931280272

Explanation of process.nextTick vs setImmediate: https://nodejs.org/en/docs/guides/event-loop-timers-and-nexttick/

as-com commented 5 years ago

For those following along by email: the comment above has been updated with a band-aid solution to this bug.

tl;dr: wrap your resolve call in setImmediate.

sznowicki commented 5 years ago

Thanks @as-com for additional research. I'll still wait though until we know what's really the cause of this issue since it seems to be very mysterious.

tnolet commented 5 years ago

@as-com this is awesome and shows we have a true Heisenbug!

JCMais commented 5 years ago

@as-com thanks for investing some time to debug this issue.

I still have not found time to look into it myself, sorry.

For anyone trying to debug this more, something that would be interesting to try is to check if there is any slowdown on the native side. A really basic way to do that is just printing something from those two functions: https://github.com/JCMais/node-libcurl/blob/8eb18a8a1029a46f281801fb9b88d2bb751264de/src/Multi.cc#L141-L177 https://github.com/JCMais/node-libcurl/blob/8eb18a8a1029a46f281801fb9b88d2bb751264de/src/Multi.cc#L180-L199

And from this one on the JS side: https://github.com/JCMais/node-libcurl/blob/8eb18a8a1029a46f281801fb9b88d2bb751264de/lib/Curl.ts#L68-L83

The log inside the onMessage above should immediately follow the call to the OnSocket above.

sznowicki commented 5 years ago

Did what you suggested. Here's the code: https://github.com/JCMais/node-libcurl/compare/develop...sznowicki:177-debug?expand=1

I'm not sure how to analyse those but here's the console output:

szymon@MacBook-Air-Szymon uptime-check (node-libcurl-2) $ npm t

> uptime-check@2.0.0 test /Users/szymon/dev/uptime-check
> mocha --recursive

  Promisified curl call test
Multi::OnSocket started 
Multi will processMessages 
Multi::OnSocket started 
Multi will processMessages 
Multi::OnSocket started 
Multi will processMessages 
Multi::OnSocket started 
Multi will processMessages 
Multi::OnSocket started 
Multi will processMessages 
Multi::OnSocket started 
Multi will processMessages 
Multi::OnSocket started 
Multi will processMessages 
multiHandle.onMessage starts!
PURE DONE
    βœ“ should success, pure curl (438ms)
Multi::OnSocket started 
Multi will processMessages 
Multi::OnSocket started 
Multi will processMessages 
multiHandle.onMessage starts!
END END END, now resolve
    1) should resolve
TEST DONE
Multi::OnSocket started 
Multi will processMessages 
multiHandle.onMessage starts!
END END END, now resolve
    2) should resolve (async/await way)
ASYNC/AWAIT DONE

  1 passing (4s)
  2 failing

  1) Promisified curl call test
       should resolve:
     Error: Timeout of 2000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/Users/szymon/dev/uptime-check/test/node-libcurl-issue-test.js)

  2) Promisified curl call test
       should resolve (async/await way):
     Error: Timeout of 2000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/Users/szymon/dev/uptime-check/test/node-libcurl-issue-test.js)

npm ERR! Test failed.  See above for more details.
sznowicki commented 5 years ago

I'm checking it again and what's interesting is that changing my test code like this:

 curl.on('end', () => {
    console.log('END END END, now resolve');
    setTimeout(() => resolve(), 0);
    // resolve()
  });

Makes the tests green again. Which is really interesting.

JCMais commented 5 years ago

Know what is even more interesting? You don't need to wrap the resolve/reject in a setTimeout at all, just this will make it resolve instantly:

  curl.on('end', () => {
    console.log('END END END, now resolve');
    resolve()
    setTimeout(() => {}, 0) // or setImmediate(() => {})
  });

πŸ€”

I am starting to think this is a really weird async iteration with Mocha.

JCMais commented 5 years ago

Alright, I am pretty sure this is a Mocha issue, I just created a test case using Jest and it worked correctly, and running the code directly does not show any slowdown.

I've opened an issue on the Mocha repo

as-com commented 5 years ago

I don't think this is specific to Mocha, it happens when there is a ws WebSocket client in the same process. There must be some weird thing going on with the way node-libcurl is interacting with the event loop.

JCMais commented 5 years ago

@as-com can you post some code that reproduces it without using mocha?

as-com commented 5 years ago

@JCMais Turns out no WebSocket is even required. Here is some code that reproduces the issue:

import {Curl, CurlFeature} from "node-libcurl";

interface CurlHttpResponse {
    statusCode: number,
    body: string,
    // headers: Buffer | HeaderInfo[]
}

setTimeout(() => {}, 5000);

const defaultCurl = new Curl();
defaultCurl.setOpt(Curl.option.SSL_VERIFYPEER, false);

function httpRequestComplicated(url: string): Promise<CurlHttpResponse> {
    return new Promise((resolve, reject) => {
        const curl = defaultCurl.dupHandle(false);
        curl.setOpt(Curl.option.URL, url);

        curl.enable(CurlFeature.NoHeaderStorage);
        curl.enable(CurlFeature.NoHeaderParsing);

        curl.on("error", (error, errorCode) => {
            curl.close();
            // @ts-ignore
            error.errorCode = errorCode;
            reject(error);
        });
        curl.on("end", (statusCode, body) => {
            console.log("END", Date.now());
            curl.close();
            resolve({
                statusCode,
                body: body as string,
                // headers
            });
            // console.log("CLOSE", Date.now());
            // setImmediate(() => {
            //     // console.log("TICK", Date.now());
            //
            // })
        });

        curl.perform();
    });
}

console.log("START", Date.now());
httpRequestComplicated("https://icanhazip.com").then(x => {
    console.log(x, Date.now())
}).catch(console.error);

(sorry about the mess, I just quickly ripped it out of another codebase)

JCMais commented 5 years ago

Thanks @as-com, looks like the close method is in some way blocking the event loop, but the call to close itself does not cause any delay:

START 2019-07-28T18:21:58.241Z
END - before close 2019-07-28T18:21:58.316Z
END - after close 2019-07-28T18:21:58.318Z
RESOLVED 2019-07-28T18:22:03.241Z

Wrapping the call to curl.close() in a setImmediate makes the promise resolve without any delay. Commenting the setTimeout you added makes the problem disappear, and even more weird looks like the promise is only resolved after the timeout finishes.

The original issue on the question is still weird, because it was not calling close at all, and seems to run correctly on Jest and directly. πŸ€”

I will try to investigate it further

Edit: There is no need to wrap curl.close() in a setImmediate, even just calling setImmediate(() => {}) will make it resolve faster

sznowicki commented 5 years ago

I'm nowhere near to fully understand what the C code does here, but since it's a nice challenge I'd ask:

What this thing does in Multi::Dispose?

uv_timer_stop(this->timeout.get());
JCMais commented 5 years ago

This stops the libuv timer so it does not call their callback anymore. It's not safe to let it call the callback after the Multi instance is disposed of because the memory pointed by the data passed to it will not be valid anymore, see:

https://github.com/JCMais/node-libcurl/blob/4a1bde612318d3374c19c56883965fe8d43e5e77/src/Multi.cc#L134

and: https://github.com/JCMais/node-libcurl/blob/4a1bde612318d3374c19c56883965fe8d43e5e77/src/Multi.cc#L180

This is all based on the libcurl usage in an event-based way, see the following links for more details: https://curl.haxx.se/libcurl/c/libcurl-multi.html (MULTI_SOCKET section) https://curl.haxx.se/libcurl/c/curl_multi_socket_action.html

sznowicki commented 4 years ago

@as-com did you maybe find something? Is this issue causing you any troubles on production?

as-com commented 4 years ago

@sznowicki I am continuing to use the above stated workaround in production.

JCMais commented 4 years ago

I've tried again, with no luck, to debug this issue. Opened an issue on nodejs/help just in case someone else has a better idea on what might be causing this: https://github.com/nodejs/help/issues/2586

sznowicki commented 4 years ago

Well written question there. Thanks a lot for keeping track of this issue. Highly appreciated.

JCMais commented 4 years ago

296 days later this has been finally fixed πŸŽ‰πŸ₯³, for details see the issue above.

The fix is available on the prerelease version 2.1.0-4, which can be installed with:

yarn add node-libcurl@next

If everything works correctly, I will release the 2.1.0 version at the start of next week, I will close this issue when I do that.

tnolet commented 4 years ago

@JCMais this is awesome. We really want to move to node-libcurl at https://checklyhq.com but this was a big blocker

sznowicki commented 4 years ago

@JCMais big warm thank you for getting this done. Now it's time for me to finally migrate to @2 with my uptime-checker module.

Thank you!

JCMais commented 4 years ago

Closing this as v2.1.0 (see release page for full changelog) has been finally released. πŸŽ‰