Kurento / bugtracker

[ARCHIVED] Contents migrated to monorepo: https://github.com/Kurento/kurento
46 stars 10 forks source link

KMS Client DEADLOCK - Node JS ends up blocking when using Kurento libs alongside HTTP requests (tried Axios / node-fetch) #192

Open virtualfunction opened 7 years ago

virtualfunction commented 7 years ago

KMS Version: (In Docker with --net=host) Module: 'core' version '6.6.3' Module: 'elements' version '6.6.3' Module: 'filters' version '6.7.0~1.g0314843'

Other libraries versions:

Paste output of executing dpkg -l | egrep -i "kurento|gst.*1.5|nice" in the machine where KMS is running

ii  gstreamer1.5-libav:amd64             1.8.2.1.xenial~20170725171352.96.493eee4  amd64        libav plugin for GStreamer
ii  gstreamer1.5-nice:amd64              0.1.13.1.xenial~20170725160546.81.eebfdab amd64        ICE library (GStreamer plugin)
ii  gstreamer1.5-plugins-bad:amd64       1.8.1.1.xenial~20170725164047.100.3db37b1 amd64        GStreamer plugins from the "bad" set
ii  gstreamer1.5-plugins-base:amd64      1.8.1.1.xenial~20170725154709.55.7b19cfd  amd64        GStreamer plugins from the "base" set
ii  gstreamer1.5-plugins-good:amd64      1.8.1.1.xenial~20170725161537.112.9ee4248 amd64        GStreamer plugins from the "good" set
ii  gstreamer1.5-plugins-ugly:amd64      1.8.1.1.xenial~20170725170621.89.2685b0f  amd64        GStreamer plugins from the "ugly" set
ii  gstreamer1.5-pulseaudio:amd64        1.8.1.1.xenial~20170725161537.112.9ee4248 amd64        GStreamer plugin for PulseAudio
ii  gstreamer1.5-x:amd64                 1.8.1.1.xenial~20170725154709.55.7b19cfd  amd64        GStreamer plugins for X11 and Pango
ii  kms-core-6.0                         6.6.3.xenial.20170811182512.2704d13       amd64        Kurento core module
ii  kms-elements-6.0                     6.6.3.xenial.20170812234356.9e3fff4       amd64        Kurento elements module
ii  kms-filters-6.0                      6.7.0.xenial~20170813001324.1.0314843     amd64        Kurento filters module
ii  kms-jsonrpc-1.0                      1.2.0.xenial~20170811181902.1.8cee9d8     amd64        Kurento jsonrpc library
ii  kmsjsoncpp                           1.6.3.xenial.20170725151047.d78deb7       amd64        Kurento jsoncpp library
ii  kurento-media-server-6.0             6.7.0.xenial~20170813003033.1.6ebaa27     amd64        Kurento Media Server
ii  libgstreamer-plugins-bad1.5-0:amd64  1.8.1.1.xenial~20170725164047.100.3db37b1 amd64        GStreamer development files for libraries from the "bad" set
ii  libgstreamer-plugins-base1.5-0:amd64 1.8.1.1.xenial~20170725154709.55.7b19cfd  amd64        GStreamer libraries from the "base" set
ii  libgstreamer1.5-0:amd64              1.8.1.1.xenial~20170725152356.170.0d6031b amd64        Core GStreamer libraries and elements
ii  libnice10:amd64                      0.1.13.1.xenial~20170725160546.81.eebfdab amd64        ICE library (shared library)

Client libraries

Browsers tested Add OK or FAIL, along with the version, after browsers where you have tested this issue:

System description: Please describe your setup (where is KMS located, where are the clients, STUN, TURN...) Local dev setup - using public STUN.

What steps will reproduce the problem?

  1. Tear out some hair debugging what you assume is a bug in your own code ;-)
  2. Setup one2many project (Currently have this in an Alpine container on Node 8.4, kms in the latest docker container with --net=host, and coturn also with --net=host) using docker-compose but have the same issue when running one2many directly on the host
  3. Wrap startPresenter(sessionId, ws, message.sdpOffer, function(error, sdpAnswer) { ... } code with a HTTP request. I've used axios.get('https://google.com', function (response) { ... } but this also fails with similar results using request_promise / node-fetch

What is the expected result? Remote HTTP request is made and startPresenter to setup WebRTC pipelines / trickle ICE and the publishing to happen

What happens instead? Node goes into deadlock after creating the WebRTC pipeline. Fails to execute processOffer. I believe the axios request is doing something internally such that the JSON RPC calls for ICE and ProcessOffer are getting blocked. However this is an internal Node deadlock. I'm not a Node programmer by trade so knowledge on the topic (other than the standard "make sure you do everything in a non blocking async manner, use Promises properly with .catch etc") is limited when it comes to the internals of Node / V8 / libuv scheduling.

It looks like the Axios request gets made successfully and called back as does the first create pipeline call, but everything after fails due to deadlock so things seem to lock up after a short period of time. It seems deterministic in nature in regard that it always stops at the same point (i.e. even the ICE negotiation / queuing get blocked despite the async nature).

Does it happen with one of the tutorials? Yes. I've used the one2many tutorial as my test case :-)

Please provide any additional information below. When the tutorials are run normally this doesn't happen. If I make a separate project and do HTTP requests like this I have no issue, so there should be no reason why this happens here.

virtualfunction commented 7 years ago

Running strace seems to confirm some sort of FUTEX deadlock

I seem to get 100% on one of my cores and this is repeat

mmap(0x19c7e9d1000, 1048576, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x19c7e9d1000
munmap(0x19c7e9d1000, 192512)           = 0
munmap(0x19c7ea80000, 331776)           = 0
mmap(0x19c7ea00000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x19c7ea00000
mmap(0x27061d8af000, 1048576, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x27061d8af000
munmap(0x27061d8af000, 331776)          = 0
munmap(0x27061d980000, 192512)          = 0
mmap(0x27061d900000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x27061d900000
write(8, "\1\0\0\0\0\0\0\0", 8)         = 8
getpid()                                = 1
mmap(0xd650398e000, 1048576, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0xd650398e000
munmap(0xd650398e000, 466944)           = 0
munmap(0xd6503a80000, 57344)            = 0
mmap(0xd6503a00000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xd6503a00000
futex(0x7f0bcded79e4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f0bcdef3514, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x555b208faa64, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f0bcdeee9e4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f0bcdef3514, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x555b208faa64, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f0bcdec09e4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f0bcdef3514, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x555b208faa64, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f0bcdea99e4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f0bcdef3514, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x555b208faa64, FUTEX_WAKE_PRIVATE, 1) = 1
mmap(0x150e4e8c000, 1048576, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x150e4e8c000
munmap(0x150e4e8c000, 475136)           = 0
munmap(0x150e4f80000, 49152)            = 0
mmap(0x150e4f00000, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x150e4f00000
futex(0x7f0bcded79e4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f0bcdef3514, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x555b208faa64, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f0bcdeee9e4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f0bcdef3514, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x555b208faa64, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f0bcdec09e4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f0bcdef3514, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x555b208faa64, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f0bcdea99e4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f0bcdef3514, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x555b208faa64, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f0bcded79e4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f0bcdef3514, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x555b208faa64, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f0bcdeee9e4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f0bcdef3514, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x555b208faa64, FUTEX_WAKE_PRIVATE, 1) = 1

file descriptor 8 seems I assume is something to do with the libuv even loop as ls -la /proc/1/fd yields

lr-x------    1 root     root            64 Sep 19 16:13 0 -> pipe:[2473993]
l-wx------    1 root     root            64 Sep 19 16:13 1 -> pipe:[2473994]
lrwx------    1 root     root            64 Sep 19 16:34 10 -> socket:[2471502]
lrwx------    1 root     root            64 Sep 19 16:34 11 -> socket:[2472511]
lrwx------    1 root     root            64 Sep 19 16:34 14 -> socket:[2470840]
lrwx------    1 root     root            64 Sep 19 16:34 15 -> socket:[2472528]
lrwx------    1 root     root            64 Sep 19 16:34 16 -> socket:[2472530]
lrwx------    1 root     root            64 Sep 19 16:34 17 -> socket:[2472548]
l-wx------    1 root     root            64 Sep 19 16:13 2 -> pipe:[2473995]
lr-x------    1 root     root            64 Sep 19 16:13 3 -> pipe:[2471499]
l-wx------    1 root     root            64 Sep 19 16:13 4 -> pipe:[2471499]
lrwx------    1 root     root            64 Sep 19 16:13 5 -> anon_inode:[eventpoll]
lr-x------    1 root     root            64 Sep 19 16:13 6 -> pipe:[2471500]
l-wx------    1 root     root            64 Sep 19 16:13 7 -> pipe:[2471500]
lrwx------    1 root     root            64 Sep 19 16:34 8 -> anon_inode:[eventfd]
lr-x------    1 root     root            64 Sep 19 16:34 9 -> /dev/null

Interestingly express seems to respond still to basic HTTP requests, my Kurento request is stuck though. Running Node dev tools shows that MediaObject.then is the point of the deadlock.

I get the same issue on Node 6.x

micaelgallego commented 7 years ago

Hi virtualfunction,

The new Kurento team is now very small (after Twilio adquisition of most of developers) and we are very busy with other issues. Moreover, we don't have any experienced Node developer in our team. For that reasons, we will not be able to solve your issue in the near future.

If you can solve the issue yourself, we will be glad to accept your Pull Request.

tradersquawkadmin commented 7 years ago

I will have at trying to find the cause by my knowledge of node is limited, as is my time.

If there is anyone who's able to fix this, I'm willing to negotiate a bounty price on this matter.

Oh just realised I'm logged in under another account rather than my virtualfunction account.

piranna commented 7 years ago

Hi @virtualfunction, I'm the original Kurento Node.js APIs designer and developer and I'm available for hire. Can you be able to bring me more info about your issue here or by email, so we can start working on a stimation?

virtualfunction commented 7 years ago

@piranna - Hi, thanks for getting back, I'll PM / get in contact early next week so we can take things forward

virtualfunction commented 7 years ago

@piranna - Just pinged you an email - I've got a private repo (it's private for now just as I've used my own TLS certs) which you should have access with docker-compose wired up to replicate this.

virtualfunction commented 6 years ago

@piranna - I've sent you a couple of emails (24/10/17 15:35 Spanish time was the last one) .... guessing they've been diverted or lost.... Let me know if you have them. If you're a bit busy (or on holiday!) that's fine, let know know so I can at least echo that back.

huaxiaozhong1 commented 6 years ago

@virtualfunction -- At first, may I get your confirmation that "kurento-tutorial-node/kurento-one2many-call" will be blocked when it runs into "webRtcEndpoint.processOffer(sdpOffer, function(error, sdpAnswer)" in server.js?

On my side, I have just added 4 console.log(...) into the funcition startProcenter(..) as below.

function startPresenter(sessionId, ws, sdpOffer, callback) { clearCandidatesQueue(sessionId);

if (presenter !== null) {
    stop(sessionId);
    return callback("Another user is currently acting as presenter. Try again later ...");
}

console.log(" "); console.log("==== Has entered startPresenter: " + sessionId + " ===="); console.log(" ")

presenter = {
    id : sessionId,
    pipeline : null,
    webRtcEndpoint : null
}

getKurentoClient(function(error, kurentoClient) {
    if (error) {
        stop(sessionId);
        return callback(error);
    }

    if (presenter === null) {
        stop(sessionId);
        return callback(noPresenterMessage);
    }

    kurentoClient.create('MediaPipeline', function(error, pipeline) {
        if (error) {
            stop(sessionId);
            return callback(error);
        }

        if (presenter === null) {
            stop(sessionId);
            return callback(noPresenterMessage);
        }

  console.log("    ");
  console.log("==== MediaPipeline is created successully: " + pipeline + " ====");
  console.log("    ")

        presenter.pipeline = pipeline;
        pipeline.create('WebRtcEndpoint', function(error, webRtcEndpoint) {
            if (error) {
                stop(sessionId);
                return callback(error);
            }

            if (presenter === null) {
                stop(sessionId);
                return callback(noPresenterMessage);
            }

    console.log("    ");
    console.log("==== webRtcEndpoint is created successully: " + webRtcEndpoint + " ====");
    console.log("    ")

            presenter.webRtcEndpoint = webRtcEndpoint;

            if (candidatesQueue[sessionId]) {
                while(candidatesQueue[sessionId].length) {
                    var candidate = candidatesQueue[sessionId].shift();
                    webRtcEndpoint.addIceCandidate(candidate);
                }
            }

            webRtcEndpoint.on('OnIceCandidate', function(event) {
                var candidate = kurento.getComplexType('IceCandidate')(event.candidate);
                ws.send(JSON.stringify({
                    id : 'iceCandidate',
                    candidate : candidate
                }));
            });

            webRtcEndpoint.processOffer(sdpOffer, function(error, sdpAnswer) {
                if (error) {
                    stop(sessionId);
                    return callback(error);
                }

                if (presenter === null) {
                    stop(sessionId);
                    return callback(noPresenterMessage);
                }

                callback(null, sdpAnswer);
            });

    console.log("    ");
    console.log("==== processOffer is passed through successully. ====");
    console.log("    ")

            webRtcEndpoint.gatherCandidates(function(error) {
                if (error) {
                    stop(sessionId);
                    return callback(error);
                }
            });
        });
    });
});

}

It will be helpful if you could add the similar log-printers at your side and show us your log then.

Thanks,
                                                             Jim
virtualfunction commented 6 years ago

I'll need to check again, but I recalled webRtcEndpoint.processOffer failed to run. The server logs showed ICE negotiation was happening, but as to how far it got in the process, I'm not sure.

I recall a long time ago using node-fetc / axios, trying to use node's own HTTP API and this resulted slightly random results (however I think this was back in node 4.x days)

huaxiaozhong1 commented 6 years ago

Yeah, please send us your debug log of app server, which "kurento-tutorial-node/kurento-one2many-call" results in. It sounds that you ran a debug-build of kms also, and found error behavior on the Kurento media server side regarding to the issue. So, could you also send us the related debug log that media server results in?

huaxiaozhong1 commented 6 years ago

Again with your description:" Fails to execute processOffer", let's look into function "kms_sdp_session_process_offer" at kms-omni-build/kms-core/src/gst-plugins/commons/kmssdpsession.c.

If you build/run kms as the steps as https://github.com/Kurento/doc-kurento/blob/master/static/kms_development_guide.md#launch-kms, then there should be 2 lines of logouts: "Process SDP Offer:", "Generated SDP Answer:".

it means that the WebRtc peer(EndPoint) at media server side, invoked by startPresenter at server.js, has received "offer" from appServer, the running program "kurento-one2many-call" that is a Node.js app.

According to your description, you thought that the media server was blocked at processOffer. So, could you confirm that the function "kms_sdp_session_process_offer" was called but didn't go to the end? Then we could move the debugging into next step.

Thanks,

virtualfunction commented 6 years ago

Hi, sorry it's been a while, I've not been able to revisit looking at the Kurento stuff I was doing for some months due to time constraints. I did manage to skirt around the issue in the end (to the best of my knowledge) by using the node http libraries directly, via callbacks. I wonder if there is something odd with Promises causing issues here as a result given the callback based approach worked for me, and I happen to notice these issues

https://github.com/Kurento/bugtracker/issues/133 https://github.com/Kurento/bugtracker/issues/181 https://github.com/Kurento/bugtracker/issues/71

I'll try and have a stab later to see if "kurento-client": "https://github.com/Kurento/kurento-client-js#master" works as suggested on 16 Feb last year by tgabi333 (didn't see that before).

If that does fix it can someone else push new versions to NPM as it'll be massively helpful for others. It looks like there's been a bit of activity in the various Kurento repos recently, so is this a sign that we'll have people maintaining stuff? - just as lack of community makes me nervous and end up looking at things like MediaSoup!... something I'd rather avoid ;-)

micaelgallego commented 6 years ago

Hi @virtualfunction,

KurentoClient JS is not very well maintained by our small team. We are focused on Kurento Media Server and KurentoUtils.js.

But if there is a PR that fixes some bug or issue, we will accept it without problem.