Bug: Segmentation Error

robertsLando commented 6 years ago

I have been able to replicate the bug 'Segmentation Error: Core Dump created'. It happens when you try to close the client while there is a scan or a device read in progress. Test it with this code that is like example.js but edited with a close after the scan and not inside (I have tested it with a serial connection OS: Ubuntu 16.1):


var MbusMaster = require('../index.js');

/* CONNECTION EXAMPLES

var mbusOptions = {
host: '127.0.0.1',
port: port,
autoConnect: true
};

var mbusOptions = {
    serialPort: '/dev/ttyUSB0',
    serialBaudRate: 9600,
    autoConnect: true
};

*/

var mbusOptions = {};

function logUsage(){
  console.log("USAGE");
  console.log("SERIAL: node ./example/example.js -serialPort /dev/ttyUSB0 -serialBaudrate 9600 -autoConnect true");
  console.log("TCP: node ./example/example.js -host 127.0.0.1 -port 2000 -autoConnect true");
}

//PARSE OPTIONS
var args = process.argv.slice(2);

for (var i = 0; i < args.length; i++) {
  if(args[i].startsWith('-'))
    mbusOptions[args[i].substring(1)] = i+1 < args.length ? args[++i].trim() : null;
}

mbusOptions.autoConnect = mbusOptions.autoConnect === 'true';

if(mbusOptions.serialPort && !mbusOptions.serialBaudrate){
  console.log("Missing serial baudRate");
  logUsage();
}
else if(!mbusOptions.serialPort && mbusOptions.serialBaudrate){
  console.log("Missing serial port");
  logUsage();
}
else if(mbusOptions.host && !mbusOptions.port){
  console.log("Missing TCP port");
  logUsage();
}
else if(!mbusOptions.host && mbusOptions.port){
  console.log("Missing TCP host");
  logUsage();
}
else{

  console.log("Connecting with options: " + JSON.stringify(mbusOptions, null, 2));

  var mbusMaster = new MbusMaster(mbusOptions);

  if (!mbusMaster.connect()) {
    console.log('Connection failed.');
    process.exit();
  }

  console.log('Reading device ID 1...');

  // request for data from devide with ID 1
  mbusMaster.getData(1, function(err, data) {
    console.log('err: ' + err);
    console.log('data: ' + JSON.stringify(data, null, 2));

    console.log('Scanning network...');

    mbusMaster.scanSecondary(function(err, data) {
      console.log('err: ' + err);
      console.log('data: ' + JSON.stringify(data, null, 2));

      console.log('Closing client...');
      mbusMaster.close(function(err){
        if(err)
          console.log('Error while closing client');
        else {
          console.log('Client closed');
        }
      });
    });

    console.log('Closing client...');
    mbusMaster.close(function(err){
      if(err)
        console.log('Error while closing client');
      else {
        console.log('Client closed');
      }
    });
  });
}

Apollon77 commented 6 years ago

Great finding. I think I will remember when communication is running and simply do not close even when called and return "false" ... what you think?

robertsLando commented 6 years ago

I think that would be better you someway stop all running process before closing that has more sense in my opinion. I have also noticed other bugs, connection management is making me crazy without events and other connection status variables lol...

Apollon77 commented 6 years ago

Bugs in library or bugs in your code?!

In the end the topic is flow control and the asynchronous execution of the scan/get calls. So close can not wait because would block the node event loop. Or I also need to make connect and close to run asynchron

robertsLando commented 6 years ago

I think bugs are in both sides, just need to find a way to fix them all :)

I'm pretty sure I'm doing something wrong because things works but when client is doing a scan or getting devices data and:

usb disconnected
manual restart of the connection while another operation is running
simply close node-red process or redeploy nodes

an error is thrown or application get stucked (not all times but most and this is so wired)

I have tried to check the c++ files but I never used c++ in my life, just C, so I don't know how to help you in that side.

You told me you have used async workers for operations, what about kill them? Aren't them like threads that you can kill?

Maybe using async workers for connect and close would be fine too!

Apollon77 commented 6 years ago

I hade some thoughts and in the end it is really about "flow control". The libmbus normally is complete synchronous flow. So each call "blocks" till finished. With this you never can create such "async behaviour" on a program-flow level. So disconnecting USB or killing the whole process are the only things you can do and I tested that - this should work without any problem and return an error.

So the "nodejs way" with the need to use async behaviour or you destroy the whole event loop just introduce the option to break the flow by "close/reconnect while working" and all these things.

So I have two ideas that I need to check. First way is to store some kind of "currently-working"-flag in the native part which "declines" anything that is tried to be done and could break the flow. This would mean immediate errors or such. Alternative would be to implement a "queuing" in the nodejs part so that all methods are async in general and if one call is currently processed anything else (including connect/disconnect calls) need to wait till anything before has finished. Would be easier to implement beause I'm also not really the C/C++ guy ;-)

Redeploying nodes in node-red that do not wait till methods have finished will be a problem in both cases, especially for the maybe long running mbus stuff. Still need to think about this. But could be better with the queueing idea becase you can have a callback when ready.

I don't think that I can "kill" an asyncWorker of Nan (need to check that), but a "break as early as you can" on the native side could work - even if it is not "immediately". I thought about that too.

robertsLando commented 6 years ago

Hi @Apollon77! Thanks for your reply. If you check my code I have fixed some problems by introducing an operationRunning flag that is set to true while scanning or getting device datas.

I have also created a method tryClose that checks for this flag before closing the client and if there is an operation running it set a timeout to another tryClose in 500ms (I know it's not so elegant but at least it works without errors).

So this would be your 'first way'.

The problem is with scan, it can takes minutes and when node-red calls 'close' on a node (with a redeploy or when node-red is closed or when the node is removed) it has a timeout (don't know the exact time) and if the node doesn't close it stops waiting and everything stucks.

I have also a commandQueue that takes command and process them one by one

I think the best option in our case is the 'break as early as you can' :)

Apollon77 commented 6 years ago

I also have this "operationRunning "in my project that uses the lib ... I think I implement a mixture ... stay tuned some days till I figured something out :-)

robertsLando commented 6 years ago

Ok I will stay tuned :)

robertsLando commented 6 years ago

Check this :)

M-Bus Dashboard

robertsLando commented 6 years ago

I also get segmentation error when following this steps:

Doing an operation (scan or getdata)
Disconnecting the usb
All devices get an error, so I restart the connection (by doing a close and a connect)

The segmentation fault occurs when close is called on client after usb has been disconnected.

At this point I think most problems are related to close method

Apollon77 commented 6 years ago

First of all I have reset the timeouts for serial to the standards, so it is now ten times faster (only in worstcase some slower devices that work out of specs could be problematic ... we will see when such a devices comes).

I also checked on how to access JavaScript/V8 data from out of the async worker (hwich only has C++ scope) and this will be somewhat tricky ... There are ways but as said tricky ... I think for now I will start with a "communicationInProgress" flag and blocking other functions during that.

The "close after disconnect gives segfault" normally can not be the case ... I added one additional check (just to make sure), but don't think that it could help, so you should retest with next version :-)

Apollon77 commented 6 years ago

Please test 0.4.0 from npm. Changes are:

track if communication is in progress and decline other actions while in progress. returns error
all to change tcp timeouts (default is 4s, very long)
reset serial timeouts to defaults (should be all faster by 10 times)
one "to make sure" change for the close after disconnection to avoid segfaults

Give it a try please

Apollon77 commented 6 years ago

0.4.1 fixes a problem, use this one at least!! 0.4.2 provides a better handling for close. close can now also wait till communication has finished and close could be done. Needs to provide callback AND second parameter. See docs

robertsLando commented 6 years ago

I will try now and let you know :+1: :smile:

Apollon77 commented 6 years ago

I had one "mind change" overnight according the callback handling on close. So small change of behaviour in 0.4.3 just released. now makes more sense when using close with callback (waits for communication finish now by default when callback is provided)

robertsLando commented 6 years ago

So it is like now just removed the boolean waitTillClose that is true by default?

Apollon77 commented 6 years ago

Yes kind of. it is still there, but behaves as follows when unset:

if callback provided, default is true
if NO callback is provided, default is false But you can still manually set it to "false" also when callback is there to get an error callback when communication is in progress.

But in default you do not need that parameter anymore

robertsLando commented 6 years ago

I don't know If there is something wrong with changes but now when I start the scan it keeps reading devices :confused:

Apollon77 commented 6 years ago

So you started a scan and still could read? :-(

Apollon77 commented 6 years ago

The "parallel read" should get an error set in the callback and no data!

He has also compiled while updating?

Apollon77 commented 6 years ago

PS: To check the "internal communication in progress" flag you please could check/output yourMaster.mbusMaster.communicationInProgress where the first "yourMaster" is the name of your instance of the library.

robertsLando commented 6 years ago

Ok just first of all: operations are really faster than before, really nice update! :smile: About what's happening now is that once connected I start reading devices, all operations in my code are queued by default. Devices read operation is done by default when there are no other request by the user (like a scan or a specific address read). So now what happens: client connects and starts reading, than I trigger a scan, It finish the running read operation than it starts the scan, once the scan is finished it restart reading devices cause there are not any more queued operation. There I get an error 'Connection not open' an null at first read, I think this happens because I restart the read inside the scan callback and your flag 'operationRunnung' maybe is still true?

About my previous comment: I was wrong, the operations where not parallel, I said this because the scan has finished just after few seconds instead of one/two minutes and I also got an error on the read so I thoght that was the problem :smile:

Update: Also the error thrown on the data read after the scan has not err.message property but is just a string and I thing this is not correct

Apollon77 commented 6 years ago

You could check as described above if communicationInProgress is still true (just for debug purpose) Try 0.4.4, I reset the flag on some more points to prevent strange things (especilly on connect/close). maybe this helps. Else: Do you use "close" with callback and this had no err?

robertsLando commented 6 years ago

I use the close with the callback and it goes on timeout by node-red that as I told you it has a 15 seconds timout when it try to close nodes, but at least it doesn't throw errors. Could you try to check the error thrown too? It should not be a simple string

Apollon77 commented 6 years ago

Blockling is strange ...

Soooo ... try 0.4.5 (DEBUG VERSION!!)

Please install with --debug!
will log to stderr all flag set's ... need stderr output to see whats happening and why it is not reset in your case. Thank you!
make sure that the err-strings returned from native part ae processed as Error objects

Apollon77 commented 6 years ago

Use 0.4.6 ... 0.4.5 was messed up

robertsLando commented 6 years ago

It seems never closes even if communicationInProgress is false, it goes on timeout every time:

virtual void RecieveWorker::HandleOKCallback(): Set communicationInProgress to false
static Nan::NAN_METHOD_RETURN_TYPE MbusMaster::Get(Nan::NAN_METHOD_ARGS_TYPE): Set communicationInProgress to true
virtual void RecieveWorker::HandleOKCallback(): Set communicationInProgress to false
static Nan::NAN_METHOD_RETURN_TYPE MbusMaster::Get(Nan::NAN_METHOD_ARGS_TYPE): Set communicationInProgress to true
^C26 Mar 09:49:57 - [info] Stopping flows
26 Mar 09:49:57 - [debug] red/nodes/flows.stop : stopping flow : global
26 Mar 09:49:57 - [debug] red/nodes/flows.stop : stopping flow : 91f58b94.449f68
virtual void RecieveWorker::HandleOKCallback(): Set communicationInProgress to false
26 Mar 09:49:58 - [info] [mbus-client:test] Connection closed
26 Mar 09:50:12 - [error] [mbus-client:test] Error stopping node: TimeoutError: timed out after 15000ms
26 Mar 09:50:12 - [debug] TimeoutError: timed out after 15000ms
    at onTimeout (/usr/lib/node_modules/node-red/node_modules/when/lib/decorators/timed.js:69:7)
    at Timeout._onTimeout (/usr/lib/node_modules/node-red/node_modules/when/lib/decorators/timed.js:13:4)
    at ontimeout (timers.js:386:14)
    at tryOnTimeout (timers.js:250:5)
    at Timer.listOnTimeout (timers.js:214:5)
26 Mar 09:50:12 - [info] Stopped flows

robertsLando commented 6 years ago

Ok updating now

robertsLando commented 6 years ago

Ok seems the callback of close method is never called :confused:

Apollon77 commented 6 years ago

Try 0.4.7 please. Debug removed

robertsLando commented 6 years ago

Everything is working like a charm now :tada: even errors when disconnecting USB while operation is running is correclty handled and close works

robertsLando commented 6 years ago

So first a of all my compliment for the improvments :tada: Now: I have seen the update for primary id setting by @lategoodbye . I will start implementing the feature on the node waiting it to be implemented by your library too :smiley:

Apollon77 commented 6 years ago

Everything is working like a charm now 🎉 even errors when disconnecting USB while operation is running is correclty handled and close works

Yes, "this" in async called funcions is a bad idea :-)) This was the reason for the missing callback :-) For the latter: I could try to remove the one change I did to see if it still works without i t... what do you think?!

So first a of all my compliment for the improvments 🎉 Now: I have seen the update for primary id setting by @lategoodbye . I will start implementing the feature on the node waiting it to be implemented by your library too 😃

Yes this is next. I wanted to have this topic sorted out/fixed first to have a clean base

robertsLando commented 6 years ago

Yes this is next. I wanted to have this topic sorted out/fixed first to have a clean base

Sure I agree!

For the latter: I could try to remove the one change I did to see if it still works without i t... what do you think?!

I don't understand, which change are you referencing with 'latter'?

lategoodbye commented 6 years ago

@Apollon77 Could you please send a new pull request which fixes the segmentation fault?

Apollon77 commented 6 years ago

@lategoodbye First of all I will revert that and let @robertsLando test again to verify that this segfault was not caused by other things ... I can not really believe that this was the reason ... So want to proof it first

@robertsLando Would you be so kind and test the 0.4.8 especially for this "close after manual disconnect causes segfault" issue (again). That would be great!

robertsLando commented 6 years ago

Sure but need to wait 2 hours, I'm out right now

robertsLando commented 6 years ago

Anyway: I got a segmentation fault with 0.4.7 with a manual disconnection while reading a device

Apollon77 commented 6 years ago

The change I did could only prevent it with "close after disconnect" ... I will check read again too

Apollon77 commented 6 years ago

Could you install it with --debug and send output for such a Segfault including logs?

Apollon77 commented 6 years ago

I found one potential place where things could go strange ... changed that in 0.4.9 ... could you please try again?!

Apollon77 commented 6 years ago

STOP ... still fixing sthings

I will wait till CI has finished ... I will wait thill CI has finished ... (x98 more)

Apollon77 commented 6 years ago

Ok, 0.4.11 clean again ... please try again to break it and ideally with debug enabled

robertsLando commented 6 years ago

Now I'm back, let me check it. (What about to use a slack channel or another chat for our discussions?)

robertsLando commented 6 years ago

Error when disconnecting on reading a device ID:

*** Error in `node-red': corrupted double-linked list: 0x000055bd17875bc0 ***
======= Backtrace: =========
....
7f87dc5e1000-7f87dc604000 r-xp 00000000 08:04 1981997                    /home/daniel/node-red-contrib-m-bus/node_modules/node-mbus/build/Release/mbus.node
7f87dc604000-7f87dc803000 ---p 00023000 08:04 1981997                    /home/daniel/node-red-contrib-m-bus/node_modules/node-mbus/build/Release/mbus.node
7f87dc803000-7f87dc804000 r--p 00022000 08:04 1981997                    /home/daniel/node-red-contrib-m-bus/node_modules/node-mbus/build/Release/mbus.node
7f87dc804000-7f87dc808000 rw-p 00023000 08:04 1981997                    /home/daniel/node-red-contrib-m-bus/node_modules/node-mbus/build/Release/mbus.node
7f87dc808000-7f87dc80e000 rw-p 00000000 00:00 0 
....

Error with debug enabled:

mbus_serial_recv_frame: Attempt to read 1 bytes [len = 0]
mbus_serial_recv_frame: Got 0 byte [remaining 1, len 0]

mbus_serial_send_frame: Dumping M-Bus frame [5 bytes]: 10 40 FD 3D 16 
mbus_serial_recv_frame: Attempt to read 1 bytes [len = 0]
mbus_serial_recv_frame: Got 0 byte [remaining 1, len 0]

mbus_serial_send_frame: Dumping M-Bus frame [17 bytes]: 68 0B 0B 68 73 FD 52 93 30 92 07 65 32 10 0D D2 16 
mbus_serial_recv_frame: Serial connection is not available.
mbus_select_secondary_address: Unexpected reply for address [079230936532100D]. ret=-1, frameType=0
mbus_serial_send_frame: connection not open
mbus_send_ping_frame: failed to send mbus frame.
26 Mar 14:04:40 - [error] [mbus-client:test] Error while reading device 079230936532100D: Error: The selected secondary address does not match any device [079230936532100D].
26 Mar 14:04:40 - [debug] [mbus-client:test] Restarting client...
26 Mar 14:04:40 - [error] [mbus-client:test] Error while reading device 40007815614C0402: Error: Failed to init slaves.
Segmentation Fault (core dump created)

Apollon77 commented 6 years ago

can you try with --debug for install? (Maybe deleting build folder first :-) ) Do you have a slack channel already setup ?! Else I need to set soemthing up ...

robertsLando commented 6 years ago

Yes, follow this link: https://join.slack.com/t/node-mbus/shared_invite/enQtMzM2MTk3NDM1NjY5LTJlOGI5MmM2N2YyYTJlOTBmNTc0Mjc0NmNkMjE1ZTIxNTAzOGM5MGIxZjAxMDM2NWEwNzA4MDc3M2ViN2MyMDk

robertsLando commented 6 years ago

I have already installed it with --debug option. Check the second part of the comment

robertsLando commented 6 years ago

I have also tried to do a disconnect while scanning in progress and everything works well in that case

Apollon77 commented 6 years ago

@lategoodbye As Info, we moved the main discussion to a Slack, but current status is that when unplugging the USB device a segfault occurs on "close(handle->fd)" in mbus_serial_disconnect. But we are still debugging. Do you know such effects and have an Idea!? I think we still need to call "close" for an filedescriptor even when he got unavailable in the meantime ... or ?!

Apollon77 / node-mbus

Bug: Segmentation Error #9