Closed robertsLando closed 6 years ago
Great finding. I think I will remember when communication is running and simply do not close even when called and return "false" ... what you think?
I think that would be better you someway stop all running process before closing that has more sense in my opinion. I have also noticed other bugs, connection management is making me crazy without events and other connection status variables lol...
Bugs in library or bugs in your code?!
In the end the topic is flow control and the asynchronous execution of the scan/get calls. So close can not wait because would block the node event loop. Or I also need to make connect and close to run asynchron
I think bugs are in both sides, just need to find a way to fix them all :)
I'm pretty sure I'm doing something wrong because things works but when client is doing a scan or getting devices data and:
an error is thrown or application get stucked (not all times but most and this is so wired)
I have tried to check the c++ files but I never used c++ in my life, just C, so I don't know how to help you in that side.
You told me you have used async workers for operations, what about kill them? Aren't them like threads that you can kill?
Maybe using async workers for connect and close would be fine too!
I hade some thoughts and in the end it is really about "flow control". The libmbus normally is complete synchronous flow. So each call "blocks" till finished. With this you never can create such "async behaviour" on a program-flow level. So disconnecting USB or killing the whole process are the only things you can do and I tested that - this should work without any problem and return an error.
So the "nodejs way" with the need to use async behaviour or you destroy the whole event loop just introduce the option to break the flow by "close/reconnect while working" and all these things.
So I have two ideas that I need to check. First way is to store some kind of "currently-working"-flag in the native part which "declines" anything that is tried to be done and could break the flow. This would mean immediate errors or such. Alternative would be to implement a "queuing" in the nodejs part so that all methods are async in general and if one call is currently processed anything else (including connect/disconnect calls) need to wait till anything before has finished. Would be easier to implement beause I'm also not really the C/C++ guy ;-)
Redeploying nodes in node-red that do not wait till methods have finished will be a problem in both cases, especially for the maybe long running mbus stuff. Still need to think about this. But could be better with the queueing idea becase you can have a callback when ready.
I don't think that I can "kill" an asyncWorker of Nan (need to check that), but a "break as early as you can" on the native side could work - even if it is not "immediately". I thought about that too.
Hi @Apollon77! Thanks for your reply. If you check my code I have fixed some problems by introducing an operationRunning
flag that is set to true while scanning or getting device datas.
I have also created a method tryClose
that checks for this flag before closing the client and if there is an operation running it set a timeout to another tryClose
in 500ms (I know it's not so elegant but at least it works without errors).
So this would be your 'first way'.
The problem is with scan, it can takes minutes and when node-red calls 'close' on a node (with a redeploy or when node-red is closed or when the node is removed) it has a timeout (don't know the exact time) and if the node doesn't close it stops waiting and everything stucks.
I have also a commandQueue that takes command and process them one by one
I think the best option in our case is the 'break as early as you can' :)
I also have this "operationRunning "in my project that uses the lib ... I think I implement a mixture ... stay tuned some days till I figured something out :-)
Ok I will stay tuned :)
Check this :)
I also get segmentation error when following this steps:
close
and a connect
)The segmentation fault occurs when close
is called on client after usb has been disconnected.
At this point I think most problems are related to close
method
First of all I have reset the timeouts for serial to the standards, so it is now ten times faster (only in worstcase some slower devices that work out of specs could be problematic ... we will see when such a devices comes).
I also checked on how to access JavaScript/V8 data from out of the async worker (hwich only has C++ scope) and this will be somewhat tricky ... There are ways but as said tricky ... I think for now I will start with a "communicationInProgress" flag and blocking other functions during that.
The "close after disconnect gives segfault" normally can not be the case ... I added one additional check (just to make sure), but don't think that it could help, so you should retest with next version :-)
Please test 0.4.0 from npm. Changes are:
Give it a try please
0.4.1 fixes a problem, use this one at least!! 0.4.2 provides a better handling for close. close can now also wait till communication has finished and close could be done. Needs to provide callback AND second parameter. See docs
I will try now and let you know :+1: :smile:
I had one "mind change" overnight according the callback handling on close. So small change of behaviour in 0.4.3 just released. now makes more sense when using close with callback (waits for communication finish now by default when callback is provided)
So it is like now just removed the boolean waitTillClose
that is true by default?
Yes kind of. it is still there, but behaves as follows when unset:
But in default you do not need that parameter anymore
I don't know If there is something wrong with changes but now when I start the scan it keeps reading devices :confused:
So you started a scan and still could read? :-(
The "parallel read" should get an error set in the callback and no data!
He has also compiled while updating?
PS: To check the "internal communication in progress" flag you please could check/output yourMaster.mbusMaster.communicationInProgress where the first "yourMaster" is the name of your instance of the library.
Ok just first of all: operations are really faster than before, really nice update! :smile: About what's happening now is that once connected I start reading devices, all operations in my code are queued by default. Devices read operation is done by default when there are no other request by the user (like a scan or a specific address read). So now what happens: client connects and starts reading, than I trigger a scan, It finish the running read operation than it starts the scan, once the scan is finished it restart reading devices cause there are not any more queued operation. There I get an error 'Connection not open' an null at first read, I think this happens because I restart the read inside the scan callback and your flag 'operationRunnung' maybe is still true?
About my previous comment: I was wrong, the operations where not parallel, I said this because the scan has finished just after few seconds instead of one/two minutes and I also got an error on the read so I thoght that was the problem :smile:
Update: Also the error thrown on the data read after the scan has not err.message
property but is just a string and I thing this is not correct
You could check as described above if communicationInProgress is still true (just for debug purpose) Try 0.4.4, I reset the flag on some more points to prevent strange things (especilly on connect/close). maybe this helps. Else: Do you use "close" with callback and this had no err?
I use the close with the callback and it goes on timeout by node-red that as I told you it has a 15 seconds timout when it try to close nodes, but at least it doesn't throw errors. Could you try to check the error thrown too? It should not be a simple string
Blockling is strange ...
Soooo ... try 0.4.5 (DEBUG VERSION!!)
Use 0.4.6 ... 0.4.5 was messed up
It seems never closes even if communicationInProgress
is false, it goes on timeout every time:
virtual void RecieveWorker::HandleOKCallback(): Set communicationInProgress to false
static Nan::NAN_METHOD_RETURN_TYPE MbusMaster::Get(Nan::NAN_METHOD_ARGS_TYPE): Set communicationInProgress to true
virtual void RecieveWorker::HandleOKCallback(): Set communicationInProgress to false
static Nan::NAN_METHOD_RETURN_TYPE MbusMaster::Get(Nan::NAN_METHOD_ARGS_TYPE): Set communicationInProgress to true
^C26 Mar 09:49:57 - [info] Stopping flows
26 Mar 09:49:57 - [debug] red/nodes/flows.stop : stopping flow : global
26 Mar 09:49:57 - [debug] red/nodes/flows.stop : stopping flow : 91f58b94.449f68
virtual void RecieveWorker::HandleOKCallback(): Set communicationInProgress to false
26 Mar 09:49:58 - [info] [mbus-client:test] Connection closed
26 Mar 09:50:12 - [error] [mbus-client:test] Error stopping node: TimeoutError: timed out after 15000ms
26 Mar 09:50:12 - [debug] TimeoutError: timed out after 15000ms
at onTimeout (/usr/lib/node_modules/node-red/node_modules/when/lib/decorators/timed.js:69:7)
at Timeout._onTimeout (/usr/lib/node_modules/node-red/node_modules/when/lib/decorators/timed.js:13:4)
at ontimeout (timers.js:386:14)
at tryOnTimeout (timers.js:250:5)
at Timer.listOnTimeout (timers.js:214:5)
26 Mar 09:50:12 - [info] Stopped flows
Ok updating now
Ok seems the callback of close
method is never called :confused:
Try 0.4.7 please. Debug removed
Everything is working like a charm now :tada: even errors when disconnecting USB while operation is running is correclty handled and close works
So first a of all my compliment for the improvments :tada: Now: I have seen the update for primary id setting by @lategoodbye . I will start implementing the feature on the node waiting it to be implemented by your library too :smiley:
Everything is working like a charm now 🎉 even errors when disconnecting USB while operation is running is correclty handled and close works
Yes, "this" in async called funcions is a bad idea :-)) This was the reason for the missing callback :-) For the latter: I could try to remove the one change I did to see if it still works without i t... what do you think?!
So first a of all my compliment for the improvments 🎉 Now: I have seen the update for primary id setting by @lategoodbye . I will start implementing the feature on the node waiting it to be implemented by your library too 😃
Yes this is next. I wanted to have this topic sorted out/fixed first to have a clean base
Yes this is next. I wanted to have this topic sorted out/fixed first to have a clean base
Sure I agree!
For the latter: I could try to remove the one change I did to see if it still works without i t... what do you think?!
I don't understand, which change are you referencing with 'latter'?
@Apollon77 Could you please send a new pull request which fixes the segmentation fault?
@lategoodbye First of all I will revert that and let @robertsLando test again to verify that this segfault was not caused by other things ... I can not really believe that this was the reason ... So want to proof it first
@robertsLando Would you be so kind and test the 0.4.8 especially for this "close after manual disconnect causes segfault" issue (again). That would be great!
Sure but need to wait 2 hours, I'm out right now
Anyway: I got a segmentation fault with 0.4.7 with a manual disconnection while reading a device
The change I did could only prevent it with "close after disconnect" ... I will check read again too
Could you install it with --debug and send output for such a Segfault including logs?
I found one potential place where things could go strange ... changed that in 0.4.9 ... could you please try again?!
STOP ... still fixing sthings
I will wait till CI has finished ... I will wait thill CI has finished ... (x98 more)
Ok, 0.4.11 clean again ... please try again to break it and ideally with debug enabled
Now I'm back, let me check it. (What about to use a slack channel or another chat for our discussions?)
Error when disconnecting on reading a device ID:
*** Error in `node-red': corrupted double-linked list: 0x000055bd17875bc0 ***
======= Backtrace: =========
....
7f87dc5e1000-7f87dc604000 r-xp 00000000 08:04 1981997 /home/daniel/node-red-contrib-m-bus/node_modules/node-mbus/build/Release/mbus.node
7f87dc604000-7f87dc803000 ---p 00023000 08:04 1981997 /home/daniel/node-red-contrib-m-bus/node_modules/node-mbus/build/Release/mbus.node
7f87dc803000-7f87dc804000 r--p 00022000 08:04 1981997 /home/daniel/node-red-contrib-m-bus/node_modules/node-mbus/build/Release/mbus.node
7f87dc804000-7f87dc808000 rw-p 00023000 08:04 1981997 /home/daniel/node-red-contrib-m-bus/node_modules/node-mbus/build/Release/mbus.node
7f87dc808000-7f87dc80e000 rw-p 00000000 00:00 0
....
Error with debug enabled:
mbus_serial_recv_frame: Attempt to read 1 bytes [len = 0]
mbus_serial_recv_frame: Got 0 byte [remaining 1, len 0]
mbus_serial_send_frame: Dumping M-Bus frame [5 bytes]: 10 40 FD 3D 16
mbus_serial_recv_frame: Attempt to read 1 bytes [len = 0]
mbus_serial_recv_frame: Got 0 byte [remaining 1, len 0]
mbus_serial_send_frame: Dumping M-Bus frame [17 bytes]: 68 0B 0B 68 73 FD 52 93 30 92 07 65 32 10 0D D2 16
mbus_serial_recv_frame: Serial connection is not available.
mbus_select_secondary_address: Unexpected reply for address [079230936532100D]. ret=-1, frameType=0
mbus_serial_send_frame: connection not open
mbus_send_ping_frame: failed to send mbus frame.
26 Mar 14:04:40 - [error] [mbus-client:test] Error while reading device 079230936532100D: Error: The selected secondary address does not match any device [079230936532100D].
26 Mar 14:04:40 - [debug] [mbus-client:test] Restarting client...
26 Mar 14:04:40 - [error] [mbus-client:test] Error while reading device 40007815614C0402: Error: Failed to init slaves.
Segmentation Fault (core dump created)
can you try with --debug for install? (Maybe deleting build folder first :-) ) Do you have a slack channel already setup ?! Else I need to set soemthing up ...
I have already installed it with --debug option. Check the second part of the comment
I have also tried to do a disconnect while scanning in progress and everything works well in that case
@lategoodbye As Info, we moved the main discussion to a Slack, but current status is that when unplugging the USB device a segfault occurs on "close(handle->fd)" in mbus_serial_disconnect. But we are still debugging. Do you know such effects and have an Idea!? I think we still need to call "close" for an filedescriptor even when he got unavailable in the meantime ... or ?!
I have been able to replicate the bug 'Segmentation Error: Core Dump created'. It happens when you try to close the client while there is a scan or a device read in progress. Test it with this code that is like example.js but edited with a close after the scan and not inside (I have tested it with a serial connection OS: Ubuntu 16.1):