tjtolon opened this issue 9 years ago
Whoa, that's a critical bug! I'm going to have trouble getting any computer time today, but I'll get into this as soon as possible.
A couple of times the bug randomly failed to appear, which further points to a concurrency problem.
What setup do you make for debugging/profiling?
There is a vagrant box in the repo that I try to keep reproducible. However, it was on GHC 7.8.3, so I updated it to 7.10.2 this morning for testing this issue. I'll update the vagrant setup when I can see things working.
Okay, I had a chance to try things out and couldn't reproduce. The vagrant box is now Ubuntu 14.04.1 LTS (Trusty) with GHC 7.10.2. However, that's with ROS Indigo, so perhaps it's a problem with Jade Turtle. Just to be 100% clear, here is what I tried:
Starting directory: `/home/vagrant/roshask`

```shell
cabal sandbox init
cabal install
cd Examples/PubSub
cabal sandbox init --sandbox=/home/vagrant/roshask/.cabal-sandbox
../../dist/build/roshask/roshask dep
cabal build
```
Now, with `roscore` running, I can use two shells to run both `cabal run listener` and `cabal run talker`. I also tried using `rostopic echo /chatter` instead of, and in parallel with, the `listener` program, and kept an eye on `top` while doing so. I didn't see any runaway memory use when stopping and starting things.

So I didn't completely recreate your setup, in that I used the existing examples and a different version of ROS. Can you give the `Examples` a quick try to see if they exhibit the same problem?
I'll try getting Jade Turtle up and running to see if that's the issue.
I made some attempts with `Examples/PubSub`, with the only difference being that I have the Haskell libraries installed directly to `~/.cabal`, without a cabal sandbox. `rostopic echo` showed the memory-leaking behaviour.

Thanks for helping with this! I'll have to try with Jade to see if a new rostopic is doing something funny. I got Jade installed, but haven't had a chance to try it yet.
I still can't reproduce this with ROS Jade Turtle. I ran the `talker` node, then started and stopped `rostopic echo /chatter` 10 times, each time letting it run for at least 20 s. I never saw `talker` using more than 1% CPU. The good news is that it's not a compatibility break in the ROS transport protocol; the bad news is that I haven't fixed the issue.
Since one main difference between our test setups is the use of a sandbox, here's some relevant output:
vagrant@vagrant-ubuntu-trusty-64:~/roshask/Examples/PubSub$ cabal sandbox hc-pkg list
/opt/ghc/7.10.2/lib/ghc-7.10.2/package.conf.d
Cabal-1.22.4.0
array-0.5.1.0
base-4.8.1.0
bin-package-db-0.0.0.0
binary-0.7.5.0
bytestring-0.10.6.0
containers-0.5.6.2
deepseq-1.4.1.1
directory-1.2.2.0
filepath-1.4.0.0
ghc-7.10.2
ghc-prim-0.4.0.0
haskeline-0.7.2.1
hoopl-3.10.0.2
hpc-0.6.0.2
integer-gmp-1.0.0.0
pretty-1.1.2.0
process-1.2.3.0
rts-1.0
template-haskell-2.10.0.0
terminfo-0.4.0.1
time-1.5.0.1
transformers-0.4.2.0
unix-2.7.1.0
xhtml-3000.2.1
/home/vagrant/roshask/.cabal-sandbox/x86_64-linux-ghc-7.10.2-packages.conf.d
Boolean-0.2.3
BoundedChan-1.0.3.0
HUnit-1.3.0.0
HaXml-1.25.3
HsOpenSSL-0.11.1.1
MemoTrie-0.6.4
MonadCatchIO-transformers-0.3.1.3
NumInstances-1.4
ROS-std-msgs-0.3
SafeSemaphore-0.10.1
aeson-0.10.0.0
attoparsec-0.13.0.1
attoparsec-enumerator-0.3.4
base-compat-0.8.2
base64-bytestring-1.0.0.1
blaze-builder-0.4.0.1
blaze-builder-enumerator-0.2.1.0
bytestring-builder-0.10.6.0.0
bytestring-mmap-0.2.2
case-insensitive-1.2.0.5
cereal-0.5.1.0
crypto-api-0.13.2
data-default-generics-0.3
dlist-0.7.1.2
entropy-0.3.7
enumerator-0.4.20
extensible-exceptions-0.1.1.4
filemanip-0.3.6.3
hashable-1.2.3.3
haxr-3000.11.1.2
http-common-0.8.2.0
http-streams-0.8.3.3
http-types-0.9
io-streams-1.3.2.0
lens-family-core-1.2.0
lens-family-th-0.4.1.0
monads-tf-0.1.0.2
mtl-2.2.1
mtl-compat-0.2.1.3
network-2.6.2.1
network-uri-2.6.0.3
old-locale-1.0.0.7
old-time-1.1.0.3
openssl-streams-1.2.1.0
parsec-3.1.9
polyparse-1.11
primitive-0.6.1.0
pureMD5-2.1.2.1
random-1.1
regex-base-0.93.2
regex-posix-0.95.2
roshask-0.3
safe-0.3.9
scientific-0.3.4.4
snap-core-0.9.8.0
snap-server-0.9.5.1
stm-2.4.4
storable-record-0.0.3
storable-tuple-0.0.2
streaming-commons-0.1.15
syb-0.6
tagged-0.8.2
text-1.2.1.3
unix-compat-0.4.1.4
unordered-containers-0.2.5.1
uri-0.1.6.3
utf8-string-1.0.1.1
utility-ht-0.0.11
vector-0.11.0.0
vector-space-0.10.2
xml-1.3.14
zlib-0.6.1.1
zlib-bindings-0.1.1.5
zlib-enum-0.2.3.1
You can see if any of those are obviously different from what you have, or you can paste your `ghc-pkg list` output and I can take a look too, just in case there is an issue with a particular version of a particular library. The situation you're describing sounds like something going wrong with closing the socket, which I would expect to be affected only by the version of the `network` package. You can see the sandbox I used for testing has `network-2.6.2.1` installed. Does that line up with your configuration?
This is very odd. Looking at the output of `ghc-pkg list`, I do not notice anything odd in my setup:
tatu@tatu-carbon:~/prog/ros/roshask$ ghc-pkg list
/opt/ghc/7.10.3/lib/ghc-7.10.2.20151118/package.conf.d
Cabal-1.22.4.0
array-0.5.1.0
base-4.8.2.0
bin-package-db-0.0.0.0
binary-0.7.5.0
bytestring-0.10.6.0
containers-0.5.6.2
deepseq-1.4.1.1
directory-1.2.2.0
filepath-1.4.0.0
ghc-7.10.2.20151118
ghc-prim-0.4.0.0
haskeline-0.7.2.1
hoopl-3.10.0.2
hpc-0.6.0.2
integer-gmp-1.0.0.0
pretty-1.1.2.0
process-1.2.3.0
rts-1.0
template-haskell-2.10.0.0
terminfo-0.4.0.1
time-1.5.0.1
transformers-0.4.2.0
unix-2.7.1.0
xhtml-3000.2.1
/home/tatu/.ghc/x86_64-linux-7.10.2.20151118/package.conf.d
Boolean-0.2.3
BoundedChan-1.0.3.0
HUnit-1.3.0.0
HaXml-1.25.3
HsOpenSSL-0.11.1.1
MemoTrie-0.6.4
MonadCatchIO-transformers-0.3.1.3
MonadRandom-0.4.1
NumInstances-1.4
QuickCheck-2.8.1
ROS-std-msgs-0.3
SafeSemaphore-0.10.1
StateVar-1.1.0.1
aeson-0.10.0.0
ansi-terminal-0.6.2.3
async-2.0.2
attoparsec-0.13.0.1
attoparsec-enumerator-0.3.4
base-compat-0.8.2
base-orphans-0.4.4
base64-bytestring-1.0.0.1
bifunctors-5.1
blaze-builder-0.4.0.1
blaze-builder-enumerator-0.2.1.0
bytestring-builder-0.10.6.0.0
bytestring-mmap-0.2.2
cabal-helper-0.6.2.0
case-insensitive-1.2.0.5
cereal-0.4.1.1
cereal-0.5.1.0
cmdargs-0.10.13
comonad-4.2.7.2
conduit-1.2.5.1
contravariant-1.3.3
cpphs-1.19.3
crypto-api-0.13.2
data-default-generics-0.3
distributive-0.4.4
djinn-ghc-0.0.2.3
djinn-lib-0.0.1.2
dlist-0.7.1.2
either-4.4.1
enclosed-exceptions-1.0.1.1
entropy-0.3.7
enumerator-0.4.20
exceptions-0.8.0.2
extensible-exceptions-0.1.1.4
extra-1.4.2
fclabels-2.0.2.3
filemanip-0.3.6.3
free-4.12.1
ghc-mod-5.4.0.0
ghc-paths-0.1.0.9
ghc-syb-utils-0.2.3
hashable-1.2.3.3
haskell-src-exts-1.16.0.1
haskell-src-exts-1.17.0
hasktags-0.69.1
haxr-3000.11.1.2
hlint-1.9.22
hlint-1.9.25
hscolour-1.23
http-common-0.8.2.0
http-streams-0.8.3.3
http-types-0.9
io-streams-1.3.2.0
json-0.9.1
lens-family-core-1.2.0
lens-family-th-0.4.1.0
lifted-base-0.2.3.6
mmorph-1.0.4
monad-control-1.0.0.4
monad-journal-0.7.1
monads-tf-0.1.0.2
mtl-2.2.1
mtl-compat-0.2.1.3
network-2.6.2.1
network-uri-2.6.0.3
old-locale-1.0.0.7
old-time-1.1.0.3
openssl-streams-1.2.1.0
parsec-3.1.9
pipes-4.1.7
polyparse-1.11
prelude-extras-0.4.0.2
primitive-0.6.1.0
profunctors-5.1.1
pureMD5-2.1.2.1
random-1.1
refact-0.3.0.2
regex-base-0.93.2
regex-posix-0.95.2
resourcet-1.1.6
roshask-0.3
safe-0.3.9
scientific-0.3.4.4
semigroupoids-5.0.0.4
semigroups-0.18.0.1
snap-core-0.9.8.0
snap-server-0.9.5.1
split-0.2.2
stm-2.4.4
storable-record-0.0.3
storable-tuple-0.0.2
streaming-commons-0.1.15
strict-0.3.2
stylish-haskell-0.5.14.4
syb-0.6
tagged-0.8.2
temporary-1.2.0.3
text-1.2.1.3
tf-random-0.5
transformers-base-0.4.4
transformers-compat-0.4.0.4
uniplate-1.6.12
unix-compat-0.4.1.4
unordered-containers-0.2.5.1
uri-0.1.6.3
utf8-string-1.0.1.1
utility-ht-0.0.11
vector-0.11.0.0
vector-space-0.10.2
void-0.7.1
xml-1.3.14
yaml-0.8.15.1
zlib-0.6.1.1
zlib-bindings-0.1.1.5
zlib-enum-0.2.3.1
I'd like to try out debugging this more deeply on my end. If you were to be able to reproduce this error, how would you proceed on testing this?
Since the bug happens on disconnect, and it's spinning the CPU, I'd guess that something is looping while trying to write to a disconnected socket. The node logic itself is rate limited, so it is most likely lower level.

We send the bytes here, and, just above that, we have the loop that I'm suspicious of. We should be catching exceptions there and cleaning up the client connection.

You can see here that we have some cleanup logic for each client, so we need to run that when we catch an unrecoverable exception in the sending loop. So we need to pass `clients` to `serviceClient`; then, in the exception handler we add, run an STM transaction to find the appropriate cleanup action in `clients` and remove it from the list so it is not run again when the node cleans itself up.
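A minimal sketch of that idea (this is not roshask's actual code; `serviceClient`, the `ClientId` type, and the shape of `clients` are all assumptions made for illustration):

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
module Main where

import Control.Concurrent.STM
import Control.Exception (SomeException, catch)
import Data.IORef

type ClientId = Int

-- Hypothetical sketch: run a client's send loop; on an unrecoverable
-- exception, atomically look up and remove that client's cleanup action
-- from the shared list, then run it. Because the client is removed
-- inside the same STM transaction, node shutdown can't run the cleanup
-- a second time.
serviceClient :: TVar [(ClientId, IO ())] -> ClientId -> IO () -> IO ()
serviceClient clients cid sendLoop =
  sendLoop `catch` \(_ :: SomeException) -> do
    mCleanup <- atomically $ do
      cs <- readTVar clients
      writeTVar clients (filter ((/= cid) . fst) cs)
      return (lookup cid cs)
    sequence_ mCleanup  -- run the cleanup once, if this client was still registered

main :: IO ()
main = do
  ran <- newIORef (0 :: Int)
  clients <- newTVarIO [(1, modifyIORef ran (+ 1))]
  -- Simulate the send loop dying on a closed socket.
  serviceClient clients 1 (ioError (userError "socket closed"))
  n <- readIORef ran
  remaining <- readTVarIO clients
  putStrLn (if n == 1 && null remaining then "cleanup ran once" else "bug")
```

The STM transaction is what makes this safe against the node's own shutdown path: whichever side removes the client from `clients` first is the only side that sees (and runs) its cleanup action.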
Thank you for sharing your thoughts! That sounds reasonable. I will look more closely at this, even though my Haskell-fu is not quite sufficient. Maybe I'll learn something, which is never a bad thing.
I'm a bit confused: why wouldn't all the exceptions be caught here, or what is the `cleanup1` catch actually catching? What would be a probable reason why we have difficulty reproducing the error on your setup? Could it be that my laptop is a bit slower, so the send buffer can grow large enough that `sendAll` takes enough time for the exception to land at the "correct" spot? I.e., pushing the node logic rate higher might reproduce the error on your setup?
That `catch` is indeed supposed to catch an error like this.

My inability to reproduce the issue is strange, and I can't explain it yet. Perhaps I'm completely on the wrong track.

A first step would be to stick `putStrLn`s where we expect to see them at predictable intervals: for instance, in the `talker` node's loop, in the `serviceClient` loop, and perhaps even in the existing exception handler you linked to.
I took the sample code from the wiki, where there is the chatter node. With that node running, I listened to the topic with `rostopic echo /chatter`. After about 10 seconds, I quit the listener, and a couple of seconds later the application running the node crashed with an out-of-memory error. I was able to reproduce this error on my PC multiple times.

The `top` command showed both CPU cores shooting up to 100% utilization and memory allocation starting to rise quickly right after closing the listener. I verified that once this behaviour has started, restarting the topic listener does not help: messages are not sent through and the memory leak persists.

My current setup is as follows: Ubuntu 14.04 LTS 64-bit, GHC 7.10.2, roshask commit e9bb32d (i.e. practically the latest), ROS Jade Turtle, compiled with the -O2 flag. The package was created with `roshask create TestHask std_msgs`, and TestHask.hs was replaced with https://github.com/acowley/roshask/wiki/Tutorial:-Publisher-Subscriber#publisher