acowley / roshask

Haskell client library for the ROS robotics framework.
BSD 3-Clause "New" or "Revised" License
107 stars 18 forks source link

Out of memory after closing a topic listener #42

Open tjtolon opened 9 years ago

tjtolon commented 9 years ago

I took the sample code from the wiki, where there is the chatter node. Executing that node successfully, I listened the topic with rostopic echo /chatter. After executing this for around 10 seconds, I quit the listener, but then after couple of seconds, application running the node crashed with Out of memory error. I was able to reproduce this error on my PC multiple times. top command showed how both CPU core shoots up to 100% utility and memory allocation starts rising up quickly right after closing the listener. I tested that during the time when this behaviour has started, restarting the topic listener does not help the situation, messages are not send through, memory leak still exists.

My current setup is as follow: Ubuntu 64bit 14.04LTS ghc 7.10.2 roshask commit e9bb32d, i.e. practically the latest ROS Jade Turtle compilation with -O2 flag. Package was created with roshask create TestHask std_msgs, and TestHask.hs was replaced with https://github.com/acowley/roshask/wiki/Tutorial:-Publisher-Subscriber#publisher

acowley commented 9 years ago

Whoa, that's a critical bug! I'm going to have trouble getting any computer time today, but I'll get into this as soon as possible.

tjtolon commented 9 years ago

I was able to have this bug not to come up a couple of times randomly, further indicating a problem in concurrency.

What setup do you make for debugging/profiling?

acowley commented 9 years ago

There is a vagrant box in the repo that I try to keep reproducible. However, it was on GHC 7.8.3, so I updated it to 7.10.2 this morning for testing this issue. I'll update the vagrant setup when I can see things working.

acowley commented 9 years ago

Okay, I had a chance to try things out and couldn't reproduce. The vagrant box is Ubuntu 14.04.1 LTS (trusty) now with GHC-7.10.2, However, that's with ROS Indigo, so perhaps it's a problem with Jade Turtle. Just to be 100% clear as to what I tried:

Starting directory: /home/vagrant/roshask

Now, with roscore running, I can use two shells to run both cabal run listener and cabal run talker. I also tried using rostopic echo /chatter instead of, and in parallel with, the listener program, and kept an eye on top while doing so. I didn't see any runaway memory use when stopping and starting things.

So I didn't completely recreate your setup in that I used the existing Examples and a different version of ROS. Can you give the Examples a quick try to see if they exhibit the same problem?

I'll try getting Jade Turtle up and running to see if that's the issue.

tjtolon commented 9 years ago

I made some tries with the Examples/PubSub with the only difference that I have haskell libraries installed directly to ¨/.cabal, without the cabal sandbox.

acowley commented 9 years ago

Thanks for helping with this! I'll have to try with jade to see if a new rostopic is doing something funny. I got jade installed, but haven't had a chance to try it yet.

acowley commented 9 years ago

I still can't reproduce this with ROS Jade Turtle. I ran the talker node, then started and stopped rostopic echo /chatter 10 times, each time letting it run for at least 20s. I never saw talker using more than 1% CPU. The good news is that it's not a compatibility break in the ROS transport protocol, the bad news is I haven't fixed the issue.

Since one main difference between our test setups is the use of a sandbox, here's some relevant output:

vagrant@vagrant-ubuntu-trusty-64:~/roshask/Examples/PubSub$ cabal sandbox hc-pkg list
/opt/ghc/7.10.2/lib/ghc-7.10.2/package.conf.d
   Cabal-1.22.4.0
   array-0.5.1.0
   base-4.8.1.0
   bin-package-db-0.0.0.0
   binary-0.7.5.0
   bytestring-0.10.6.0
   containers-0.5.6.2
   deepseq-1.4.1.1
   directory-1.2.2.0
   filepath-1.4.0.0
   ghc-7.10.2
   ghc-prim-0.4.0.0
   haskeline-0.7.2.1
   hoopl-3.10.0.2
   hpc-0.6.0.2
   integer-gmp-1.0.0.0
   pretty-1.1.2.0
   process-1.2.3.0
   rts-1.0
   template-haskell-2.10.0.0
   terminfo-0.4.0.1
   time-1.5.0.1
   transformers-0.4.2.0
   unix-2.7.1.0
   xhtml-3000.2.1
/home/vagrant/roshask/.cabal-sandbox/x86_64-linux-ghc-7.10.2-packages.conf.d
   Boolean-0.2.3
   BoundedChan-1.0.3.0
   HUnit-1.3.0.0
   HaXml-1.25.3
   HsOpenSSL-0.11.1.1
   MemoTrie-0.6.4
   MonadCatchIO-transformers-0.3.1.3
   NumInstances-1.4
   ROS-std-msgs-0.3
   SafeSemaphore-0.10.1
   aeson-0.10.0.0
   attoparsec-0.13.0.1
   attoparsec-enumerator-0.3.4
   base-compat-0.8.2
   base64-bytestring-1.0.0.1
   blaze-builder-0.4.0.1
   blaze-builder-enumerator-0.2.1.0
   bytestring-builder-0.10.6.0.0
   bytestring-mmap-0.2.2
   case-insensitive-1.2.0.5
   cereal-0.5.1.0
   crypto-api-0.13.2
   data-default-generics-0.3
   dlist-0.7.1.2
   entropy-0.3.7
   enumerator-0.4.20
   extensible-exceptions-0.1.1.4
   filemanip-0.3.6.3
   hashable-1.2.3.3
   haxr-3000.11.1.2
   http-common-0.8.2.0
   http-streams-0.8.3.3
   http-types-0.9
   io-streams-1.3.2.0
   lens-family-core-1.2.0
   lens-family-th-0.4.1.0
   monads-tf-0.1.0.2
   mtl-2.2.1
   mtl-compat-0.2.1.3
   network-2.6.2.1
   network-uri-2.6.0.3
   old-locale-1.0.0.7
   old-time-1.1.0.3
   openssl-streams-1.2.1.0
   parsec-3.1.9
   polyparse-1.11
   primitive-0.6.1.0
   pureMD5-2.1.2.1
   random-1.1
   regex-base-0.93.2
   regex-posix-0.95.2
   roshask-0.3
   safe-0.3.9
   scientific-0.3.4.4
   snap-core-0.9.8.0
   snap-server-0.9.5.1
   stm-2.4.4
   storable-record-0.0.3
   storable-tuple-0.0.2
   streaming-commons-0.1.15
   syb-0.6
   tagged-0.8.2
   text-1.2.1.3
   unix-compat-0.4.1.4
   unordered-containers-0.2.5.1
   uri-0.1.6.3
   utf8-string-1.0.1.1
   utility-ht-0.0.11
   vector-0.11.0.0
   vector-space-0.10.2
   xml-1.3.14
   zlib-0.6.1.1
   zlib-bindings-0.1.1.5
   zlib-enum-0.2.3.1

You can see if any of those are obviously different from what you have, or you can paste your ghc-pkg list output and I can take a look, too, just in case there is an issue with a particular version of a particular library. The situation you're describing sounds like something going wrong with closing the socket, which I would expect to only be affected by the version of the network package. You can see the sandbox I used for testing has network-2.6.2.1 installed. Does that line up with your configuration?

tjtolon commented 9 years ago

This is very odd. Looking at the output of ghc-pkg list, I do not notice anything odd in my setting:

tatu@tatu-carbon:~/prog/ros/roshask$ ghc-pkg list
/opt/ghc/7.10.3/lib/ghc-7.10.2.20151118/package.conf.d
   Cabal-1.22.4.0
   array-0.5.1.0
   base-4.8.2.0
   bin-package-db-0.0.0.0
   binary-0.7.5.0
   bytestring-0.10.6.0
   containers-0.5.6.2
   deepseq-1.4.1.1
   directory-1.2.2.0
   filepath-1.4.0.0
   ghc-7.10.2.20151118
   ghc-prim-0.4.0.0
   haskeline-0.7.2.1
   hoopl-3.10.0.2
   hpc-0.6.0.2
   integer-gmp-1.0.0.0
   pretty-1.1.2.0
   process-1.2.3.0
   rts-1.0
   template-haskell-2.10.0.0
   terminfo-0.4.0.1
   time-1.5.0.1
   transformers-0.4.2.0
   unix-2.7.1.0
   xhtml-3000.2.1
/home/tatu/.ghc/x86_64-linux-7.10.2.20151118/package.conf.d
   Boolean-0.2.3
   BoundedChan-1.0.3.0
   HUnit-1.3.0.0
   HaXml-1.25.3
   HsOpenSSL-0.11.1.1
   MemoTrie-0.6.4
   MonadCatchIO-transformers-0.3.1.3
   MonadRandom-0.4.1
   NumInstances-1.4
   QuickCheck-2.8.1
   ROS-std-msgs-0.3
   SafeSemaphore-0.10.1
   StateVar-1.1.0.1
   aeson-0.10.0.0
   ansi-terminal-0.6.2.3
   async-2.0.2
   attoparsec-0.13.0.1
   attoparsec-enumerator-0.3.4
   base-compat-0.8.2
   base-orphans-0.4.4
   base64-bytestring-1.0.0.1
   bifunctors-5.1
   blaze-builder-0.4.0.1
   blaze-builder-enumerator-0.2.1.0
   bytestring-builder-0.10.6.0.0
   bytestring-mmap-0.2.2
   cabal-helper-0.6.2.0
   case-insensitive-1.2.0.5
   cereal-0.4.1.1
   cereal-0.5.1.0
   cmdargs-0.10.13
   comonad-4.2.7.2
   conduit-1.2.5.1
   contravariant-1.3.3
   cpphs-1.19.3
   crypto-api-0.13.2
   data-default-generics-0.3
   distributive-0.4.4
   djinn-ghc-0.0.2.3
   djinn-lib-0.0.1.2
   dlist-0.7.1.2
   either-4.4.1
   enclosed-exceptions-1.0.1.1
   entropy-0.3.7
   enumerator-0.4.20
   exceptions-0.8.0.2
   extensible-exceptions-0.1.1.4
   extra-1.4.2
   fclabels-2.0.2.3
   filemanip-0.3.6.3
   free-4.12.1
   ghc-mod-5.4.0.0
   ghc-paths-0.1.0.9
   ghc-syb-utils-0.2.3
   hashable-1.2.3.3
   haskell-src-exts-1.16.0.1
   haskell-src-exts-1.17.0
   hasktags-0.69.1
   haxr-3000.11.1.2
   hlint-1.9.22
   hlint-1.9.25
   hscolour-1.23
   http-common-0.8.2.0
   http-streams-0.8.3.3
   http-types-0.9
   io-streams-1.3.2.0
   json-0.9.1
   lens-family-core-1.2.0
   lens-family-th-0.4.1.0
   lifted-base-0.2.3.6
   mmorph-1.0.4
   monad-control-1.0.0.4
   monad-journal-0.7.1
   monads-tf-0.1.0.2
   mtl-2.2.1
   mtl-compat-0.2.1.3
   network-2.6.2.1
   network-uri-2.6.0.3
   old-locale-1.0.0.7
   old-time-1.1.0.3
   openssl-streams-1.2.1.0
   parsec-3.1.9
   pipes-4.1.7
   polyparse-1.11
   prelude-extras-0.4.0.2
   primitive-0.6.1.0
   profunctors-5.1.1
   pureMD5-2.1.2.1
   random-1.1
   refact-0.3.0.2
   regex-base-0.93.2
   regex-posix-0.95.2
   resourcet-1.1.6
   roshask-0.3
   safe-0.3.9
   scientific-0.3.4.4
   semigroupoids-5.0.0.4
   semigroups-0.18.0.1
   snap-core-0.9.8.0
   snap-server-0.9.5.1
   split-0.2.2
   stm-2.4.4
   storable-record-0.0.3
   storable-tuple-0.0.2
   streaming-commons-0.1.15
   strict-0.3.2
   stylish-haskell-0.5.14.4
   syb-0.6
   tagged-0.8.2
   temporary-1.2.0.3
   text-1.2.1.3
   tf-random-0.5
   transformers-base-0.4.4
   transformers-compat-0.4.0.4
   uniplate-1.6.12
   unix-compat-0.4.1.4
   unordered-containers-0.2.5.1
   uri-0.1.6.3
   utf8-string-1.0.1.1
   utility-ht-0.0.11
   vector-0.11.0.0
   vector-space-0.10.2
   void-0.7.1
   xml-1.3.14
   yaml-0.8.15.1
   zlib-0.6.1.1
   zlib-bindings-0.1.1.5
   zlib-enum-0.2.3.1

I'd like to try out debugging this more deeply on my end. If you were to be able to reproduce this error, how would you proceed on testing this?

acowley commented 9 years ago

Since the bug is happening on disconnect, and it's spinning the CPU, I'd guess that something is looping trying to write to a disconnected socket. The node logic itself is rate limited, so it is most likely lower level.

We send the bytes here, and, just above that, we have the loop that I'm suspicious of. We should be catching exceptions there, and cleaning up the client connection.

You can see here we have some cleanup logic for each client, so we need run that when we catch an un-recoverable exception in the sending loop. So we need to pass clients to serviceClient, then, in the exception handler we add, run an STM transaction to find the appropriate cleanup action from clients, and remove it from the list so it is not run again when the node is cleaning itself up.

tjtolon commented 9 years ago

Thank you for sharing your thoughts! That sounds reasonable. I will look more closely to this, even though my haskell-fu is not sufficient enough. Maybe I'll learn something, which is never a bad thing.

I'm a bit confused: Why wouldn't all the exception be caught in here, or, what cleanup1 catch is actually catching? What would be a probable reason why we have difficulties in reproducing the error in your setup? Would it be that I'm using a bit slower laptop that send message buffer can grow large enough that sendAll takes so much time I am able to insert the exeption to the "correct" spot? I.e. pushing the node logic rate higher might introduce the error on your setup?

acowley commented 9 years ago

That catch is indeed supposed to catch an error like this.

My inability to reproduce the issue is strange, and I can't explain it yet. Perhaps I'm completely on the wrong track.

A first step would be to stick putStrLns where we expect to see them at predictable intervals. For instance, in the talker node's loop, and in the serviceClient loop, and perhaps even in the existing exception handler you linked to.