bigdata4u / spymemcached

Automatically exported from code.google.com/p/spymemcached

file descriptor leak when out of direct memory #231

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Using version 2.7.3, I have encountered the following problem:

When you run out of direct memory, you end up leaking unconnected sockets. The 
reason, I think, is that we open the socket, then allocate the direct memory 
buffers, and then attempt the connect. If something goes wrong allocating the 
buffers, the socket is leaked. If you run lsof on the process, you see a bunch 
of entries that say "can't identify protocol".

I am not 100% sure about this, but I am fairly confident. I haven't yet 
reproduced it by crashing on purpose to confirm that this is indeed the cause 
of our occasional problem of seeing tons of "can't identify protocol" socket 
entries in lsof as our server dies ...

I think the fix is to allocate the direct memory first. I don't think you want 
to be catching that error.

In our case we limit direct memory to 256M.
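
A minimal sketch of the ordering I mean (not the actual spymemcached 
internals; the class and method names here are hypothetical): allocate the 
direct buffers before opening the socket, so an OutOfMemoryError can't strand 
an already-open channel.

    import java.io.IOException;
    import java.net.SocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    // Hypothetical sketch; not the actual spymemcached code.
    class SafeNodeSetup {
        static SocketChannel openNode(SocketAddress sa, int bufSize)
                throws IOException {
            // If either allocation throws OutOfMemoryError, no socket
            // exists yet, so nothing is leaked.
            ByteBuffer readBuf = ByteBuffer.allocateDirect(bufSize);
            ByteBuffer writeBuf = ByteBuffer.allocateDirect(bufSize);

            SocketChannel ch = SocketChannel.open();
            try {
                ch.configureBlocking(false);
                ch.connect(sa);
                // ... hand ch, readBuf, and writeBuf to the node here ...
                return ch;
            } catch (IOException e) {
                ch.close(); // don't leak the channel on a failed connect
                throw e;
            }
        }
    }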

Original issue reported on code.google.com by jpa...@gmail.com on 23 Jan 2012 at 8:20

GoogleCodeExporter commented 8 years ago
I have reproduced this by throwing an exception at the same point in the 
program where the out-of-memory error occurs, and sure enough I leaked a bunch 
of unconnected sockets that say "can't identify protocol" when you list them 
with lsof.

I think we have got to recycle the direct memory buffers in the event that 
this occurs; otherwise the wrong scenario, with the wrong GC parameters, will 
result in out-of-direct-memory errors. E.g., we're running with the G1 
collector. I just did a test with finalizers and they are never run ... maybe 
eventually, but I had over 400k objects that needed finalizing and they were 
never finalized. Running with the concurrent mark-and-sweep collector cleared 
those puppies up almost as quickly as I created them.
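
For reference, a sketch of the kind of finalizer test described (a 
reconstruction, not the exact code); run it once with -XX:+UseG1GC and once 
with -XX:+UseConcMarkSweepGC and compare the counts:

    import java.util.concurrent.atomic.AtomicInteger;

    // Reconstruction of the kind of finalizer test described above.
    class FinalizerTest {
        static final AtomicInteger finalized = new AtomicInteger();

        static class Finalizable {
            @Override
            protected void finalize() {
                finalized.incrementAndGet();
            }
        }

        public static void main(String[] args) throws InterruptedException {
            for (int i = 1; i <= 400000; i++) {
                new Finalizable();
                if (i % 100000 == 0) {
                    System.out.printf("created=%d finalized=%d%n",
                                      i, finalized.get());
                }
            }
            System.gc();
            Thread.sleep(2000); // give the finalizer thread a chance to run
            System.out.printf("finalized %d of 400000%n", finalized.get());
        }
    }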

Original comment by jpa...@gmail.com on 23 Jan 2012 at 3:35

GoogleCodeExporter commented 8 years ago
Thanks for the info, we'll have to look into this.

Which platform and JVM did you test with?

Original comment by ingen...@gmail.com on 5 Feb 2012 at 7:45

GoogleCodeExporter commented 8 years ago
Oh, sorry for the delay. I kinda gave up on anybody noticing and thought I 
must have used the wrong forum.

It was Linux, basically, with the Sun/Oracle 1.6 VM.

Original comment by jpa...@gmail.com on 25 Feb 2012 at 4:57

GoogleCodeExporter commented 8 years ago
Any chance you still have your test available?

Original comment by ingen...@gmail.com on 16 Mar 2012 at 5:15

GoogleCodeExporter commented 8 years ago
I think I just threw an OutOfMemoryError in the createMemcachedNode method. 
Since the socket was already created, the error caused the leak. In the code 
below I have overridden that method to catch the OutOfMemoryError and close 
the socket.

    // Imports assumed for this snippet: java.io.IOException,
    // java.net.SocketAddress, java.nio.channels.SocketChannel,
    // java.util.List, and net.spy.memcached.*.
    static class BinaryKetamaConnectionFactory extends BinaryConnectionFactory {
        /**
         * Create a KetamaConnectionFactory with the given maximum operation
         * queue length and the given read buffer size.
         *
         * @param qLen the maximum operation queue length
         * @param bufSize the read buffer size, in bytes
         * @param opQueueMaxBlockTime the maximum time to block waiting for op
         *        queue operations to complete, in milliseconds (not passed
         *        through to the superclass here)
         */
        public BinaryKetamaConnectionFactory(int qLen, int bufSize,
                                             long opQueueMaxBlockTime) {
            super(qLen, bufSize, HashAlgorithm.KETAMA_HASH);
        }

        /**
         * Create a KetamaConnectionFactory with the default parameters.
         */
        public BinaryKetamaConnectionFactory() {
            this(DEFAULT_OP_QUEUE_LEN, DEFAULT_READ_BUFFER_SIZE,
                 DEFAULT_OP_QUEUE_MAX_BLOCK_TIME);
        }

        @Override
        public MemcachedNode createMemcachedNode(SocketAddress sa,
                                                 SocketChannel c, int bufSize) {
            try {
                return super.createMemcachedNode(sa, c, bufSize);
            } catch (OutOfMemoryError e) {
                // Typically "Direct buffer memory": the channel is already
                // open at this point, so close it before rethrowing.
                // getLogger() is inherited from SpyObject via
                // DefaultConnectionFactory.
                try {
                    getLogger().warn("closing leaked socket on OutOfMemoryError: " + c);
                    c.close();
                } catch (IOException e1) {
                    // Nothing more we can do here.
                }
                throw e;
            }
        }

        /* (non-Javadoc)
         * @see net.spy.memcached.ConnectionFactory#createLocator(java.util.List)
         */
        @Override
        public NodeLocator createLocator(List<MemcachedNode> nodes) {
            return new KetamaNodeLocator(nodes, getHashAlg());
        }
    }
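
For completeness, a hypothetical usage sketch, assuming the standard 
MemcachedClient constructor and net.spy.memcached.AddrUtil (host names are 
placeholders):

    // Hypothetical usage; host names are placeholders.
    MemcachedClient client = new MemcachedClient(
        new BinaryKetamaConnectionFactory(),
        AddrUtil.getAddresses("memcached1:11211 memcached2:11211"));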

Original comment by jpa...@gmail.com on 22 Mar 2012 at 10:42