CodisLabs / jodis

A java client for codis based on Jedis and Curator
MIT License
216 stars 97 forks source link

Thread-safety problem in RoundRobinJedisPool #55

Closed zdxiq125 closed 5 years ago

zdxiq125 commented 6 years ago

First of all, issue raised up from io.codis.jodis:0.4.1. We have frequently ran into exception Pool not open which threw by RedisPool

RoundRobinJedisPool.resetPools is not thread-safety. Although RoundRobinJedisPool#pools is carefully declared volatile and ImmutableList, the inner pool (thus, JedisPool) can still be modified. At the end of resetPools, it starts closing pool(s) that not exists under zookeeper path:

        for (PooledObject pool: addr2Pool.values()) {
            LOG.info("Remove proxy: " + pool.addr);
            pool.pool.close();
        }

this operation may close pool(s) already inited by other thread (and successfully assigned to RoundRobinJedisPool#pools) by mistake. For example:

This makes pools contains a closed JedisPool c.

Apache9 commented 6 years ago

Is it possible that two threads want to call resetPools concurrently? IIRC, the curator will always uses a single threaded thread pool to execute the callbacks. If not, then we need to add a synchronized modifier to the resetPools method.

zdxiq125 commented 6 years ago

@Apache9 THANKS for your reply. I will check the curator doc.

But we doubt that the ctor of RoundRobinJedisPool may lead to multi-access to resetPools. In our case, we call the build() to create an instance (auto wired by spring actually).

    pool = RoundRobinJedisPool.create()
        .curatorClient(zkServer, zkTimeout)
        .zkProxyDir(zkCodisProxyDir)
        .build();

There will be two thread finally reach resetPools which is beyond our control. To illustrate,

private RoundRobinJedisPool(CuratorFramework curatorClient, boolean closeCurator,
            String zkProxyDir, JedisPoolConfig poolConfig, int connectionTimeoutMs, int soTimeoutMs,
            String password, int database, String clientName) {
        ···
        watcher = new PathChildrenCache(curatorClient, zkProxyDir, true);
        watcher.getListenable().addListener(new PathChildrenCacheListener() {

            private void logEvent(PathChildrenCacheEvent event) {
                ···
                LOG.info(msg.toString());
            }

            @Override
            public void childEvent(CuratorFramework client, PathChildrenCacheEvent event)
                    throws Exception {
                logEvent(event);
                if (RESET_TYPES.contains(event.getType())) {
                    resetPools();                                        // NOTE this
                }
            }
        });
        try {
            watcher.start(BUILD_INITIAL_CACHE);      //Thread B
        } catch (Exception e) {
            close();
            throw new JedisException(e);
        }
        resetPools();                                                    //Thread A
    }

During our application start-up, childEvent comes at the same time (we can see the event log, type CHILD_ADD which RESET_TYPES contains).

Or else, is there any misunderstanding about the mechanism?

weilinqwe commented 6 years ago

我正好碰到这个问题,启动的时候, 主线程resetpool, 然后listener线程听到node add消息也resetpool,造成两倍的连接数,pools的连接池里还是对的,但有多余的pool生成了没被删除

2018-07-26 15:53:37,883 [main-EventThread] INFO  o.a.c.f.s.ConnectionStateManager - State change: CONNECTED
2018-07-26 15:53:38,023 [Curator-PathChildrenCache-0] INFO  c.s.i.i.jodis.RoundRobinJedisPool - Receive child event: type=CONNECTION_RECONNECTED, no data
2018-07-26 15:53:38,057 [Curator-PathChildrenCache-0] INFO  c.s.i.i.jodis.RoundRobinJedisPool - Receive child event: type=CHILD_ADDED, path=/jodis/codis-iptv/proxy-72f4013ead5c6a213432478a03047397, stat=8069,8069,1532499605339,1532499605339,0,0,0,100077600475054269,201,0,8069
, bytes length=201
2018-07-26 15:53:38,207 [Curator-PathChildrenCache-0] INFO  c.s.i.i.jodis.RoundRobinJedisPool - Add new proxy: 192.168.11.72:29000
2018-07-26 15:53:38,208 [Curator-PathChildrenCache-0] INFO  c.s.i.i.jodis.RoundRobinJedisPool - Add new proxy: 192.168.11.89:29000
2018-07-26 15:53:38,210 [Curator-PathChildrenCache-0] INFO  c.s.i.i.jodis.RoundRobinJedisPool - Receive child event: type=CHILD_ADDED, path=/jodis/codis-iptv/proxy-adb4cbd6c40c8705d99eaf4409588067, stat=8091,8091,1532581151277,1532581151277,0,0,0,100077600475054274,201,0,8091
, bytes length=201
2018-07-26 15:53:38,210 [main] INFO  c.s.i.i.jodis.RoundRobinJedisPool - Add new proxy: 192.168.11.72:29000
2018-07-26 15:53:38,213 [main] INFO  c.s.i.i.jodis.RoundRobinJedisPool - Add new proxy: 192.168.11.89:29000
Apache9 commented 6 years ago

OK, so the problem is the resetPools during start up. Let me prepare a fix.

weilinqwe commented 6 years ago

a simple solution: first start the cache and resetpools, then add the listenner. to avoid the small possibility of concurrency, sleep a while between the two action.

zdxiq125 commented 6 years ago

@weilinqwe Maybe just add a synchonized modifier to resetPools will be more effective. Since the curator will exactly use single thread executor to process child event.

public class PathChildrenCache implements Closeable
{
    private final CloseableExecutorService executorService;
    ···
    public PathChildrenCache(CuratorFramework client, String path, boolean cacheData)
    {
        this(client, path, cacheData, false, new CloseableExecutorService(Executors.newSingleThreadExecutor(defaultThreadFactory), true));
    }
    ···
}

So, synchonized will avoid the currency during instantiation with little drawback.

weilinqwe commented 6 years ago

understood. startup is the only chance involving concurrency.

zdxiq125 commented 5 years ago

Has it already been fixed? @Apache9

Apache9 commented 5 years ago

Please try 0.5.1.