epoll busy loop - Githubissues

defsky commented 4 years ago

我写了个客户，启动500个协程去连接服务器后立刻关闭连接。发现有时候客户端这边连接已经关闭完了，服务端资源还没释放，就像下面这样：

top - 14:59:36 up 29 days, 18:10,  2 users,  load average: 1.01, 0.57, 0.26
Tasks: 268 total,   1 running, 267 sleeping,   0 stopped,   0 zombie
Cpu0  : 53.9%us, 46.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.7%us,  0.7%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  2.6%us,  0.3%sy,  0.0%ni, 97.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.3%us,  0.7%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1923764k total,  1737412k used,   186352k free,   191004k buffers
Swap:  4128760k total,    81104k used,  4047656k free,   892780k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5179 root      20   0  981m 6928 2960 S 100.5  0.4   3:11.67 mirserver
 2531 root      20   0  871m 362m 5408 S  2.6 19.3   1218:10 java
 5244 root      20   0  2704 1208  872 R  1.0  0.1   0:10.24 top

测试用的客户端代码是这样的：

func main(){
    loops, err := strconv.Atoi(os.Args[1])
    if err != nil {
        log.Fatalln(err)
    }

    success,failed := 0,0

    wg := &sync.WaitGroup{}
    for i:=0;i<loops;i++ {
        go func(){
            wg.Add(1)
            defer wg.Done()

            conn, err := net.DialTimeout("tcp", "login.afkplayer.com:7000", time.Second * 60)
            if err != nil {
                failed ++
                log.Println(err)
                return
            }
            success ++
            conn.Close()
        }()

    }

    wg.Wait()
    log.Println("Test complete...")
    log.Printf("Success: %d Failed: %d\n", success, failed)
}

执行结果是这样的：

E:\HardGo\src>go run main.go 500
2019/12/13 14:57:37 Test complete...
2019/12/13 14:57:37 Success: 500 Failed: 0

Allenxuxu commented 4 years ago

https://github.com/Allenxuxu/gev/blob/9d9d1c0251eb35666afb0b2839b1ac2ae0f138b6/connection/connection.go#L44

https://github.com/Allenxuxu/ringbuffer/blob/master/pool/pool.go

这是有意为之的。要等下次 gc 触发，才会释放。

defsky commented 4 years ago

https://github.com/Allenxuxu/gev/blob/9d9d1c0251eb35666afb0b2839b1ac2ae0f138b6/connection/connection.go#L44

https://github.com/Allenxuxu/ringbuffer/blob/master/pool/pool.go

这是有意为之的。

意思是这居高不下的CPU使用率，等有新连接进来了会重用吗？

Allenxuxu commented 4 years ago

https://github.com/Allenxuxu/gev/blob/9d9d1c0251eb35666afb0b2839b1ac2ae0f138b6/connection/connection.go#L44

https://github.com/Allenxuxu/ringbuffer/blob/master/pool/pool.go 这是有意为之的。

意思是这居高不下的CPU使用率，等有新连接进来了会重用吗？

CPU使用率居高不下？？？？我的意思内存没有很快降下来很正常。

defsky commented 4 years ago

https://github.com/Allenxuxu/gev/blob/9d9d1c0251eb35666afb0b2839b1ac2ae0f138b6/connection/connection.go#L44

https://github.com/Allenxuxu/ringbuffer/blob/master/pool/pool.go 这是有意为之的。

意思是这居高不下的CPU使用率，等有新连接进来了会重用吗？

CPU使用率居高不下？？？？我的意思内存没有很快降下来很正常。

就是那个 cpu使用率一直就像上面我发的top命令显示那样，一直100%

Allenxuxu commented 4 years ago

应该是客户端虽然断开了，但是服务端还在处理断开逻辑，你可以加上打印，看看所有链接断开后 cpu 使用率情况。

defsky commented 4 years ago

应该是客户端虽然断开了，但是服务端还在处理断开逻辑，你可以加上打印，看看所有链接断开后 cpu 使用率情况。

这都过了快两个小时了，一直没有新连接进来，但是服务端一直这样 cpu 100%

Allenxuxu commented 4 years ago

应该是客户端虽然断开了，但是服务端还在处理断开逻辑，你可以加上打印，看看所有链接断开后 cpu 使用率情况。

这都过了快两个小时了，一直没有新连接进来，但是服务端一直这样 cpu 100%

把完整代码贴出来，我等会有时间试试

defsky commented 4 years ago

应该是客户端虽然断开了，但是服务端还在处理断开逻辑，你可以加上打印，看看所有链接断开后 cpu 使用率情况。

这都过了快两个小时了，一直没有新连接进来，但是服务端一直这样 cpu 100%

把完整代码贴出来，我等会有时间试试

这里是代码： https://github.com/defsky/mircore.git

Allenxuxu commented 4 years ago

我这边一直 cpu 使用率一直很正常（macbook），你是啥环境？

Allenxuxu commented 4 years ago

我写了个客户，启动500个协程去连接服务器后立刻关闭连接。发现有时候客户端这边连接已经关闭完了，服务端资源还没释放，就像下面这样：

top - 14:59:36 up 29 days, 18:10,  2 users,  load average: 1.01, 0.57, 0.26
Tasks: 268 total,   1 running, 267 sleeping,   0 stopped,   0 zombie
Cpu0  : 53.9%us, 46.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.7%us,  0.7%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  2.6%us,  0.3%sy,  0.0%ni, 97.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.3%us,  0.7%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1923764k total,  1737412k used,   186352k free,   191004k buffers
Swap:  4128760k total,    81104k used,  4047656k free,   892780k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5179 root      20   0  981m 6928 2960 S 100.5  0.4   3:11.67 mirserver
 2531 root      20   0  871m 362m 5408 S  2.6 19.3   1218:10 java
 5244 root      20   0  2704 1208  872 R  1.0  0.1   0:10.24 top

测试用的客户端代码是这样的：

func main(){
    loops, err := strconv.Atoi(os.Args[1])
    if err != nil {
        log.Fatalln(err)
    }

    success,failed := 0,0

    wg := &sync.WaitGroup{}
    for i:=0;i<loops;i++ {
        go func(){
            wg.Add(1)
            defer wg.Done()

            conn, err := net.DialTimeout("tcp", "login.afkplayer.com:7000", time.Second * 60)
            if err != nil {
                failed ++
                log.Println(err)
                return
            }
            success ++
            conn.Close()
        }()

    }

    wg.Wait()
    log.Println("Test complete...")
    log.Printf("Success: %d Failed: %d\n", success, failed)
}

执行结果是这样的：

E:\HardGo\src>go run main.go 500
2019/12/13 14:57:37 Test complete...
2019/12/13 14:57:37 Success: 500 Failed: 0

另外，上面客户端有问题， wg.Add(1) 不能放在 go fun(){}() 里面执行。

defsky commented 4 years ago

我这边一直 cpu 使用率一直很正常（macbook），你是啥环境？

centos 6.4

Allenxuxu commented 4 years ago

我这边一直 cpu 使用率一直很正常（macbook），你是啥环境？

centos 6.4

这个现象是必现，还是偶现

defsky commented 4 years ago

我这边一直 cpu 使用率一直很正常（macbook），你是啥环境？

centos 6.4

这个现象是必现，还是偶现

客户端多运行几次就会出现，用5000个并发感觉更容易出现

top - 20:17:37 up 29 days, 23:28,  2 users,  load average: 1.61, 1.16, 1.09
Tasks: 267 total,   1 running, 266 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.3%us,  1.3%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 53.5%us, 46.5%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 73.9%us, 26.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  4.3%us,  1.0%sy,  0.0%ni, 86.1%id,  8.6%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1923764k total,  1867108k used,    56656k free,    99064k buffers
Swap:  4128760k total,    81900k used,  4046860k free,  1123336k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6755 root      20   0  968m 8352 2980 S 201.6  0.4   3:57.12 mirserver
 2531 root      20   0  871m 361m 4428 S  4.0 19.2   1227:48 java
 6775 root      20   0  2704 1208  872 R  1.0  0.1   0:03.65 top

Allenxuxu commented 4 years ago

我这边一直 cpu 使用率一直很正常（macbook），你是啥环境？

centos 6.4

这个现象是必现，还是偶现

客户端多运行几次就会出现，用5000个并发感觉更容易出现

top - 20:17:37 up 29 days, 23:28,  2 users,  load average: 1.61, 1.16, 1.09
Tasks: 267 total,   1 running, 266 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.3%us,  1.3%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 53.5%us, 46.5%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 73.9%us, 26.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  4.3%us,  1.0%sy,  0.0%ni, 86.1%id,  8.6%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1923764k total,  1867108k used,    56656k free,    99064k buffers
Swap:  4128760k total,    81900k used,  4046860k free,  1123336k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6755 root      20   0  968m 8352 2980 S 201.6  0.4   3:57.12 mirserver
 2531 root      20   0  871m 361m 4428 S  4.0 19.2   1227:48 java
 6775 root      20   0  2704 1208  872 R  1.0  0.1   0:03.65 top

提供下你完整的客户端测试代码和复现步骤，我晚上回家用 ubuntu 试试。

defsky commented 4 years ago

我这边一直 cpu 使用率一直很正常（macbook），你是啥环境？

centos 6.4

这个现象是必现，还是偶现

客户端多运行几次就会出现，用5000个并发感觉更容易出现

top - 20:17:37 up 29 days, 23:28,  2 users,  load average: 1.61, 1.16, 1.09
Tasks: 267 total,   1 running, 266 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.3%us,  1.3%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 53.5%us, 46.5%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 73.9%us, 26.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  4.3%us,  1.0%sy,  0.0%ni, 86.1%id,  8.6%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1923764k total,  1867108k used,    56656k free,    99064k buffers
Swap:  4128760k total,    81900k used,  4046860k free,  1123336k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6755 root      20   0  968m 8352 2980 S 201.6  0.4   3:57.12 mirserver
 2531 root      20   0  871m 361m 4428 S  4.0 19.2   1227:48 java
 6775 root      20   0  2704 1208  872 R  1.0  0.1   0:03.65 top

我这边一直 cpu 使用率一直很正常（macbook），你是啥环境？

centos 6.4

这个现象是必现，还是偶现

客户端多运行几次就会出现，用5000个并发感觉更容易出现

top - 20:17:37 up 29 days, 23:28,  2 users,  load average: 1.61, 1.16, 1.09
Tasks: 267 total,   1 running, 266 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.3%us,  1.3%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 53.5%us, 46.5%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 73.9%us, 26.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  4.3%us,  1.0%sy,  0.0%ni, 86.1%id,  8.6%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1923764k total,  1867108k used,    56656k free,    99064k buffers
Swap:  4128760k total,    81900k used,  4046860k free,  1123336k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6755 root      20   0  968m 8352 2980 S 201.6  0.4   3:57.12 mirserver
 2531 root      20   0  871m 361m 4428 S  4.0 19.2   1227:48 java
 6775 root      20   0  2704 1208  872 R  1.0  0.1   0:03.65 top

提供下你完整的客户端测试代码和复现步骤，我晚上回家用 ubuntu 试试。

完整的代码都在 https://github.com/defsky/mircore 感觉可能是有些连接不能触发OnClose回调，所以一直没有关闭，测试客户端多运行几次后，服务端里面就会有些连接一直不关闭。而且出现这种情况的时候，server.Stop也会卡住，没法停止。只能kill

Allenxuxu commented 4 years ago

老哥，这个 bug 应该是修复了，你有空拉下最新的代码，再测一下看看。

https://github.com/Allenxuxu/gev/releases/tag/v0.1.6

defsky commented 4 years ago

老哥，这个 bug 应该是修复了，你有空拉下最新的代码，再测一下看看。

https://github.com/Allenxuxu/gev/releases/tag/v0.1.6

我用测试客户端试了一下，cpu持续高负载的情况看起来是解决了。不过客户端关闭后，服务端还是会有连接不能关闭

Allenxuxu commented 4 years ago

你如何知道，连接没有断开？

defsky commented 4 years ago

你如何知道，连接没有断开？

服务端有个定时任务，定时输出活动连接的数目。每个连接关闭的时候都会在OnClose回调里面将连接移除。但是客户端已经关闭了，活动连接数还不为0，说明就还有连接没有触发到OnClose回调吧

//OnClose callback when connection close
func (s *Realm) OnClose(c *connection.Connection) {
    slog.Printf("Connection closed from: %s", c.PeerAddr())

    e := c.Context().(*list.Element)

    s.mtx.Lock()
    s.conns.Remove(e)
    s.mtx.Unlock()
}

Allenxuxu commented 4 years ago

你如何知道，连接没有断开？

服务端有个定时任务，定时输出活动连接的数目。每个连接关闭的时候都会在OnClose回调里面将连接移除。但是客户端已经关闭了，活动连接数还不为0，说明就还有连接没有触发到OnClose回调吧
//OnClose callback when connection close
func (s *Realm) OnClose(c *connection.Connection) {
  slog.Printf("Connection closed from: %s", c.PeerAddr())

  e := c.Context().(*list.Element)

  s.mtx.Lock()
  s.conns.Remove(e)
  s.mtx.Unlock()
}

我这边测试一直正常啊，是不是你服务哪边有问题？

这是我测试服务端： https://github.com/Allenxuxu/gev/commit/85454ecbeb9920821d70e5cc461425f5b606f353

defsky commented 4 years ago

在OnConnect函数里面加一个Send操作就有毛病了，就像这样：

func (s *example) OnConnect(c *connection.Connection) {
    s.Count.Add(1)
    c.Send([]byte("hello"))
    //log.Println(" OnConnect ： ", c.PeerAddr())
}

效果像这样：

[root@devops echo]# go run echo-server.go
go: finding github.com/Allenxuxu/toolkit latest
2019/12/17 15:37:33 [Gev] connections :0
2019/12/17 15:37:35 [Gev] connections :0
2019/12/17 15:37:37 [Gev] connections :327
2019/12/17 15:37:39 [Gev] connections :-131
2019/12/17 15:37:41 [Gev] connections :-1016
2019/12/17 15:37:43 [Gev] connections :-1016
2019/12/17 15:37:45 [Gev] connections :-1016
2019/12/17 15:37:47 [Gev] connections :-1034
2019/12/17 15:37:49 [Gev] connections :-1034
2019/12/17 15:37:51 [Gev] connections :-1034
2019/12/17 15:37:53 [Gev] connections :-1034
2019/12/17 15:37:55 [Gev] connections :-1034

你试试呢？

Allenxuxu commented 4 years ago

在OnConnect函数里面加一个Send操作就有毛病了，就像这样：

func (s *example) OnConnect(c *connection.Connection) {
  s.Count.Add(1)
  c.Send([]byte("hello"))
  //log.Println(" OnConnect ： ", c.PeerAddr())
}

效果像这样：

[root@devops echo]# go run echo-server.go
go: finding github.com/Allenxuxu/toolkit latest
2019/12/17 15:37:33 [Gev] connections :0
2019/12/17 15:37:35 [Gev] connections :0
2019/12/17 15:37:37 [Gev] connections :327
2019/12/17 15:37:39 [Gev] connections :-131
2019/12/17 15:37:41 [Gev] connections :-1016
2019/12/17 15:37:43 [Gev] connections :-1016
2019/12/17 15:37:45 [Gev] connections :-1016
2019/12/17 15:37:47 [Gev] connections :-1034
2019/12/17 15:37:49 [Gev] connections :-1034
2019/12/17 15:37:51 [Gev] connections :-1034
2019/12/17 15:37:53 [Gev] connections :-1034
2019/12/17 15:37:55 [Gev] connections :-1034

你试试呢？

重复 close ，这是个 bug，修复了，老哥再试试呢

https://github.com/Allenxuxu/gev/releases/tag/v0.1.7

Allenxuxu commented 4 years ago

@defsky 试了么，老哥

defsky commented 4 years ago

@defsky 试了么，老哥

好像回复没成功呢。。。现在OnClose回调里面获取Context会崩溃

Allenxuxu commented 4 years ago

ok，看来 busy loop 的问题修复了。

Context 是我的问题，我不小心把之前修复的代码又给还原了😂。

Context 相关 issue： https://github.com/Allenxuxu/gev/issues/24

已经修复了： https://github.com/Allenxuxu/gev/releases/tag/v0.1.8

Allenxuxu / gev

epoll busy loop #25