chrislusf / glow

Glow is an easy-to-use distributed computation system written in Go, similar to Hadoop Map Reduce, Spark, Flink, Storm, etc. I am also working on another similar pure Go system, https://github.com/chrislusf/gleam , which is more flexible and more performant.

Failed to create a queue on disk error #43

Closed radike closed 8 years ago

radike commented 8 years ago

I get the error "Failed to create a queue on disk: Failed to open ... no such file or directory" when a program is executed repeatedly in distributed mode. The problem seems to be related to the size of the cluster: it happens often in a cluster of 20 computers, but a cluster of 10 computers works fine.

The problem is caused by RotatingFileStore#init, which fails to open the old log files, and therefore, CreateNamedDatasetShard returns nil.

I think RotatingFileStore#init should not open the old log files, because the previous statement in CreateNamedDatasetShard (m.doDelete) should already have removed them. Could the problem be caused by ioutil.ReadDir(l.dir()) returning a stale view of the file system?

Top of the stack trace:

2016/04/29 12:33:51 Failed to create a queue on disk: Failed to open bbaf2eac-ct-0-ds-2-shard-1111-8816-2016-04-29T12-01-44.112.dat: open bbaf2eac-ct-0-ds-2-shard-1111-8816-2016-04-29T12-01-44.112.dat: no such file or directory
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x20 pc=0x4ca2f8]

goroutine 79 [running]:
github.com/chrislusf/glow/util.WriteBytes(0x0, 0x0, 0xc8201a2d90, 0x4, 0x4, 0xc82040c720)
    /corpora/programy/manatee-go/git/src/github.com/chrislusf/glow/util/read_write.go:64 +0xf8
github.com/chrislusf/glow/agent.(*AgentServer).handleLocalWriteConnection(0xc82001c540, 0x7fe6fb99c2f0, 0xc8201b81c0, 0xc8201a1b80, 0x1d)
    /corpora/programy/manatee-go/git/src/github.com/chrislusf/glow/agent/agent_server_write.go:25 +0x1bf
github.com/chrislusf/glow/agent.(*AgentServer).handleRequest(0xc82001c540, 0x7fe6fb99c290, 0xc8201b81c0)
    /corpora/programy/manatee-go/git/src/github.com/chrislusf/glow/agent/agent_server.go:172 +0x854
github.com/chrislusf/glow/agent.(*AgentServer).Run.func2(0xc82001c540, 0x7fe6fb99c290, 0xc8201b81c0)
    /corpora/programy/manatee-go/git/src/github.com/chrislusf/glow/agent/agent_server.go:135 +0xa1
created by github.com/chrislusf/glow/agent.(*AgentServer).Run
    /corpora/programy/manatee-go/git/src/github.com/chrislusf/glow/agent/agent_server.go:136 +0x367

chrislusf commented 8 years ago

Could you try deleting lines 65 to 82 and see whether that helps?