chrislusf / gleam

Fast, efficient, and scalable distributed map/reduce system, DAG execution, in memory or on disk, written in pure Go, runs standalone or distributed.
Apache License 2.0

runner to executor grpc connection issue #80

Open · carusyte opened 6 years ago

carusyte commented 6 years ago

I quite frequently get the following kind of error messages in one environment but not in another:

runner heartbeat to [::]:16198: runner => executor [::]:16198: rpc error: code = Unavailable desc = grpc: the connection is unavailable
...
runner reportStatus to [::]:44223: runner => executor [::]:44223: rpc error: code = Unavailable desc = grpc: the connection is unavailable

Looking into the code, I found that the gRPC connection is closed after a 50 ms delay, without knowing whether the function has finished its work on the connection. I presume this could be the cause. Would it be safer to call grpcConection.Close() only after fn(client) has returned?

gio/runner_grpc_client_to_executor.go:

func withClient(server string, fn func(client pb.GleamExecutorClient) error) error {
    if server == "" {
        return nil
    }

    grpcConection, err := grpc.Dial(server, grpc.WithInsecure())
    if err != nil {
        return fmt.Errorf("executor dial agent: %v", err)
    }
    defer func() {
        time.Sleep(50 * time.Millisecond)
        grpcConection.Close()                  // The connection could possibly be closed prematurely
    }()
    client := pb.NewGleamExecutorClient(grpcConection)

    return fn(client)
}
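
For clarity, this is roughly the change I had in mind, just a sketch and untested: drop the sleep and let the deferred Close() run only once fn(client) has returned.

func withClient(server string, fn func(client pb.GleamExecutorClient) error) error {
    if server == "" {
        return nil
    }

    grpcConection, err := grpc.Dial(server, grpc.WithInsecure())
    if err != nil {
        return fmt.Errorf("executor dial agent: %v", err)
    }
    // close the connection only after fn(client) has returned
    defer grpcConection.Close()

    client := pb.NewGleamExecutorClient(grpcConection)
    return fn(client)
}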
carusyte commented 6 years ago

I attempted to verify the fix in my environment, but I don't know how to point the original gleam package at my fork and build a fresh gleam binary from it without breaking all the imports. Working on it...

carusyte commented 6 years ago

Actually, it has nothing to do with the defer statement; the Close() is guaranteed to run after fn has finished anyway. However, I've fixed the issue by making some adjustments to the gRPC setup. Please consider pull request #81 if anybody else is experiencing the same issue.
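
To illustrate one kind of adjustment (just a sketch, not necessarily the exact change in the PR), blocking the dial until the connection is ready avoids issuing RPCs against a channel that is still connecting. This assumes the context package is imported and uses an arbitrary 5-second dial timeout:

func withClient(server string, fn func(client pb.GleamExecutorClient) error) error {
    if server == "" {
        return nil
    }

    // give the dial a bounded amount of time to reach a ready state
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    // grpc.WithBlock() makes DialContext wait until the connection is ready
    // (or the context expires) instead of returning immediately
    grpcConection, err := grpc.DialContext(ctx, server, grpc.WithInsecure(), grpc.WithBlock())
    if err != nil {
        return fmt.Errorf("executor dial agent %s: %v", server, err)
    }
    defer grpcConection.Close()

    client := pb.NewGleamExecutorClient(grpcConection)
    return fn(client)
}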