Intel-bigdata / Spark-PMoF

Spark Shuffle Optimization with RDMA+AEP
Apache License 2.0

Client connection and RPMP data server connection failure issue #122

Closed: PHILO-HE closed this issue 3 years ago

PHILO-HE commented 3 years ago

In my deployment with one proxy and one data server, everything works normally until an RPMP client sends a request: the data server periodically sends heartbeats to the proxy as expected. But after the client writes/reads data one or more times (I used the put_and_get test), the data server fails to send heartbeats to the proxy, and subsequent client writes/reads fail. I found that some threads in the proxy exit, which at the very least leaves the data server's heartbeat messages without a response.

The following commit is involved in this bug; please help fix it: "Persist data put job status for future potential job recovery. (#118)"

Eugene-Mark commented 3 years ago

Fixed by https://github.com/Intel-bigdata/Spark-PMoF/commit/84990651dee9d163705609ed817a369826e47fc5.