Closed: goldjay1231 closed 8 years ago
It looks like your reducer didn't get responses, which caused the retry counter to exceed its maximum number of tries. The problem is probably on the server side. Are there any other errors in the system? Do you see all DataNodes up and running?
Thanks for your response.
Yes, Datanodes are up and running.
When I disable UDA, Terasort runs successfully. I use LACP bonding (Mellanox MCX312B-XCCT 10G dual port) with two Mellanox SX1024 switches (MLAG mode enabled). Am I missing something?
Yes, UDA doesn't support bonding.
Thanks,
But according to https://community.mellanox.com/docs/DOC-1531:
"Running UDA (MapReduce acceleration) over a bonded (active/passive) interface will work. However, the current UDA (RDMA RC QP) does not implement any reconnect or fail-over logic to support any bonding events."
Or is it just that UDA doesn't support the LACP (802.3ad) bonding mode?
In general, UDA doesn't support bonding in any mode. Active-backup works only because it basically behaves like a regular interface as long as no reconnect events happen.
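For anyone who wants to confirm which bonding mode a node is actually using, here is a minimal sketch. It assumes a Linux host where the bonding driver exposes its status under /proc/net/bonding/, and that the bond is named bond0 (a hypothetical name, adjust to your setup). The helper names are made up for illustration and are not part of UDA or Hadoop.

```python
import re
from pathlib import Path


def parse_bonding_mode(status_text):
    """Return the value of the 'Bonding Mode:' line from the text of
    /proc/net/bonding/<iface>, or None if the line is not present."""
    match = re.search(r"^Bonding Mode:\s*(.+)$", status_text, re.MULTILINE)
    return match.group(1).strip() if match else None


def is_active_backup(mode):
    """True only if the reported mode is active-backup, the one mode
    described in this thread as behaving like a regular interface."""
    return mode is not None and "active-backup" in mode


if __name__ == "__main__":
    # Assumption: the bond interface is called bond0.
    status_file = Path("/proc/net/bonding/bond0")
    if not status_file.exists():
        print("No bond0 status file found; is bonding configured?")
    else:
        mode = parse_bonding_mode(status_file.read_text())
        print("Bonding mode:", mode)
        if not is_active_backup(mode):
            print("Not active-backup; per this thread, UDA is not expected to work.")
```

Even when this reports active-backup, a fail-over event would still break the RDMA connections, since no reconnect logic is implemented.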
Hi,
When I run a 1 TB Terasort, I get the error message below in the reducer's log. It causes the job to hang.
2016-04-25 11:17:46,904 ERROR [Thread-10] org.apache.hadoop.mapred.ShuffleConsumerPlugin: Operation: IBV_WC_RECV (128). Dev 0x7f0865993690 wr (0x9a3058) flush err. quitting... (DataNet/RDMAClient.cc:155)
Please kindly provide any suggestions, thanks~