Mellanox / UDA

Unstructured Data Accelerator (RDMA) for Hadoop MapReduce

Reducer meets "flush err" #19

Closed goldjay1231 closed 8 years ago

goldjay1231 commented 8 years ago

Hi,

When I run a 1 TB TeraSort, I get the error message below in the reducer's log. It causes the job to hang.

2016-04-25 11:17:46,904 ERROR [Thread-10] org.apache.hadoop.mapred.ShuffleConsumerPlugin: Operation: IBV_WC_RECV (128). Dev 0x7f0865993690 wr (0x9a3058) flush err. quitting... (DataNet/RDMAClient.cc:155)

Please kindly provide any suggestions, thanks~

dinal commented 8 years ago

It looks like your reducer didn't get responses, which caused the retry counter to exceed its maximum number of tries. The problem is probably on the server side. Are there any other errors in the system? Do you see all Datanodes up and running?

goldjay1231 commented 8 years ago

Thanks for your response.

Yes, Datanodes are up and running.

When I disable UDA, TeraSort runs successfully. I use LACP bonding (Mellanox MCX312B-XCCT 10G dual-port) with two Mellanox SX1024 switches (MLAG mode enabled). Am I missing something?

dinal commented 8 years ago

Yes, UDA doesn't support bonding.

goldjay1231 commented 8 years ago

Thanks,

But according to the following link, https://community.mellanox.com/docs/DOC-1531:

Running UDA (MapReduce acceleration) over a bonded (active/passive) interface will work. However, the current UDA (RDMA RC QP) does not implement any reconnect of fail-over logic to support any bonding events.

Or is it just that UDA doesn't support the LACP (802.3ad) bonding mode?

dinal commented 8 years ago

In general, UDA doesn't support bonding (in any of the modes). Active-backup works since it basically behaves like a regular interface as long as no reconnect events happen.
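
For anyone who wants to confirm which bonding mode a node is actually using before enabling UDA, here is a minimal Python sketch. It is not part of UDA. It assumes a Linux host with the kernel bonding driver, which reports status in /proc/net/bonding/<interface>, and it assumes the interface is named bond0. The helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def parse_bonding_mode(status_text: str) -> Optional[str]:
    """Return the value of the 'Bonding Mode:' line, or None if it is absent."""
    for line in status_text.splitlines():
        if line.startswith("Bonding Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def is_active_backup(mode: Optional[str]) -> bool:
    """True only when the reported mode is active-backup."""
    return mode is not None and "active-backup" in mode


def read_bonding_mode(bond: str = "bond0") -> Optional[str]:
    """Read the mode of a bond interface; 'bond0' is an assumed name."""
    path = Path("/proc/net/bonding") / bond
    if not path.exists():
        return None  # no such bond, or the bonding driver is not loaded
    return parse_bonding_mode(path.read_text())


if __name__ == "__main__":
    mode = read_bonding_mode()
    if mode is None:
        print("No bond0 status found; the interface may not be bonded.")
    elif is_active_backup(mode):
        print(f"Mode: {mode}. Behaves like a regular interface until a failover.")
    else:
        print(f"Mode: {mode}. Per this thread, UDA does not support it.")
```

On a node configured like the one described above, the mode line would be expected to report 802.3ad rather than active-backup.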