Kimahriman / hdfs-native


Data pipeline failed without retry #154

Open · zuston opened 2 weeks ago

zuston commented 2 weeks ago

[screenshot: data pipeline failure error]

When using the hdfs-native crate, I encountered a data pipeline failure (see the screenshot above). If this was caused by a problematic datanode, do we need to request a new block location?

Could you help look into this problem? @Kimahriman

zuston commented 2 weeks ago

After digging into the vanilla HDFS code, it looks like it will replace the bad datanode if the pipeline fails.

[screenshot: vanilla HDFS datanode-replacement code]

Kimahriman commented 2 weeks ago

Yeah the write path is the least resilient part of the library right now. I need to take a deeper look to understand exactly how the Java writer handles lost data nodes, and then figure out how I could write a test for something like that.

If you're not already, I'd currently recommend retrying the whole write from scratch if you still have access to all the data you are trying to write.
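
For reference, a retry-from-scratch wrapper might look roughly like the sketch below. This is a minimal sketch, assuming the crate's README-style API (`Client::new`, `create`, `FileWriter::write`/`close`, `delete`) and its `HdfsError` type; the helper name `write_with_retry` and the cleanup step are my own, not part of the crate:

```rust
use bytes::Bytes;
use hdfs_native::{Client, WriteOptions};

/// Hypothetical helper: retry the entire write from scratch on failure.
/// Only viable if the full payload is still available to the caller.
async fn write_with_retry(
    client: &Client,
    path: &str,
    data: Bytes,
    attempts: usize,
) -> Result<(), hdfs_native::HdfsError> {
    let mut last_err = None;
    for _ in 0..attempts {
        // A failed pipeline may leave a partial file behind, so do a
        // best-effort delete before rewriting (assumes Client::delete).
        let _ = client.delete(path, false).await;

        let result = async {
            let mut writer = client.create(path, WriteOptions::default()).await?;
            writer.write(data.clone()).await?;
            writer.close().await
        }
        .await;

        match result {
            Ok(_) => return Ok(()),
            Err(e) => last_err = Some(e),
        }
    }
    // Panics if attempts == 0; callers should pass at least 1.
    Err(last_err.expect("attempts must be > 0"))
}
```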

Kimahriman commented 2 weeks ago

Based on a quick look, it seems like the logic is generally one of:

- Continue writing with the remaining DataNodes after dropping the failed one
- Replace the failed DataNode with a new node

The first one is simpler, and I could look into trying to add that, though I'm still not exactly sure how I would test it. That should make things more resilient than the current behavior, which is simply failing on the first DataNode error (see the sketch at the end of this comment).

The second is a bit more complex and requires copying the already written data to a new replica before continuing, and probably wouldn't be something I could tackle anytime soon.
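
To make the first option concrete, here is a rough sketch of what dropping a failed node could look like inside a writer. All types and names here are hypothetical stand-ins, not the crate's actual internals:

```rust
// Hypothetical shapes, not hdfs-native's real internals.
struct DatanodeInfo; // placeholder for a datanode descriptor

enum PipelineError {
    AllNodesFailed,
}

struct Pipeline {
    nodes: Vec<DatanodeInfo>, // current pipeline, in order
    last_acked_seqno: i64,    // highest packet seqno acked by every node
}

impl Pipeline {
    /// Option 1: drop the failed node and keep writing with the survivors.
    fn handle_failure(&mut self, failed_index: usize) -> Result<(), PipelineError> {
        self.nodes.remove(failed_index);
        if self.nodes.is_empty() {
            // No replicas left to write to; surface the error to the caller.
            return Err(PipelineError::AllNodesFailed);
        }
        // Reconnect to the surviving nodes and resend every packet after
        // last_acked_seqno; packets at or before it are already durable.
        self.reconnect_and_resend()
    }

    fn reconnect_and_resend(&mut self) -> Result<(), PipelineError> {
        // Elided: re-open streams and replay buffered packets.
        Ok(())
    }
}
```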

zuston commented 2 weeks ago

Thanks for your quick reply. I found another error status when appending to a file. Please see this:

[screenshot: error status while appending]

And I think the first solution is good enough for my append case. @Kimahriman

zuston commented 2 weeks ago

> Yeah the write path is the least resilient part of the library right now.

Yes, it looks unstable, especially in a busy HDFS cluster.

If you have any improvement patch, I'm happy to test it. I can almost always reproduce this with my cases.

Kimahriman commented 2 weeks ago

> Thanks for your quick reply. I found another error status when appending to a file. Please see this:

This would be when a data node that the block is replicating to fails. The connection drop would be when the one you are talking to dies, so I think they're effectively the same issue.

zuston commented 2 weeks ago

After a quick look at this article, https://blog.cloudera.com/understanding-hdfs-recovery-processes-part-2/, I think the bad datanode replacement will not trigger a resend of the original data. Once the client finds the bad datanode, it builds a new pipeline through the namenode and then resends the data starting from the last acked packet. @Kimahriman
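
For concreteness, the recovery flow the article describes might look roughly like this from the client's side. The two RPC names (`updateBlockForPipeline`, `updatePipeline`) are the real `ClientProtocol` calls HDFS uses for pipeline recovery; everything else here (types, method shapes) is a hypothetical sketch:

```rust
// Hypothetical stand-ins; only the two namenode RPC names are real HDFS calls.
struct ExtendedBlock {
    generation_stamp: u64,
}
struct DatanodeInfo;
struct RecoveryError;

struct NamenodeRpc;
impl NamenodeRpc {
    // Wraps ClientProtocol.updateBlockForPipeline: returns a new generation stamp.
    async fn update_block_for_pipeline(&self, _b: &ExtendedBlock) -> Result<u64, RecoveryError> {
        Ok(1)
    }
    // Wraps ClientProtocol.updatePipeline: commits the new pipeline to the namenode.
    async fn update_pipeline(
        &self,
        _b: &ExtendedBlock,
        _nodes: &[DatanodeInfo],
    ) -> Result<(), RecoveryError> {
        Ok(())
    }
}

async fn recover_pipeline(
    namenode: &NamenodeRpc,
    block: &mut ExtendedBlock,
    nodes: &mut Vec<DatanodeInfo>,
    failed_index: usize,
) -> Result<(), RecoveryError> {
    // 1. Drop the failed node from the pipeline.
    nodes.remove(failed_index);

    // 2. Get a new generation stamp so stale replicas left on the failed
    //    node can be recognized and discarded later.
    block.generation_stamp = namenode.update_block_for_pipeline(block).await?;

    // 3. Re-open streams to the surviving nodes and resend only the packets
    //    after the last acked one; earlier data is already durable and is
    //    NOT resent (elided here).

    // 4. Commit the new pipeline (new generation stamp + node list).
    namenode.update_pipeline(block, nodes.as_slice()).await?;
    Ok(())
}
```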

Kimahriman commented 2 weeks ago

Very helpful article! I hoped that might be the case for node replacement but then found https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1550 while tracing through the code. The data is re-replicated by the client on node replacement. It can only keep going from where it left off if the node isn't replaced.

zuston commented 2 weeks ago

Thanks for the code reference. Let me take a deeper look.

zuston commented 2 weeks ago

> The data is re-replicated by the client on node replacement.

From digging into DataStreamer.java, the replica transfer goes from the healthy datanode to the replacement datanodes. The client only triggers the creation of the recovery pipeline to backfill the data from the healthy node to the replacement node.

BTW, if we just ignore the bad nodes and do nothing, missing blocks can result. So if we want a resilient write process, datanode replacement may need to be supported.

zuston commented 2 weeks ago

Attaching the Hadoop code reference: https://github.com/apache/hadoop/blob/96572764921706b1fecaf064490457d36d73ea6e/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java#L3593
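
That reference is the datanode side of the backfill. Sketching the client's role under the same caveats as above: the client sends a single transfer request to one healthy datanode, which then streams its partial replica to the replacement node (in HDFS this is the `TRANSFER_BLOCK` op of the data transfer protocol); the client never resends the bytes itself. Everything below is a hypothetical shape:

```rust
// Hypothetical shapes; only the TRANSFER_BLOCK op name is real HDFS protocol.
struct ExtendedBlock;
struct DatanodeInfo;
struct TransferError;

/// The client's role in node replacement: ask one healthy datanode to copy
/// its partial replica to the new target(s), then continue the pipeline.
async fn trigger_backfill(
    source: &DatanodeInfo,    // a surviving node holding the partial replica
    targets: &[DatanodeInfo], // the replacement node(s)
    block: &ExtendedBlock,
) -> Result<(), TransferError> {
    // Send a TRANSFER_BLOCK request to `source`; the datanodes move the
    // data among themselves, and the client only waits for the ack.
    send_transfer_block_op(source, targets, block).await
}

async fn send_transfer_block_op(
    _source: &DatanodeInfo,
    _targets: &[DatanodeInfo],
    _block: &ExtendedBlock,
) -> Result<(), TransferError> {
    // Elided: open a data-transfer stream and write the op + proto message.
    Ok(())
}
```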

Kimahriman commented 1 week ago

> From digging into DataStreamer.java, the replica transfer goes from the healthy datanode to the replacement datanodes. The client only triggers the creation of the recovery pipeline to backfill the data from the healthy node to the replacement node.

Yeah, it looks like it just makes an RPC call to trigger the replication, so there's only a little bit of extra complexity there. I think that would still be follow-on work after just getting the write to continue with fewer DataNodes on failures.

> BTW, if we just ignore the bad nodes and do nothing, missing blocks can result. So if we want a resilient write process, datanode replacement may need to be supported.

My understanding is that even if a single node (or multiple, but not all) in a pipeline is lost, as long as you successfully write the block to at least one DataNode, HDFS will re-replicate it after the fact when it sees the block is under-replicated. I think the main reason for replicating as part of the pipeline recovery is just to make that a little more resilient: if all but one of the DataNodes in your pipeline fail, you "successfully" finish your write, but then that one remaining DataNode immediately dies, your write was successful but now you are missing that block.

zuston commented 1 week ago

> My understanding is that even if a single node (or multiple, but not all) in a pipeline is lost, as long as you successfully write the block to at least one DataNode, HDFS will re-replicate it after the fact when it sees the block is under-replicated.

Got it. If so, simply ignoring the bad datanode in the pipeline is acceptable. We just need to recognize the bad node from the ack response's reply and flag fields and then drop it from the pipeline.
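
For reference, the per-node statuses do come back in the pipeline ack: HDFS's datatransfer.proto defines `PipelineAckProto` with a `seqno` plus one `Status` per datanode in `reply` (and a `flag` field). A minimal sketch of picking out the bad node, with hypothetical Rust stand-ins for the generated types:

```rust
// Field names follow HDFS's PipelineAckProto; the Rust types here are
// hypothetical stand-ins for the prost-generated ones.
const STATUS_SUCCESS: i32 = 0; // Status::Success in the HDFS proto

struct PipelineAck {
    seqno: i64,      // packet this ack corresponds to
    reply: Vec<i32>, // one Status per datanode, in pipeline order
}

/// Returns the pipeline index of the first failed datanode, if any.
fn first_bad_node(ack: &PipelineAck) -> Option<usize> {
    ack.reply.iter().position(|&status| status != STATUS_SUCCESS)
}
```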

zuston commented 1 week ago

[screenshot: another possible error message]

zuston commented 1 week ago

[screenshot: an additional error message]

Kimahriman commented 1 week ago

Hmm, that's very odd; it might be unrelated to writing data. What version of HDFS are you running? Also, if you can replicate those, having some debug logs would be helpful to see what's going on.

zuston commented 1 week ago

> Hmm, that's very odd; it might be unrelated to writing data. What version of HDFS are you running?

HDFS version 3.2.2.

> Also, if you can replicate those, having some debug logs would be helpful to see what's going on.

OK, I will attach more detailed logs from the namenode or datanodes if I can get them.