Kimahriman / hdfs-native

Apache License 2.0

bug: Channel closed while waiting for next ack #49

Closed: zuston closed this issue 10 months ago

zuston commented 10 months ago

[screenshot]

zuston commented 10 months ago

And when using append, the error log is as follows:

[screenshot of the error log]

zuston commented 10 months ago

And the code is here: https://github.com/zuston/incubator-uniffle/blob/hdfs-native/rust/experimental/server/src/store/hdfs.rs

Kimahriman commented 10 months ago

Thanks for this. The write path isn't very resilient to failures yet and could definitely use some improvements. I wanted to get a bare-minimum write path working first, but there are all kinds of situations the DFSOutputStream tries to handle. Some info that would be useful:

zuston commented 10 months ago

> Does this happen deterministically or was this a random error you ran into?

Deterministically.

> Any stats about the size of the file you're appending to or how many replicas.

Append to an empty file, with 3 replicas.

We use Hadoop 3.2.1 with Kerberos, NameNode HA, and RBF (Router-Based Federation) enabled.

Kimahriman commented 10 months ago

> Append to an empty file, with 3 replicas.

What do you mean by appending to an empty file? Does the file already exist with zero length when you open it for appending, or does the file not exist yet?

zuston commented 10 months ago

Here are the steps:

  1. Create an empty file: client.create(file_path, WriteOptions::default()).await?.close().await?;
  2. Then append the data to this file: self.client.append(file_path).await?.write(data).await?

Kimahriman commented 10 months ago

Can you run with Rust debug logging enabled and share the output? It should include all RPC messages exchanged with the NameNode. That exact sequence is covered by the integration tests, so I'm not sure why the lease is still being held.

Kimahriman commented 10 months ago

Actually, I took another look at your code. Is the function at https://github.com/zuston/incubator-uniffle/blob/hdfs-native/rust/experimental/server/src/store/hdfs.rs#L291 missing a close after appending? That could cause the issue you're seeing if you're doing multiple appends.
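To illustrate why a missing close can break later appends, here is a toy, std-only model of HDFS's single-writer lease. It is not the hdfs-native API: MockNameNode, MockWriter, append, and close are hypothetical stand-ins. In this model, opening a file for append takes the file's lease and only close releases it, so the next append to the same file is rejected while an earlier writer is still open (real HDFS leases also expire on their own after a timeout, which the sketch ignores).

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the NameNode's lease table (not the hdfs-native API).
#[derive(Clone, Default)]
struct MockNameNode {
    leases: Arc<Mutex<HashSet<String>>>,
}

/// Hypothetical writer: holds the lease on `path` until `close` is called.
struct MockWriter {
    path: String,
    namenode: MockNameNode,
}

impl MockNameNode {
    /// Opening for append acquires the file's lease; fails if another writer holds it.
    fn append(&self, path: &str) -> Result<MockWriter, String> {
        let mut leases = self.leases.lock().unwrap();
        if !leases.insert(path.to_string()) {
            return Err(format!("{path} is already open for writing (lease held)"));
        }
        Ok(MockWriter { path: path.to_string(), namenode: self.clone() })
    }
}

impl MockWriter {
    /// Closing releases the lease so the file can be appended to again.
    fn close(self) {
        self.namenode.leases.lock().unwrap().remove(&self.path);
    }
}
```

Dropping a MockWriter without calling close leaves the lease in place in this model, which mirrors the failure reported above when each append's writer is never closed.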

zuston commented 10 months ago

> https://github.com/zuston/incubator-uniffle/blob/hdfs-native/rust/experimental/server/src/store/hdfs.rs#L291

Oh, yes. I don't close the writer there.

Do you mean every append operation should be closed? And is the client thread-safe for multiple appends at the same time?

Kimahriman commented 10 months ago

> Do you mean every append operation should be closed? And is the client thread-safe for multiple appends at the same time?

append opens a writer in append mode. If you keep the writer around, you can write to it multiple times, but you need to close it when you're done. Currently there's no way to guarantee that a reader will see the data until you close it.
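A minimal sketch of that contract, assuming a hypothetical in-memory MockFile and MockWriter rather than the real hdfs-native types: the writer is opened once, written to several times, and closed when done. As a simplification, the mock makes data visible only at close; the comment above only says that visibility before close isn't guaranteed.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical file whose contents readers can observe.
#[derive(Clone, Default)]
struct MockFile {
    visible: Arc<Mutex<Vec<u8>>>,
}

impl MockFile {
    /// What a reader would currently see.
    fn read(&self) -> Vec<u8> {
        self.visible.lock().unwrap().clone()
    }

    /// Hypothetical append: returns one writer to be reused for many writes.
    fn append(&self) -> MockWriter {
        MockWriter { file: self.clone(), pending: Vec::new() }
    }
}

/// Hypothetical writer: buffers writes; in this sketch only `close` publishes them.
struct MockWriter {
    file: MockFile,
    pending: Vec<u8>,
}

impl MockWriter {
    fn write(&mut self, data: &[u8]) -> usize {
        self.pending.extend_from_slice(data);
        data.len()
    }

    fn close(self) {
        self.file.visible.lock().unwrap().extend_from_slice(&self.pending);
    }
}
```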

I'm not sure exactly what you're asking regarding thread safety. You can't open the same file for append multiple times, but you could share a single writer across threads behind a mutex.
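A sketch of sharing one writer across threads behind a mutex. To stay self-contained it uses std threads and a hypothetical MockWriter; with the async hdfs-native writer, an async-aware lock such as tokio::sync::Mutex would likely be the better fit, since the lock would be held across an .await (an assumption, not something stated in the thread).

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical writer; `write` needs `&mut self`, so concurrent callers need a lock.
#[derive(Default)]
struct MockWriter {
    data: Vec<u8>,
    closed: bool,
}

impl MockWriter {
    fn write(&mut self, buf: &[u8]) -> usize {
        assert!(!self.closed, "write after close");
        self.data.extend_from_slice(buf);
        buf.len()
    }

    fn close(&mut self) {
        self.closed = true;
    }
}

/// Spawns `n_threads` workers that each write one record through the shared writer,
/// then closes the writer once, after every worker has finished.
fn write_concurrently(n_threads: usize) -> Vec<u8> {
    let writer = Arc::new(Mutex::new(MockWriter::default()));
    let handles: Vec<_> = (0..n_threads)
        .map(|i| {
            let writer = Arc::clone(&writer);
            thread::spawn(move || {
                // The lock serializes writes, so records never interleave.
                let record = format!("record-{i};");
                writer.lock().unwrap().write(record.as_bytes());
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let mut guard = writer.lock().unwrap();
    guard.close();
    guard.data.clone()
}
```

The order of records depends on thread scheduling, but each record arrives intact and the single close happens only after all writers are done.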

zuston commented 10 months ago

> append opens a writer in append mode. If you keep the writer around, you can write to it multiple times, but you need to close it when you're done. Currently there's no way to guarantee that a reader will see the data until you close it.

Got it.

> I'm not sure exactly what you're asking regarding thread safety. You can't open the same file for append multiple times, but you could share a single writer across threads behind a mutex.

So the client can be shared across multiple threads to append to different files at the same time, right?

Kimahriman commented 10 months ago

> So the client can be shared across multiple threads to append to different files at the same time, right?

Ah, yes, that is the case!
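A sketch of that pattern, assuming a hypothetical MockClient with an append_and_close helper (not an hdfs-native method): one client is wrapped in an Arc and shared by several threads, and each thread appends to a different path, so no two writers ever compete for the same file's lease. std threads stand in for async tasks here.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical client: shared via `Arc`, tracks open writers (leases) per path.
#[derive(Default)]
struct MockClient {
    leases: Mutex<HashSet<String>>,
    files: Mutex<HashMap<String, Vec<u8>>>,
}

impl MockClient {
    /// Open `path` for append, write `data`, then close (release the lease).
    fn append_and_close(&self, path: &str, data: &[u8]) -> Result<(), String> {
        if !self.leases.lock().unwrap().insert(path.to_string()) {
            return Err(format!("{path} already has an open writer"));
        }
        self.files
            .lock()
            .unwrap()
            .entry(path.to_string())
            .or_default()
            .extend_from_slice(data);
        self.leases.lock().unwrap().remove(path);
        Ok(())
    }
}

/// One shared client; each thread appends to its own file.
fn append_in_parallel(client: Arc<MockClient>, n: usize) -> Vec<Result<(), String>> {
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let client = Arc::clone(&client);
            thread::spawn(move || client.append_and_close(&format!("/data/part-{i}"), b"payload"))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```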

zuston commented 10 months ago

Thanks @Kimahriman! Let me give it a try.

zuston commented 10 months ago

Now it works.

Kimahriman commented 10 months ago

Awesome! Maybe I'll add a warning if the file writer is dropped without being closed.
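One way such a warning could look, sketched with a hypothetical MockWriter that tracks whether close was called and warns from its Drop implementation. The UNCLOSED_DROPS counter exists only so the behavior can be checked; it is not part of any real API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts unclosed drops so the behavior is observable in this sketch.
static UNCLOSED_DROPS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical writer that remembers whether `close` was called.
struct MockWriter {
    path: String,
    closed: bool,
}

impl MockWriter {
    fn new(path: &str) -> Self {
        MockWriter { path: path.to_string(), closed: false }
    }

    fn close(mut self) {
        // A real writer would flush and release the lease here (omitted).
        self.closed = true;
        // `self` is dropped at the end of this call; `closed == true` suppresses the warning.
    }
}

impl Drop for MockWriter {
    fn drop(&mut self) {
        if !self.closed {
            UNCLOSED_DROPS.fetch_add(1, Ordering::SeqCst);
            eprintln!("warning: writer for {} dropped without calling close()", self.path);
        }
    }
}
```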

zuston commented 10 months ago

> Awesome! Maybe I'll add a warning if the file writer is dropped without being closed.

Maybe it should close implicitly in its Drop implementation.

Kimahriman commented 10 months ago

>> Awesome! Maybe I'll add a warning if the file writer is dropped without being closed.

> Maybe it should close implicitly in its Drop implementation.

Yeah, that would be nice, but it's tricky since close is async. I think the only way would be to spawn a fire-and-forget task without error checking?
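A sketch of the fire-and-forget idea under stated assumptions: Drop can't .await, so if the writer wasn't closed, its state is moved out and handed to a background task that closes it and discards any error. Here std::thread::spawn stands in for spawning on an async runtime (with tokio that would be tokio::spawn, which panics if no runtime is running when the writer is dropped); Inner, close_inner, and MockWriter are hypothetical.

```rust
use std::sync::mpsc::Sender;
use std::thread;

/// State needed to finish a close (hypothetical; a real writer would hold
/// buffered data, the lease, a NameNode connection, etc.).
struct Inner {
    path: String,
    // Lets this sketch report which files were closed.
    done: Sender<String>,
}

/// Hypothetical stand-in for the async close; it can fail, like real network I/O.
fn close_inner(inner: Inner) -> Result<(), String> {
    inner.done.send(inner.path).map_err(|e| e.to_string())
}

struct MockWriter {
    // `None` once the writer has been closed explicitly.
    inner: Option<Inner>,
}

impl MockWriter {
    /// Explicit close: the caller sees any error.
    fn close(mut self) -> Result<(), String> {
        close_inner(self.inner.take().expect("writer already closed"))
    }
}

impl Drop for MockWriter {
    fn drop(&mut self) {
        if let Some(inner) = self.inner.take() {
            // Drop cannot wait on the close, so hand it off and forget it.
            // Any error is discarded: nobody learns whether the close failed.
            thread::spawn(move || {
                let _ = close_inner(inner);
            });
        }
    }
}
```

The trade-off shows up directly in the code: an explicit close() returns its Result to the caller, while the Drop path silently swallows it.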