Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0
6.79k stars 2.93k forks

Client sends double traffic to workers when set alluxio.user.file.writetype.default=CACHE_THROUGH #18449

Open Haoning-Sun opened 9 months ago

Haoning-Sun commented 9 months ago

Alluxio Version: Alluxio-2.9

Describe the bug

The client sends double traffic to workers when alluxio.user.file.writetype.default is set to CACHE_THROUGH. The data is sent once to be written to the Alluxio worker block and a second time to be written to UFS, as shown below.

client -> worker block
client -> worker -> ufs

Instead, the data should be sent once, with the worker writing both the block and UFS, as shown below.

client -> worker/worker block -> ufs

To Reproduce

Expected behavior

The worker should decide whether to write to UFS when it first receives the data, rather than having the client send the same data to the worker a second time for the UFS write.
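To make the requested behavior concrete, here is a minimal sketch of a worker that receives each chunk once and writes it to both destinations. It is not Alluxio code: `WorkerTeeSketch`, `receive`, and the in-memory streams standing in for the block store and UFS are all hypothetical names chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical sketch of the requested behavior; not Alluxio code. */
class WorkerTeeSketch {
    // In-memory stand-ins for the worker's block storage and the under file system.
    private final ByteArrayOutputStream blockStore = new ByteArrayOutputStream();
    private final ByteArrayOutputStream ufs = new ByteArrayOutputStream();
    private final boolean writeThrough;
    private long bytesReceived;

    WorkerTeeSketch(boolean writeThrough) {
        this.writeThrough = writeThrough;
    }

    /** Handles one chunk sent by the client; the chunk crosses the network only once. */
    void receive(byte[] chunk) {
        bytesReceived += chunk.length;
        blockStore.write(chunk, 0, chunk.length);
        if (writeThrough) {
            // The worker, not the client, forwards the same bytes to UFS.
            ufs.write(chunk, 0, chunk.length);
        }
    }

    long bytesReceived() { return bytesReceived; }
    byte[] blockData() { return blockStore.toByteArray(); }
    byte[] ufsData() { return ufs.toByteArray(); }

    public static void main(String[] args) {
        WorkerTeeSketch worker = new WorkerTeeSketch(true);
        byte[] payload = {1, 2, 3, 4};
        worker.receive(payload);
        System.out.println(worker.bytesReceived());                      // 4
        System.out.println(Arrays.equals(worker.ufsData(), payload));    // true
    }
}
```

With this shape the client-to-worker traffic equals the payload size, while both the block and UFS end up with a full copy.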

Haoning-Sun commented 9 months ago

When the client writes data to a worker, mCurrentBlockOutStream first writes it to the worker's Alluxio block, and then mUnderStorageOutputStream sends the same data to the worker again so that it can be written to UFS. On the worker side, the two writes are handled by BlockWriteHandler and UfsFileWriteHandler, respectively.
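The effect on traffic can be illustrated with a small sketch that counts the bytes passing through two client-side streams. This is only a model under stated assumptions: `DualWriteSketch` and `CountingStream` are hypothetical, and the two counting streams merely stand in for mCurrentBlockOutStream and mUnderStorageOutputStream; the real classes are not reproduced here.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;

/** Hypothetical model of the client-side double write; not Alluxio code. */
class DualWriteSketch {
    /** Counts every byte written, standing in for client-to-worker network traffic. */
    static final class CountingStream extends OutputStream {
        private final ByteArrayOutputStream target = new ByteArrayOutputStream();
        long bytes;

        @Override
        public void write(int b) {
            target.write(b);
            bytes++;
        }

        @Override
        public void write(byte[] buf, int off, int len) {
            target.write(buf, off, len);
            bytes += len;
        }
    }

    /** Returns the total bytes sent to the worker when a payload goes through both streams. */
    static long bytesSentToWorker(byte[] payload) {
        CountingStream blockStream = new CountingStream(); // stands in for mCurrentBlockOutStream
        CountingStream ufsStream = new CountingStream();   // stands in for mUnderStorageOutputStream
        blockStream.write(payload, 0, payload.length);     // first copy: worker block
        ufsStream.write(payload, 0, payload.length);       // second copy: worker -> UFS
        return blockStream.bytes + ufsStream.bytes;
    }

    public static void main(String[] args) {
        System.out.println(bytesSentToWorker(new byte[1024])); // 2048
    }
}
```

In this model a 1024-byte payload results in 2048 bytes sent from the client, which is the doubling described in this issue.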

Haoning-Sun commented 9 months ago

The reason two copies of the data are sent to the worker is probably as follows. The Alluxio client writes to workers block by block, and each block corresponds to its own stream, whereas an HDFS file can only be written through a single stream, so the two writes cannot be handled together. In addition, the blocks of a file may be distributed across different workers, so the HDFS file cannot be written through the same client that writes the blocks. That is why the current implementation exists.
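A small sketch may help show why per-block streams do not map onto one UFS file stream. It splits a file into blocks and assigns each block to a worker. The round-robin assignment is purely an assumption for illustration (the issue does not say how Alluxio places blocks), and `BlockPlacementSketch` and `placeBlocks` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of block distribution; the placement policy is assumed. */
class BlockPlacementSketch {
    /** Returns, for each block of the file in order, the index of the worker that holds it. */
    static List<Integer> placeBlocks(long fileSize, long blockSize, int workerCount) {
        List<Integer> workers = new ArrayList<>();
        long blockCount = (fileSize + blockSize - 1) / blockSize; // ceiling division
        for (long i = 0; i < blockCount; i++) {
            workers.add((int) (i % workerCount)); // assumed round-robin placement
        }
        return workers;
    }

    public static void main(String[] args) {
        // A 300-byte file with 128-byte blocks on 2 workers needs 3 blocks.
        System.out.println(placeBlocks(300, 128, 2)); // [0, 1, 0]
    }
}
```

Here no single worker sees all of the file's bytes through its block streams, while the UFS file still has to be written in order through one stream.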