Closed: yjshen closed this 10 months ago
It would be better to add a consumer that continuously consumes the minimum-indexed part from the cache and appends it to the HDFS file, reducing memory pressure.
To achieve this, we can track a current minimum part index in the HdfsMultiPartUpload. When put_multipart_part is invoked, we check whether the incoming part matches the minimum part index; if so, it notifies the consumer. The consumer then consumes the minimum part stored in the cache and advances the current minimum part index, repeating until there is no cached part for the current minimum part index.
Addressing the review comments in the other PR (#17) will close this issue.
Came across the pull request https://github.com/apache/arrow-datafusion/pull/6987, which enables writing data through the Object Store API with AsyncWriter.
We can support writing directly to HDFS once we add support for the put_multipart and abort_multipart APIs.
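As a rough illustration of the multipart semantics those two APIs imply, here is a toy in-memory mock (not the real object_store trait; all names are hypothetical): put_multipart opens an upload, complete commits it to the store, and abort_multipart discards a half-finished upload so no partial file is left behind.

```rust
use std::collections::HashMap;

type MultipartId = u64;

/// Toy in-memory store mocking the two calls the issue says are needed.
struct ToyStore {
    next_id: MultipartId,
    in_progress: HashMap<MultipartId, Vec<u8>>, // buffered, uncommitted uploads
    committed: HashMap<String, Vec<u8>>,        // "files" visible in the store
}

impl ToyStore {
    fn new() -> Self {
        Self { next_id: 0, in_progress: HashMap::new(), committed: HashMap::new() }
    }

    /// Begin a multipart upload; the id ties later parts to this upload.
    fn put_multipart(&mut self) -> MultipartId {
        let id = self.next_id;
        self.next_id += 1;
        self.in_progress.insert(id, Vec::new());
        id
    }

    /// Append a part to an in-progress upload.
    fn put_part(&mut self, id: MultipartId, data: &[u8]) {
        self.in_progress.get_mut(&id).expect("unknown upload").extend_from_slice(data);
    }

    /// Commit the upload: the data becomes a visible file.
    fn complete(&mut self, id: MultipartId, path: &str) {
        let data = self.in_progress.remove(&id).expect("unknown upload");
        self.committed.insert(path.to_string(), data);
    }

    /// Drop a half-finished upload; nothing is committed.
    fn abort_multipart(&mut self, id: MultipartId) {
        self.in_progress.remove(&id);
    }
}

fn main() {
    let mut store = ToyStore::new();

    // An aborted upload leaves no trace in the store.
    let id = store.put_multipart();
    store.put_part(id, b"partial");
    store.abort_multipart(id);
    assert!(store.committed.is_empty() && store.in_progress.is_empty());

    // A completed upload becomes a readable file.
    let id = store.put_multipart();
    store.put_part(id, b"hello");
    store.complete(id, "/data/file.parquet");
    assert_eq!(store.committed["/data/file.parquet"], b"hello");
}
```

The key contract a real HDFS implementation would have to honor is the abort path: readers must never observe a partially written file.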