datafusion-contrib / datafusion-objectstore-hdfs

HDFS based on Java implementation as a remote ObjectStore for DataFusion
Apache License 2.0

feat: add support for AsyncWrite #16

Closed yjshen closed 10 months ago

yjshen commented 11 months ago

I came across the pull request https://github.com/apache/arrow-datafusion/pull/6987, which enables writing data through the Object Store API using AsyncWrite.

We can support writing directly to HDFS once we add support for the put_multipart and abort_multipart APIs.

yahoNanJing commented 11 months ago

To reduce memory pressure, it would be better to add a consumer that continuously takes the part with the minimum index from the cache and appends it to the HDFS file, rather than holding every part in memory until the upload completes.

To achieve this, we can track the current minimum part index in HdfsMultiPartUpload, i.e. the index of the next part that must be appended. When put_multipart_part is invoked, we check whether the incoming part is that minimum part and, if so, notify the consumer. The consumer takes the minimum part from the cache, appends it, and advances the current minimum part index, repeating until there is no cached part for the current minimum part index.
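
A minimal sketch of that idea, to make the ordering logic concrete. It is not the code from this PR. `OrderedPartWriter`, `put_part`, `next_part`, and `cached` are hypothetical names. A generic `std::io::Write` sink stands in for the append-only HDFS file. Part indices are assumed to start at 0. The consumer is folded into `put_part` as a synchronous drain loop, whereas the proposal above describes a separately notified consumer.

```rust
use std::collections::BTreeMap;
use std::io::{self, Write};

/// Hypothetical stand-in for the in-order flushing state of a multipart upload.
/// `W` stands in for an append-only HDFS file handle (an assumption of this sketch).
pub struct OrderedPartWriter<W: Write> {
    /// The "current minimum part index": the next part that must be appended.
    next_part: usize,
    /// Parts that arrived early and are waiting for their predecessors.
    cached: BTreeMap<usize, Vec<u8>>,
    sink: W,
}

impl<W: Write> OrderedPartWriter<W> {
    pub fn new(sink: W) -> Self {
        Self { next_part: 0, cached: BTreeMap::new(), sink }
    }

    /// Accept one part, then append every consecutive cached part starting at
    /// the current minimum part index. Returns how many parts were appended.
    pub fn put_part(&mut self, part_idx: usize, data: Vec<u8>) -> io::Result<usize> {
        if part_idx < self.next_part || self.cached.contains_key(&part_idx) {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                format!("part {} was already received", part_idx),
            ));
        }
        self.cached.insert(part_idx, data);

        // Drain loop: stops at the first gap, so later parts stay cached.
        let mut flushed = 0;
        while let Some(buf) = self.cached.remove(&self.next_part) {
            self.sink.write_all(&buf)?;
            self.next_part += 1;
            flushed += 1;
        }
        Ok(flushed)
    }

    /// Number of parts still held in memory.
    pub fn cached_parts(&self) -> usize {
        self.cached.len()
    }

    /// Read access to the underlying sink, mainly for inspection.
    pub fn sink(&self) -> &W {
        &self.sink
    }
}
```

With this shape, memory holds only the parts that arrived ahead of a gap. If part 1 arrives before part 0, it stays cached until part 0 arrives, and then both are appended in one pass. A real implementation would also need to decide what to do with a part whose append fails partway through, which this sketch does not handle.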

yjshen commented 10 months ago

The review comments are being addressed in the other PR, #17, so I'm closing this one.