drivendataorg / cloudpathlib

Python pathlib-style classes for cloud storage services such as Amazon S3, Azure Blob Storage, and Google Cloud Storage.
https://cloudpathlib.drivendata.org
MIT License

Will Cloudpathlib support HDFS path? #394

Open daviddwlee84 opened 8 months ago

daviddwlee84 commented 8 months ago

I found that cloudpathlib currently only supports the prefixes ['az://', 's3://', 'gs://']. Are there any plans to support HDFS paths (hdfs://) in the future?
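
For example (illustrative only; the unsupported prefix raises an error, though I may be misremembering the exact exception):

from cloudpathlib import CloudPath

CloudPath("s3://bucket/key")            # works: dispatches to S3Path
CloudPath("hdfs://namenode/some/path")  # fails: no client is registered for the hdfs:// prefix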

pjbull commented 8 months ago

Hi @daviddwlee84, we're open to supporting HDFS as a provider. None of the core developers use it regularly, so it would be great to have it from a contributor.

A few questions:

- What Python library would be the best way to talk to HDFS?
- Is there a reasonable way to stand up an HDFS instance (e.g., in a container) for live tests and CI?

Otherwise, implementation will mean creating an HdfsClient and HdfsPath like the existing providers, plus a test rig, a mocked backend for unit testing, and any provider-specific tests.
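
Roughly, the skeleton would mirror the existing providers, something like the sketch below (decorator and base class names are from memory, so check them against the current code):

from cloudpathlib.client import Client, register_client_class
from cloudpathlib.cloudpath import CloudPath, register_path_class


@register_path_class("hdfs")
class HdfsPath(CloudPath):
    cloud_prefix = "hdfs://"

    # ...any HDFS-specific path helpers/properties go here


@register_client_class("hdfs")
class HdfsClient(Client):
    def __init__(self, host="localhost", port=8020, **kwargs):
        # hold whatever HDFS filesystem handle the chosen library provides
        # (e.g., pyarrow.fs.HadoopFileSystem), plus the usual cache options
        super().__init__(**kwargs)

    # ...then implement the abstract upload/download/list/exists/remove
    # methods required by the Client base class, backed by that filesystem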

daviddwlee84 commented 8 months ago

Thanks @pjbull, I see.

For the first question, as far as I know, pyarrow.fs.HadoopFileSystem might be a good choice. (Some other libraries are just wrappers around the Hadoop CLI, which requires a complex environment setup and version matching.)

Here is an example of how fsspec handles this.
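
Basic usage is roughly like this (the host/port and paths below are just placeholders; it needs libhdfs and a Hadoop install with CLASSPATH set up):

from pyarrow import fs

# Connect to the namenode (requires libhdfs and a configured Hadoop environment)
hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

# Write then read back a file
with hdfs.open_output_stream("/tmp/example.txt") as f:
    f.write(b"hello from pyarrow")

with hdfs.open_input_stream("/tmp/example.txt") as f:
    print(f.read())

# List a directory
print(hdfs.get_file_info(fs.FileSelector("/tmp")))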

For the second question, I haven't run Hadoop in a container before. I found big-data-europe/docker-hadoop (Apache Hadoop Docker image), which might be usable, although the project is not very active.

For a single-node deployment, this requires matching Java and Hadoop versions, plus a minimal configuration in $HADOOP_HOME/etc/hadoop/hdfs-site.xml like:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///nvme/HDFS/HadoopName</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///nvme/HDFS/HadoopData</value>
    </property>
</configuration>

Then you can start it with $HADOOP_HOME/sbin/start-dfs.sh. I think these installation steps could be done in a Dockerfile fairly easily.
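
A rough sketch of what such a Dockerfile might look like (untested; the base image, Hadoop version, download URL, and the extra core-site.xml / namenode-format steps are my assumptions):

FROM eclipse-temurin:8-jdk

ENV HADOOP_VERSION=3.3.6
ENV HADOOP_HOME=/opt/hadoop
ENV PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# Hadoop 3 typically refuses to start daemons as root unless these are set
ENV HDFS_NAMENODE_USER=root HDFS_DATANODE_USER=root

# Download and unpack Hadoop (mirror URL is a guess; older releases move to archive.apache.org)
ADD https://downloads.apache.org/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz /tmp/hadoop.tar.gz
RUN tar -xzf /tmp/hadoop.tar.gz -C /opt \
    && mv /opt/hadoop-${HADOOP_VERSION} ${HADOOP_HOME} \
    && rm /tmp/hadoop.tar.gz

# Minimal single-node config: the hdfs-site.xml above plus a core-site.xml
# that sets fs.defaultFS (e.g. hdfs://localhost:9000)
COPY hdfs-site.xml core-site.xml ${HADOOP_HOME}/etc/hadoop/

# Format the namenode once at build time
RUN hdfs namenode -format -force

# start-dfs.sh expects passwordless SSH to localhost; starting the daemons
# directly avoids that inside the container
CMD hdfs --daemon start namenode \
    && hdfs --daemon start datanode \
    && tail -f ${HADOOP_HOME}/logs/*.log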