daviddwlee84 opened 8 months ago
Hi @daviddwlee84, we're open to supporting HDFS as a provider. None of the core developers use it regularly, so it would be great to have it from a contributor.
A few questions:
- Which Python library would be the best foundation for talking to HDFS?
- Is there a good way to run HDFS in a container for local testing?
Otherwise, the implementation would mean creating the HdfsClient and HdfsPath classes as for the existing providers, along with a test rig, a mocked backend for unit testing, and any provider-specific tests.
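To make the "mocked backend for unit testing" part concrete, here is a minimal in-memory sketch of what such a mock might look like. All names here (MockHdfsBackend and its methods) are hypothetical illustrations, not cloudpathlib's actual test interfaces:

```python
# Hypothetical sketch of an in-memory mock HDFS backend for unit tests.
# Names are illustrative, not cloudpathlib's real test interfaces.

class MockHdfsBackend:
    """Fake HDFS that stores file contents in a dict keyed by path."""

    def __init__(self):
        self._files = {}  # path -> bytes

    def put(self, path: str, data: bytes) -> None:
        self._files[path] = data

    def get(self, path: str) -> bytes:
        return self._files[path]

    def exists(self, path: str) -> bool:
        return path in self._files

    def delete(self, path: str) -> None:
        self._files.pop(path, None)

    def ls(self, prefix: str):
        return sorted(p for p in self._files if p.startswith(prefix))


backend = MockHdfsBackend()
backend.put("/data/a.txt", b"hello")
backend.put("/data/b.txt", b"world")
print(backend.exists("/data/a.txt"))  # True
print(backend.ls("/data"))            # ['/data/a.txt', '/data/b.txt']
```

A dict-backed fake like this lets provider tests run quickly without any Hadoop installation, which is how unit tests for the other providers can stay hermetic.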
Thanks @pjbull, I see.
For the first question, as far as I know, pyarrow.fs.HadoopFileSystem might be a good choice. (Some other libraries just wrap the Hadoop CLI, which requires complex environment setup and version matching.) Here is an example of how fsspec wraps it.
For the second question, I haven't used Hadoop in a container before. I found big-data-europe/docker-hadoop (an Apache Hadoop Docker image) that might be usable, although it is not very active.
For a single-node deployment, this requires matching Java and Hadoop versions, with minimal configuration in $HADOOP_HOME/etc/hadoop/hdfs-site.xml, like:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///nvme/HDFS/HadoopName</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///nvme/HDFS/HadoopData</value>
  </property>
</configuration>
Then HDFS can be started with $HADOOP_HOME/sbin/start-dfs.sh.
I think these installation steps can be done in a Dockerfile easily.
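As a rough illustration of those steps, a single-node setup might be sketched in a Dockerfile along these lines. This is an untested sketch: the base image, Hadoop version, and paths are assumptions, and a real setup would also need SSH or direct daemon startup handled properly:

```dockerfile
# Untested sketch: base image, versions, and paths are illustrative assumptions.
FROM eclipse-temurin:8-jdk

ARG HADOOP_VERSION=3.3.6
RUN wget -qO- "https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz" \
        | tar -xz -C /opt \
    && mv /opt/hadoop-${HADOOP_VERSION} /opt/hadoop

ENV HADOOP_HOME=/opt/hadoop
ENV PATH="${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${PATH}"

# Copy in the minimal hdfs-site.xml shown above.
COPY hdfs-site.xml ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml

# Format the namenode once at build time; start HDFS when the container runs.
RUN hdfs namenode -format -nonInteractive
CMD ["sh", "-c", "${HADOOP_HOME}/sbin/start-dfs.sh && tail -f /dev/null"]
```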
I found that cloudpathlib currently only supports the prefixes ['az://', 's3://', 'gs://']. Is there any plan to support HDFS paths (hdfs://) in the future?
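For context, the prefix-based dispatch this question refers to can be illustrated with a small standalone sketch. This only mimics the idea of routing a URI prefix to a provider class; it is not cloudpathlib's actual implementation, and all names here are hypothetical:

```python
# Toy illustration of prefix-based path dispatch, in the spirit of routing
# "hdfs://..." URIs to a provider class. NOT cloudpathlib's real internals;
# all names here are hypothetical.

_registry: dict = {}

def register_path_class(prefix: str):
    """Decorator that maps a URI prefix to a path class."""
    def wrap(cls):
        _registry[prefix] = cls
        return cls
    return wrap

def dispatch(uri: str):
    """Instantiate the registered class whose prefix matches the URI."""
    for prefix, cls in _registry.items():
        if uri.startswith(prefix):
            return cls(uri)
    raise ValueError(f"no provider registered for {uri!r}")

@register_path_class("s3://")
class S3Path:
    def __init__(self, uri):
        self.uri = uri

@register_path_class("hdfs://")
class HdfsPath:
    def __init__(self, uri):
        self.uri = uri

print(type(dispatch("hdfs://namenode:8020/data/a.txt")).__name__)  # HdfsPath
```

Under this kind of scheme, adding hdfs:// support amounts to registering a new client/path class pair for the new prefix alongside the existing three.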