JuliaParallel / Elly.jl

Hadoop HDFS and Yarn client

Support for wasb:// protocol on Azure HDInsight #35

Open aviks opened 6 years ago

aviks commented 6 years ago

I can see the files using hadoop fs -ls but not using readdir. Creating a file reference with HDFSFile for a file I know exists and then calling stat on it throws Elly.HDFSException("Path not found").

sshuser@hn0-myclust:~$ hadoop fs -ls /
Found 15 items
drwxr-xr-x   - root   supergroup          0 2018-02-07 14:25 /HdiSamples
drwxr-xr-x   - hdfs   supergroup          0 2018-02-07 14:15 /ams
drwxr-xr-x   - hdfs   supergroup          0 2018-02-07 14:15 /amshbase
drwxrwxrwx   - yarn   hadoop              0 2018-02-07 14:15 /app-logs
drwxr-xr-x   - hdfs   supergroup          0 2018-02-07 14:15 /apps
drwxr-xr-x   - yarn   hadoop              0 2018-02-07 14:15 /atshistory
drwxr-xr-x   - root   supergroup          0 2018-02-07 14:24 /custom-scriptaction-logs
drwxr-xr-x   - root   supergroup          0 2018-02-07 14:25 /example
drwxr-xr-x   - hbase  supergroup          0 2018-02-07 14:15 /hbase
drwxr-xr-x   - hdfs   supergroup          0 2018-02-07 14:15 /hdp
drwxr-xr-x   - hdfs   supergroup          0 2018-02-07 14:15 /hive
drwxr-xr-x   - mapred supergroup          0 2018-02-07 14:15 /mapred
drwxrwxrwx   - mapred hadoop              0 2018-02-07 14:15 /mr-history
drwxrwxrwx   - hdfs   supergroup          0 2018-02-07 14:15 /tmp
drwxr-xr-x   - hdfs   supergroup          0 2018-02-07 14:15 /user
sshuser@hn0-myclust:~$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using Elly

julia> dfs = HDFSClient("hn0-myclust.3p0iyjauoc2e3faws152r5tm0e.cx.internal.cloudapp.net", 8020)
HDFSClient: sshuser@hn0-myclust.3p0iyjauoc2e3faws152r5tm0e.cx.internal.cloudapp.net:8020/
    id: 76ba6c80-1ac9-45
    connected: false
    pwd: /

julia> readdir(dfs)
1-element Array{AbstractString,1}:
 "tmp"

aviks commented 6 years ago

[Renamed the issue]

This happens because Azure uses a separate wasb:// protocol layered over hdfs://, with Azure blob store as the underlying storage. It will probably need to be supported explicitly within Elly.

Some background: https://blogs.msdn.microsoft.com/cindygross/2015/02/04/understanding-wasb-and-hadoop-storage-in-azure/
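
For reference, a fully qualified wasb location has the form wasb[s]://<container>@<account>.blob.core.windows.net/<path>. A minimal sketch of splitting such a URI into its parts is below. The WasbLocation type and parse_wasb function are hypothetical names used only for illustration; they are not part of Elly.jl, and the account and container in the example are made up.

```julia
# Hypothetical helper, not part of Elly.jl: split a wasb:// or wasbs:// URI
# into its container, storage account host, and path.
struct WasbLocation
    secure::Bool        # true for wasbs:// (TLS), false for wasb://
    container::String   # blob container that backs the filesystem
    account::String     # storage account host name
    path::String        # path inside the container, always starting with "/"
end

function parse_wasb(uri::AbstractString)
    m = match(r"^(wasbs?)://([^@/]+)@([^/]+)(/.*)?$", uri)
    m === nothing && throw(ArgumentError("not a fully qualified wasb URI: $uri"))
    # The path group is optional; treat a missing path as the container root.
    path = m.captures[4] === nothing ? "/" : String(m.captures[4])
    return WasbLocation(m.captures[1] == "wasbs",
                        String(m.captures[2]),
                        String(m.captures[3]),
                        path)
end

# Made-up account and container names, for illustration only.
loc = parse_wasb("wasb://mycontainer@myaccount.blob.core.windows.net/example/data.txt")
println(loc.container, " ", loc.account, " ", loc.path)
```

A client would need something like this before it could decide which storage account and container to talk to, since none of that information goes through the namenode address passed to HDFSClient.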

aviks commented 6 years ago

Similarly, HDInsight supports the adl:// protocol, which uses Azure Data Lake Store as the underlying storage engine for Hadoop. It would be good to support that as well.

tanmaykm commented 6 years ago

related:

tanmaykm commented 4 years ago

Looks like this wasb support came in with Hadoop v2.9: https://hadoop.apache.org/docs/r2.9.0/hadoop-azure/index.html#Introduction

What is not yet clear to me is whether the server transparently wraps wasb and presents an HDFS interface. If it does, we should be able to access wasb just by upgrading Elly to use the v2.9 protobuf APIs. But I am still unsure how or why that would work. Will dig a bit deeper.

tanmaykm commented 4 years ago

This looks to be implemented entirely as a client library; see the org/apache/hadoop/fs/azure/NativeAzureFileSystem.html source.

It seems to read the HDFS config, but it interacts with Azure services directly. The HDFS namenode and datanodes do not seem to be aware of this at all.

So the implementation of HDFSFile in Elly.jl can cater only to the hdfs:// filesystem, and we probably need to look at the Azure APIs to implement a NativeAzureFile along similar lines in Julia. There also doesn't seem to be any direct Azure API for this (wasb) filesystem protocol, only APIs for the blob store, so we will need to implement the filesystem metadata management in Julia as well.
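
To give a rough idea of what that metadata management involves, here is a minimal sketch of emulating readdir over a flat blob namespace. It assumes blobs are named by their full path with "/" as the separator and no leading slash, and that a blob listing is already available as a plain vector of names. The list_children function is a hypothetical name, and no Azure API is called here.

```julia
# Hypothetical sketch: emulate readdir over a flat list of blob names.
# `blobnames` stands in for whatever a blob-listing call would return.
function list_children(blobnames, dir::AbstractString)
    # Normalise the directory into a blob-name prefix: trailing slash,
    # no leading slash (the root directory becomes the empty prefix).
    prefix = endswith(dir, "/") ? dir : dir * "/"
    prefix = String(lstrip(prefix, '/'))
    children = Set{String}()
    for name in blobnames
        startswith(name, prefix) || continue
        # Safe byte offset: `prefix` is a prefix of `name`.
        rest = name[(ncodeunits(prefix) + 1):end]
        isempty(rest) && continue            # the directory entry itself
        # Only the first path component is an immediate child.
        push!(children, String(first(split(rest, '/'))))
    end
    return sort!(collect(children))
end

# Made-up blob names, for illustration only.
blobs = ["example/data/a.csv", "example/data/b.csv",
         "example/jars/x.jar", "tmp/log.txt"]
println(list_children(blobs, "/"))          # ["example", "tmp"]
println(list_children(blobs, "/example"))   # ["data", "jars"]
```

A real implementation would also have to handle empty directories, permissions, and renames, which a flat blob store does not provide on its own.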