InseeFrLab / legacy-onyxia-entrypoint

🔬 A data science oriented container launcher

[Feature Request] HDFS support for catalog (Spark, Jupyter, ...) #45

Open · kellian-cottart opened this issue 1 year ago

kellian-cottart commented 1 year ago

HDFS support for catalog

Description

If possible, I would like to be able to work with both a remote Hadoop cluster and S3 storage while using Spark in a Jupyter Notebook. This includes the possibility of:

How we currently do it

The following steps explain how we achieved an HDFS connection alongside S3 using Onyxia, Spark, and Jupyter:

We are using an external Hadoop cluster, which only requires a password to connect to.

Note: Keytabs are also widely used, but that case is not explicitly detailed in this document. The keytab would be mounted into the Jupyter pod, allowing one to kinit with it from the UI.

Step 1 - Get configuration files from Hadoop cluster

We need the following configuration files from the Hadoop cluster, located on an edge node: core-site.xml, hdfs-site.xml, hive-site.xml, and krb5.conf.
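One way to retrieve them, assuming SSH access to the edge node and that the configuration lives in the usual /etc/hadoop/conf and /etc/hive/conf directories (hostname and paths below are examples to adjust to your cluster):

# Copy the Hadoop/Hive configuration and the Kerberos config from the edge node
scp edge-node:/etc/hadoop/conf/core-site.xml .
scp edge-node:/etc/hadoop/conf/hdfs-site.xml .
scp edge-node:/etc/hive/conf/hive-site.xml .
scp edge-node:/etc/krb5.conf .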

Step 2 - S3 credentials

The files above do not include S3 credentials, which are required for the Jupyter notebook to work with S3 storage. These credentials can be found in Onyxia under My Account / Connect to storage.

We append the following properties to the Hadoop core-site.xml, replacing the $S3_* placeholders with Onyxia's S3 credentials:

<property>
    <name>fs.s3a.access.key</name>
    <value>$S3_ACCESS_KEY</value>
</property>

<property>
    <name>fs.s3a.secret.key</name>
    <value>$S3_SECRET_KEY</value>
</property>

<property>
    <name>fs.s3a.session.token</name>
    <value>$S3_SESSION_TOKEN</value>
</property>

<property>
    <name>fs.s3a.endpoint</name>
    <value>minio.demo.insee.io</value>
</property>

<property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>true</value>
</property>

<property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>

<property>
    <name>trino.s3.credentials-provider</name>
    <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>

<property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
</property>
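If you prefer to script that substitution, a minimal sketch with envsubst (from gettext) is shown below; the variable names are only examples, and the values are the ones copied from My Account / Connect to storage:

# Export the credential values copied from Onyxia (the values here are placeholders)
export S3_ACCESS_KEY="..." S3_SECRET_KEY="..." S3_SESSION_TOKEN="..."
# Substitute only these placeholders, leaving the rest of core-site.xml untouched
envsubst '$S3_ACCESS_KEY $S3_SECRET_KEY $S3_SESSION_TOKEN' < core-site.xml > core-site.filled.xml
mv core-site.filled.xml core-site.xml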

Step 3 - Upload files to cluster

These files must be available on a machine with access to your Kubernetes cluster, so that ConfigMaps can be created from them.

Step 4 - Modify Stateful set

To give the Jupyter pods access to this configuration, we need a Kubernetes ConfigMap for each file, mounted into the pods. First, we create these ConfigMaps. Set a variable with your Onyxia account username (to target the correct namespace):

ONYXIA_USER=<user>

Considering that, in this region, user namespaces are named u-<username>, run the following commands from one of the master nodes of the Kubernetes cluster:

kubectl create configmap core-site -n u-$ONYXIA_USER --from-file=core-site.xml
kubectl create configmap hdfs-site -n u-$ONYXIA_USER --from-file=hdfs-site.xml
kubectl create configmap hive-site -n u-$ONYXIA_USER --from-file=hive-site.xml
kubectl create configmap krb5 -n u-$ONYXIA_USER --from-file=krb5.conf
KUBE_EDITOR="nano" kubectl edit statefulset $(kubectl get statefulset -n u-$ONYXIA_USER | grep jupyter | awk '{ print $1 }') -n u-$ONYXIA_USER
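As a quick sanity check, the four ConfigMaps created above should now be listed in the namespace:

# All four ConfigMaps should appear
kubectl get configmap core-site hdfs-site hive-site krb5 -n u-$ONYXIA_USER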

In nano, the volumeMounts and volumes fields need to be modified. Delete the existing config-coresite and config-hivesite ConfigMap entries under volumes, and add the following under the respective fields:

volumeMounts:
  - mountPath: /opt/hadoop/etc/hadoop/hive-site.xml
    name: config-hivesite
    subPath: hive-site.xml
  - mountPath: /opt/hadoop/etc/hadoop/hdfs-site.xml
    name: config-hdfssite
    subPath: hdfs-site.xml
  - mountPath: /etc/krb5.conf
    name: krb5
    subPath: krb5.conf
volumes:
  - configMap:
      defaultMode: 420
      name: core-site
    name: config-coresite
  - configMap:
      defaultMode: 420
      name: hive-site
    name: config-hivesite
  - configMap:
      defaultMode: 420
      name: hdfs-site
    name: config-hdfssite
  - configMap:
      defaultMode: 420
      name: krb5
    name: krb5
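Editing the StatefulSet restarts the pod; once it is back up, you can check that the files are mounted as expected (the grep pattern below assumes, as above, that the pod name contains "jupyter"):

# Find the Jupyter pod and inspect the mounted configuration files
POD=$(kubectl get pod -n u-$ONYXIA_USER -o name | grep jupyter | head -n 1)
kubectl exec -n u-$ONYXIA_USER $POD -- ls /opt/hadoop/etc/hadoop/
kubectl exec -n u-$ONYXIA_USER $POD -- cat /etc/krb5.conf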

Step 5 - Authenticate with Kerberos

We authenticate against the Hadoop cluster's Kerberos using our account credentials with kinit -E, from a terminal inside the Jupyter UI. The -E option lets you select a specific user to authenticate as.

kinit -E <user>

If the cluster requires a keytab, pass it as you usually would, as sketched below.
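A minimal sketch of that keytab variant, assuming the keytab has been mounted into the pod (the path and principal below are placeholders):

# Authenticate non-interactively with a mounted keytab
kinit -kt /path/to/user.keytab user@EXAMPLE.REALM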

Once you have entered your password, you can run Spark jobs using HDFS from the Jupyter UI.
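A quick way to verify the setup from the same terminal, assuming the Hadoop client binaries are available in the image (the /opt/hadoop mount paths above suggest they are) and replacing the bucket name with one of your own:

# Check the Kerberos ticket obtained with kinit
klist
# List the root of the remote HDFS cluster (uses core-site.xml / hdfs-site.xml)
hdfs dfs -ls /
# List an S3 bucket through the S3A connector configured in core-site.xml
hadoop fs -ls s3a://my-bucket/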

Step 6 - Certificates if custom certificate authority

For S3 compatibility, we need to make sure that the certificates are trusted by the pod. Using nano, create a file named ca.pem and paste into it the contents of the certificate authority (or the full chain).

As root (sudo su), run:

cat ca.pem >> /etc/ssl/certs/ca-certificates.crt

Then, as the jovyan user (sudo su jovyan), run:

sudo cp ca.pem /usr/local/share/ca-certificates/ca.crt
sudo update-ca-certificates
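To confirm the certificate is now trusted, a simple TLS check against the S3 endpoint configured in core-site.xml can be done with curl:

# Should complete the TLS handshake without certificate errors
curl -sS -o /dev/null https://minio.demo.insee.io && echo "TLS OK"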

Once these steps are done, it should be good to go.

Possible implementation

Then, the user would only have to kinit and start playing around with the notebook.