apache-spark-on-k8s / kubernetes-HDFS

Repository holding configuration files for running an HDFS cluster in Kubernetes
Apache License 2.0

Support kerberized secure Hadoop. #23

Closed kimoonkim closed 7 years ago

kimoonkim commented 7 years ago

Vanilla Hadoop is not secure without Kerberos. Intruders can easily use custom client code to present false user names and steal or destroy data.

I am sending a PR soon. The main design ideas are:

  1. Mount krb5.conf from a K8s ConfigMap on the namenode and datanode pods (a sketch follows below).
  2. Specify a number of Hadoop security config keys in the charts.
  3. Also mount a K8s secret containing keytab files on an init container, which will then copy the desired keytab file in place.
  4. For datanodes, we need jsvc to bind to privileged ports. Use a K8s init container to copy jsvc in place.

We use a K8s secret for (3) so it is easy to apply access control using RBAC. A secret has a size limit of 1 MB, and each keytab file is less than 1 KB, so a secret can hold up to about 1000 keytab files. If the cluster is larger, then we can switch to a PVC backed by NFS, etc.
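For illustration of (1), a minimal pod manifest could look like the sketch below. The ConfigMap name, pod name, and image are hypothetical placeholders, not the ones used in the PR:

apiVersion: v1
kind: Pod
metadata:
  name: hdfs-namenode-example
spec:
  volumes:
    - name: krb5-conf
      configMap:
        name: hdfs-krb5-conf            # hypothetical ConfigMap created from krb5.conf
  containers:
    - name: hdfs-namenode
      image: example/hadoop-namenode    # placeholder image
      volumeMounts:
        - name: krb5-conf
          mountPath: /etc/krb5.conf     # where the Kerberos libraries look
          subPath: krb5.conf            # mount only the single file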

For more details on the how-to, see the README.md files in the PR. Please feel free to ask questions or make suggestions about the design approach.

The prototype seems to work :-) From the namenode log:

$ kubectl logs hdfs-namenode-0
...
The reported blocks 651 has reached the threshold 0.9990 of total blocks 651. The number of live datanodes 10 has reached the minimum number 0. In safe mode extension. Safe mode will be turned off automatically in 9 seconds.
17/08/29 21:40:19 INFO hdfs.StateChange: STATE* Leaving safe mode after 55 secs
17/08/29 21:40:19 INFO hdfs.StateChange: STATE* Safe mode is OFF
17/08/29 21:40:19 INFO hdfs.StateChange: STATE* Network topology has 1 racks and 10 datanodes

I was also able to run Spark jobs against this secure HDFS, using apache-spark-on-k8s PR #414 that has secure Hadoop support.

$ java -cp /usr/local/hadoop/conf/:'/usr/local/spark-on-k8s/jars/*' org.apache.hadoop.fs.FsShell -rm -R -f hdfs://kube-n1:8020/tmp/hosts-kimoonkim
2017-08-29 14:42:26 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-08-29 14:42:27 INFO TrashPolicyDefault:92 - Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted hdfs://kube-n1:8020/tmp/hosts-kimoonkim

$ HADOOP_CONF_DIR=/usr/local/hadoop/conf /usr/local/spark-on-k8s/bin/spark-submit \
    --class org.apache.spark.examples.DFSReadWriteTest \
    --conf spark.app.name=spark-dfsrw \
    --conf spark.hadoop.fs.defaultFS=hdfs://kube-n1:8020 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.kerberos=true \
    local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.3.0-SNAPSHOT.jar /etc/hosts /tmp/hosts-kimoonkim
...
17/08/29 14:57:30 INFO LoggingPodStatusWatcherImpl: Container final statuses:

 Container name: spark-kubernetes-driver
 Container image: docker:5000/spark-driver:custom-kimoon-0811-6
 Container state: Terminated
 Exit code: 0

17/08/29 14:57:30 INFO Client: Application spark-dfsrw finished.

From the driver log:

$ kubectl logs spark-dfsrw-1504043831581-driver
Success! Local Word Count (20) and DFS Word Count (20) agree.

The core-site.xml and hdfs-site.xml in /usr/local/hadoop/conf had the following content:


$ cat /usr/local/hadoop/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
  <property>
    <name>hadoop.rpc.protection</name>
    <value>privacy</value>
  </property>
</configuration>

$ cat /usr/local/hadoop/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020</value>
  </property>
  <property>
    <name>dfs.namenode.kerberos.principal</name>
    <value>hdfs/kube-n1.pepperdata.com@PEPPERDATA.COM</value>
  </property>
</configuration>

@ifilonenko @liyinan926 @foxish

erikerlandson commented 7 years ago

As you mentioned above, use of hostPath isn't ideal. Is the problem installing a secret for each node?

kimoonkim commented 7 years ago

I can imagine creating a secret per node with a unique name, perhaps by including the cluster node name in the secret name.

The problem is more about how the K8s DaemonSet YAML file refers to a different secret name per node. AFAIK, there is no macro mechanism for expanding a common name in a K8s resource file into multiple different names depending on which cluster node each DaemonSet pod lands on. I was hoping Helm charts might have something for this, but it doesn't seem so.

Does anyone know better? :-)

liyinan926 commented 7 years ago

@kimoonkim can you elaborate on what you meant by keytab files on different cluster nodes having different passwords as their content? Why should different nodes have different keytab files?

foxish commented 7 years ago

Question - what happens when the pod host-name changes because an hdfs-data-node is killed and respawned? Would that make the KDC reject the newly spawned hdfs-data-node?

If you have access to a 1.7.x cluster, can you try using the local PVs? They're alpha, but if we need specific features to support use-cases like this, now would be the time to find those out.

kimoonkim commented 7 years ago

@kimoonkim can you elaborate on what you meant by keytab files on different cluster nodes having different passwords as their content? Why should different nodes have different keytab files?

Sure. In Kerberos, the convention is that every service daemon instance on a different host gets a unique principal account by including the hostname in the principal name, e.g. datanode 1 on hostA will have hdfs/hostA@MYREALM as its account, and datanode 2 on hostB will have hdfs/hostB@MYREALM. Since they are different accounts, they are supposed to have different passwords. This way, if a single keytab file is compromised by an intruder, you only lose the data managed by that particular daemon instance. (Or that particular daemon instance can become a phishing server.) In contrast, if you use a single account and password for all daemon instances, you can lose all data across cluster nodes.

kimoonkim commented 7 years ago

Question - what happens when the pod host-name changes because an hdfs-data-node is killed and respawned? Would that make the KDC reject the newly spawned hdfs-data-node?

Datanodes use hostNetwork, so their identities are not tied to pod names. Instead, they identify with the underlying cluster node host names. When a datanode pod is recreated, it will still use the correct cluster node host name in its principal and so won't be rejected by the KDC.
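For reference, the relevant part of such a datanode DaemonSet looks roughly like the sketch below; the image name is a placeholder:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hdfs-datanode
spec:
  selector:
    matchLabels:
      app: hdfs-datanode
  template:
    metadata:
      labels:
        app: hdfs-datanode
    spec:
      hostNetwork: true                   # datanode identifies as the cluster node's hostname
      dnsPolicy: ClusterFirstWithHostNet  # keep cluster DNS while on the host network
      containers:
        - name: datanode
          image: example/hadoop-datanode  # placeholder image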

kimoonkim commented 7 years ago

If you have access to a 1.7.x cluster, can you try using the local PVs? They're alpha, but if we need specific features to support use-cases like this, now would be the time to find those out.

I like the suggestion. I think using local PVs is a better solution in terms of access control. I'd love to try it out. I didn't know 1.7 had them in alpha. Thanks!

kimoonkim commented 7 years ago

@erikerlandson @liyinan926 @foxish

Instead of using hostPath for the keytab files, I wonder if we can mount a single K8s secret or PVC, whichever the user specifies. The single secret or PVC will be shared by the namenode and all the datanode pods. It will have all the keytab files stored in it, named after the cluster node host names.
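A hypothetical shape of that single shared secret (the secret name and the keytab naming scheme are illustrative, not necessarily what the PR ends up using):

apiVersion: v1
kind: Secret
metadata:
  name: hdfs-kerberos-keytabs        # hypothetical secret name
type: Opaque
data:
  # one entry per cluster node, keyed by the node host name; values are the
  # base64-encoded keytab bytes (elided here), and so on for every node
  kube-n1.pepperdata.com.keytab: ""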

Then each pod will use an init container to copy the keytab file matching its cluster node host name to the right path that the daemon is expecting.

A K8s secret has a 1 MB size limit. I checked a few keytab files; their size is ~1 KB, so a single secret can support up to about 1000 keytab files. If your cluster is larger, then you can switch to a PVC. One could use whatever PV type for this; I plan to test it using an NFS PV.

A nice thing about this approach is that access control of a secret or PVC, e.g. via RBAC, is more streamlined than restricting hostPath privileges.

What do you think? Any concerns?

liyinan926 commented 7 years ago

I personally like the idea of using a secret. The only concern I have is that once the secret gets mounted to a pod, it makes every keytab file locally available to the pod. We can definitely copy the right keytab file out and set HADOOP_TOKEN_FILE_LOCATION to point to it. But what about the rest of the keytab files, should we delete them from the mount path?

kimoonkim commented 7 years ago

Since this is a single shared secret or PV, deleting other keytab files may affect other pods.

Ideally, each pod would want to unmount the volume after it copied the right keytab file. That might not be easy, or it might require a non-default service account.

Fundamentally, this approach assumes that the shared secret and the HDFS pods are well guarded by RBAC. Otherwise, you end up exposing keytab files. I understand that risking exposure of only one keytab at a time is better than otherwise. But it's likely the cluster admin would set up RBAC for all of them in the right way.

liyinan926 commented 7 years ago

Sorry for the confusion. I meant deleting the locally mounted keytab files. This will not change the secret itself.

kimoonkim commented 7 years ago

@liyinan926 Would it even be possible to delete only the locally mounted files without affecting the secret itself? Maybe the pod would get an access error? What about a PVC volume? Is it possible to delete only the locally mounted files without affecting the remote files?

liyinan926 commented 7 years ago

@foxish correct me if I'm wrong. I believe it's possible to delete locally mounted files (stored in memory locally) without affecting the secret itself, unless the secret object gets updated through the API server.

kimoonkim commented 7 years ago

@liyinan926 I think we can mount the secret only in the init-container, because volumeMounts is container-specific. (It is spec.containers[].volumeMounts[])

Then the init-container will copy the matching keytab file to a mount path of an emptyDir volume that is mounted both by the init-container and the datanode container. After the init-container is done, the secret will be unmounted automatically.
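A rough pod-level sketch of that mount scoping, reusing the hypothetical hdfs-kerberos-keytabs secret from the earlier sketch (names, paths, and image are placeholders, and it assumes the secret keys match the node names reported by spec.nodeName):

apiVersion: v1
kind: Pod
metadata:
  name: hdfs-datanode-example
spec:
  volumes:
    - name: kerberos-keytabs
      secret:
        secretName: hdfs-kerberos-keytabs   # one keytab per cluster node, keyed by host name
    - name: keytab-copy
      emptyDir: {}                          # scratch space shared by init and main containers
  initContainers:
    - name: copy-keytab
      image: busybox
      env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName      # the cluster node this pod landed on
      # copy only the keytab matching this node; the secret is mounted in this container only
      command: ["sh", "-c", "cp /kerberos-keytabs/${MY_NODE_NAME}.keytab /keytab-copy/hdfs.keytab"]
      volumeMounts:
        - name: kerberos-keytabs
          mountPath: /kerberos-keytabs
        - name: keytab-copy
          mountPath: /keytab-copy
  containers:
    - name: datanode
      image: example/hadoop-datanode        # placeholder image
      volumeMounts:
        - name: keytab-copy                 # only the single copied keytab is visible here
          mountPath: /etc/security/keytabs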

I think #24 is already doing something similar for copying jsvc to the datanode container using emptyDir.

kimoonkim commented 7 years ago

Switched #24 to the secret/init-container approach and it seems to work! Thanks @erikerlandson, @foxish and @liyinan926 for the inspiration.

liyinan926 commented 7 years ago

@kimoonkim Cool! Did you try deleting the unwanted keytab files?

kimoonkim commented 7 years ago

No, we didn't need to. We mount the secret only in the init-container. After the init-container finishes, the other keytab files will be unmounted and disappear from the pod. So no need to delete them. See the previous comment or the latest commit for details.

liyinan926 commented 7 years ago

Ah, I didn't see it. This is great!