Trunk Data Platform (TDP) is a free, open source Hadoop distribution.
This distribution is built by EDF (a French electricity provider) and DGFIP (the tax office of the French Ministry of Finance) through an association called TOSIT (The Open Source I Trust).
TDP is built from the source code of Apache projects.
The TDP project is composed of multiple repositories:
Each component of TDP also has its own repository.
The following table shows the core components of TDP, the Apache branch each one is based on, and the TDP branch that serves as the base for our releases.
Component | Version | Apache Git branch | TDP Git Branch | TDP commits |
---|---|---|---|---|
Apache ZooKeeper | 3.4.6 | release-3.4.6 | XXX | X.X.X |
Apache Hadoop | 3.1.1-0.0 | rel/release-3.1.1 | branch-3.1.1-TDP | compare |
Apache Hive | 3.1.3-1.0 | branch-3.1 | branch-3.1-TDP | compare |
Apache Hive 2 (for Spark 3) | 2.3.9-1.0 | branch-2.3 | branch-2.3-TDP | compare |
Apache Hive 1 (for Spark 2) | 1.2.3-1.0 | branch-1.2 | branch-1.2-TDP | compare |
Apache Tez | 0.9.1-1.0 | branch-0.9.1 | branch-0.9.1-TDP | compare |
Apache Spark | 2.3.4-1.0 | branch-2.3 | branch-2.3-TDP | compare |
Apache Spark 3 | 3.2.2-0.0 | branch-3.2 | branch-3.2-TDP | compare |
Apache Ranger | 2.0.0-1.0 | ranger-2.0 | ranger-2.0-TDP | compare |
Apache Solr (for Ranger) | 7.7.3 | releases/lucene-solr/7.7.3 | XXX | X.X.X |
Apache HBase | 2.1.10-1.0 | branch-2.1 | branch-2.1-TDP | compare |
Apache Phoenix | 5.1.3-1.0 | 5.1 | 5.1.3-TDP | compare |
Apache Phoenix Query Server | 6.0.0-0.0 | 6.0.0 | 6.0.0-TDP | compare |
Apache Knox | 1.6.1-0.0 | v1.6.1 | v1.6.1-TDP | compare |
Apache HBase Connectors | 1.0.0-0.0 | rel/1.0.0 | branch-2.3.4-1.0.0-TDP | compare |
Apache HBase Operator tools | 1.1.0-0.0 | rel/1.1.0 | branch-1.1.0-TDP | compare |
Versions are approximately based on the HDP 3.1.5 release.
Note: For some projects, the Apache Software Foundation maintains a branch of the component onto which fixes and features are backported. We will use these branches as much as possible, provided they are maintained and compatible.
"TDP Extras" carries some projects that cannot be integrated into "TDP Core". A project can be kept outside of the core for different reasons:
Component | Version | Apache Git branch | TDP Git Branch | TDP commits |
---|---|---|---|---|
Apache ZooKeeper 3.5.9 (for Kafka) | 3.5.9 | release-3.5.9 | XXX | X.X.X |
Apache Kafka | 2.8.2 | 2.8 | 2.8-TDP | compare |
Apache Livy | 0.8.0 | master | branch-0.8.0-TDP | compare |
Apache Airflow | 2.2.2 | 2.2.2 | XXX | X.X.X |
Note: A project can graduate from "TDP Extras" to "TDP Core" if enough people are supporting it and/or if it is made compatible with all the other projects of the stack.
Only bare-metal and virtual machine deployments are tested. Container-based operating systems may work but are not guaranteed.
Red Hat-like operating systems may work but are not guaranteed.
Every TDP initial release is built from a reference branch on the Apache Git repository according to the above tables. The main change from the original branches is the version declaration in the pom.xml files.
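The version change can be illustrated with a small sketch. Judging from the tables above, a TDP version appears to be the upstream Apache version followed by a TDP suffix (for example `3.1.1` and `0.0` giving `3.1.1-0.0`). The helper names below are hypothetical, and the use of the Versions Maven Plugin's `versions:set` goal is an assumption: this document does not say which tool rewrites the `pom.xml` files. The script only prints the Maven command; it does not run it.

```shell
#!/bin/sh
# Hypothetical helper: compose a TDP version from an upstream Apache version
# and a TDP suffix, following the pattern seen in the component tables.
tdp_version() {
  printf '%s-%s\n' "$1" "$2"
}

# Print (do not run) a Maven command that would rewrite the version declared
# in the pom.xml files. Using versions:set here is an assumption, not the
# documented TDP procedure.
print_set_version_cmd() {
  printf 'mvn versions:set -DnewVersion=%s -DgenerateBackupPoms=false\n' \
    "$(tdp_version "$1" "$2")"
}

# Example: the Apache Hadoop row (upstream 3.1.1, TDP suffix 0.0).
print_set_version_cmd 3.1.1 0.0
```

Some components may need further changes beyond the version declaration, so treat the printed command as a starting point only.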
The builds and unit tests of each component's Maven projects can run in Kubernetes pods scheduled by a Jenkins installation that also runs on Kubernetes. Scheduling each build in its own pod keeps builds isolated and reproducible. Jenkins' strong integration with the Java ecosystem makes it a good fit for building the components of the distribution.
Kubernetes was installed on Ubuntu 20.04 Virtual Machines with kubeadm.
Note: It is strongly recommended to deploy a Storage Class in order to have persistence on the Kubernetes cluster (useful for Jenkins among others). In our case, we are using Rook on physical drives attached to the Kubernetes cluster's VMs.
Jenkins is used to trigger the builds, which follow the same process for every component of the stack:
Jenkins was installed on the Kubernetes cluster with the official jenkinsci Helm chart.
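A minimal sketch of such an installation with Helm 3 is shown below. The chart repository URL is the one published by the Jenkins project; the release name and namespace are illustrative defaults and not necessarily the values used for the TDP build cluster. The script prints the commands by default and only executes them when `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Install Jenkins from the jenkinsci Helm chart.
# Release name and namespace are assumptions chosen for illustration.
install_jenkins() {
  release="${1:-jenkins}"
  namespace="${2:-jenkins}"
  run helm repo add jenkins https://charts.jenkins.io
  run helm repo update
  run helm install "$release" jenkins/jenkins \
    --namespace "$namespace" --create-namespace
}

install_jenkins
```

Chart values such as persistence settings or the agent pod templates would normally be passed with a values file; they are left out here because this document does not describe them.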
The building environment needs multiple registries:
Nexus Repository OSS can assume all three roles and is free and open source.
Nexus OSS was installed on the Kubernetes cluster with the Helm chart provided by Oteemo.
It is possible to run a local environment for builds and small-scale testing.
Prerequisite:
You can start a local build environment with the `bin/start-build-env.sh` script.
Note: See `build-env/README.md` for details.
To build TDP component binaries, attach to the running `tdp-builder` container and `git clone` the TDP component repository into it. Each TDP component's `tdp/README.md` has custom instructions to launch the build process.
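The steps above can be sketched as follows, assuming the build environment runs as a Docker container named `tdp-builder` and that `git` is available inside it. Both points are assumptions; `build-env/README.md` remains the reference. The repository URL is a placeholder to replace with the component you want to build. The script prints the commands by default and only executes them when `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Clone a component repository inside the running container, then open an
# interactive shell in it. Assumes Docker and a container named tdp-builder.
clone_in_builder() {
  repo_url="${1:?usage: clone_in_builder <repository-url>}"
  run docker exec tdp-builder git clone "$repo_url"
  run docker exec -it tdp-builder bash
}

# Placeholder URL: substitute the repository of an actual TDP component.
clone_in_builder "https://example.org/tdp-component.git"
```

Once inside the container, follow the component's own `tdp/README.md` for the actual build command, since it differs from one component to another.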
Assign a directory path to the `TDP_HOME` variable in `bin/start-build-env.sh` to control the local path of built TDP binaries.
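As a small sketch, the helper below creates a directory and prints its absolute path, which can then be assigned to `TDP_HOME`. The function name and the example location are hypothetical, and how `TDP_HOME` is declared inside `bin/start-build-env.sh` is not shown in this document, so check the script before editing it.

```shell
#!/bin/sh
# Hypothetical helper: create the directory that will receive the built
# binaries and print its absolute path, ready to be used as TDP_HOME.
prepare_tdp_home() {
  dir="${1:?usage: prepare_tdp_home <directory>}"
  mkdir -p "$dir" && (cd "$dir" && pwd)
}

# Example location only; any writable path on the host works.
prepare_tdp_home "${TMPDIR:-/tmp}/tdp-binaries"
```

Using an absolute path avoids surprises if the start script is launched from a different working directory.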