gchq / sleeper

A cloud-native, serverless, scalable, cheap key-value store
Apache License 2.0

Minimise Hadoop dependencies #1985

Open patchwork01 opened 8 months ago

patchwork01 commented 8 months ago

Background

We're using Hadoop for our integration with Apache Parquet to store data files, and for bulk import on Apache Spark. In both cases this puts a lot of dependencies on the classpath that aren't actually used. They seem to mainly be there to support HDFS, which we're not using.

A number of Hadoop's dependencies have vulnerabilities that have shown up in the dependency checker. We are unable to upgrade several of them because Hadoop depends on specific versions, and some of them are only used by parts of Hadoop that we don't use.

This also causes a problem with the size of our deployed jars, as in theory we're already past AWS Lambda's code size limit (250 MB unzipped for the deployment package).

This prevented us from instrumenting for OpenTelemetry, as our lambdas didn't have space left for the JVM agent.
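As an illustration of the kind of trimming this would involve, here's a Maven sketch that excludes some HDFS- and YARN-related transitives. This assumes we pull Hadoop in via hadoop-client; the artifact IDs are examples only, and the real set to exclude would need to be confirmed against our dependency tree (e.g. with `mvn dependency:tree`).

```xml
<!-- Sketch only: assumes Hadoop comes in via hadoop-client. The exclusions
     below are illustrative; confirm the actual unused transitives with
     `mvn dependency:tree` before excluding anything. -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
    <exclusions>
        <!-- HDFS client support, unused when we only read/write S3 -->
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs-client</artifactId>
        </exclusion>
        <!-- YARN/MapReduce support, unused when nothing runs on a Hadoop cluster -->
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-client</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```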

Description

We'd like to minimise the number of Hadoop dependencies on our classpath, and remove support for HDFS, since we don't use it.

Analysis

Once these dependencies are removed, we can update our dependency check suppressions to drop any that are no longer needed.
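For context, assuming the checker is OWASP dependency-check, the suppressions live in an XML file, and entries like the following could be deleted once the HDFS-only dependencies are gone. The package pattern and CVE here are placeholders, not real findings.

```xml
<!-- Hypothetical suppression entry, assuming OWASP dependency-check.
     The package pattern and CVE are placeholders. Entries like this
     could be deleted once the dependency is off the classpath. -->
<suppress>
    <notes>Only reachable via HDFS support, which Sleeper does not use.</notes>
    <packageUrl regex="true">^pkg:maven/org\.apache\.hadoop/hadoop\-hdfs.*$</packageUrl>
    <cve>CVE-0000-00000</cve>
</suppress>
```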

We're using Hadoop for our integration with Parquet, and for bulk import jobs on Spark.

The Parquet integration only talks to S3 in practice, although the Hadoop integration is designed for HDFS, which may be the reason for several of the dependencies, e.g. hadoop-auth.
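If that holds, the Parquet path may only need Hadoop's common APIs plus the s3a filesystem, rather than the full client. A Maven sketch of that minimal set follows; whether hadoop-common can actually stand in for hadoop-client here is an assumption that would need testing, and the version properties are placeholders.

```xml
<!-- Sketch of a minimal set for Parquet-on-S3: hadoop-common supplies the
     FileSystem/Configuration APIs Parquet uses, hadoop-aws supplies the
     s3a:// implementation. Assumption only; would need testing against
     our code. Version properties are placeholders. -->
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>${parquet.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>${hadoop.version}</version>
</dependency>
```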

Spark also seems to only use Hadoop authentication when talking to HDFS: https://spark.apache.org/docs/latest/security.html#kerberos

patchwork01 commented 7 months ago

On hold: we may want to leave this until later if we don't need detailed tracing to test the transaction log state store. We might also prefer to switch the lambdas to deploy as Docker images rather than untangling the Hadoop dependencies.