exasol / cloud-storage-extension

Exasol Cloud Storage Extension for accessing formatted data (Avro, Orc, and Parquet) on public cloud storage systems
MIT License

Bug when reading from Azure Data Lake Gen2 with delta format #195

Closed: morazow closed this 2 years ago

morazow commented 2 years ago

Situation

We get the following error when reading from Azure Data Lake Gen2 storage using the delta format.

VM error: F-UDF-CL-LIB-1127: F-UDF-CL-SL-JAVA-1002: F-UDF-CL-SL-JAVA-1013:
com.exasol.ExaUDFException: F-UDF-CL-SL-JAVA-1080: Exception during run
com.google.common.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/ObjectMapper
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2053)
com.google.common.cache.LocalCache.get(LocalCache.java:3966)
com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4863)
org.apache.spark.sql.delta.DeltaLog$.apply(DeltaLog.scala:562)
org.apache.spark.sql.delta.DeltaLog$.forTable(DeltaLog.scala:453)
com.exasol.cloudetl.bucket.Bucket.getPathsFromDeltaLog(Bucket.scala:85)
com.exasol.cloudetl.bucket.Bucket.getPaths(Bucket.scala:78)
com.exasol.cloudetl.emitter.FilesMetadataEmitter.<init>(FilesMetadataEmitter.scala:27)
com.exasol.cloudetl.scriptclasses.FilesMetadataReader$.run(FilesMetadataReader.scala:31)
com.exasol.cloudetl.scriptclasses.FilesMetadataReader.run(FilesMetadataReader.scala...

This happens because the org.codehaus.jackson:jackson-mapper-asl:1.9.13 dependency is excluded, and its replacement, com.fasterxml.jackson.core:jackson-databind:2.13.1, is not used by the affected code paths.
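For context, a minimal sketch of what such an exclusion typically looks like in an sbt build. The dependency coordinates are real, but the hadoop-azure version and the build file itself are assumptions for illustration, not the project's actual configuration:

// Hypothetical build.sbt fragment: excluding the legacy Jackson 1.x mapper
// from hadoop-azure. Excluding it without wiring in a working replacement is
// what produces the NoClassDefFoundError above.
libraryDependencies += ("org.apache.hadoop" % "hadoop-azure" % "3.3.1")
  .exclude("org.codehaus.jackson", "jackson-mapper-asl")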

Acceptance Criteria

morazow commented 2 years ago

I have looked into this further.

The issue is with the Hadoop Azure library, which still depends on the old Jackson 1.x artifact.

172.21.0.2:54518> Caused by: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/ObjectMapper
172.21.0.2:54518> org.apache.hadoop.fs.azurebfs.services.AbfsHttpOperation.parseListFilesResponse(AbfsHttpOperation.java:528)
172.21.0.2:54518> org.apache.hadoop.fs.azurebfs.services.AbfsHttpOperation.processResponse(AbfsHttpOperation.java:391)
172.21.0.2:54518> org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:290)
172.21.0.2:54518> org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute(AbfsRestOperation.java:217)
172.21.0.2:54518> org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0(AbfsRestOperation.java:191)
172.21.0.2:54518> org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation(IOStatisticsBinding.java:464)
172.21.0.2:54518> org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:189)
172.21.0.2:54518> org.apache.hadoop.fs.azurebfs.services.AbfsClient.listPath(AbfsClient.java:302)
172.21.0.2:54518> org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:1054)
172.21.0.2:54518> org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:1024)
172.21.0.2:54518> org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:1006)
172.21.0.2:54518> org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listStatus(AzureBlobFileSystem.java:490)
172.21.0.2:54518> org.apache.spark.sql.delta.storage.HadoopFileSystemLogStore.listFrom(HadoopFileSystemLogStore.scala:83)
172.21.0.2:54518> org.apache.spark.sql.delta.SnapshotManagement.listFrom(SnapshotManagement.scala:62)
172.21.0.2:54518> org.apache.spark.sql.delta.SnapshotManagement.listFrom$(SnapshotManagement.scala:61)
172.21.0.2:54518> org.apache.spark.sql.delta.DeltaLog.listFrom(DeltaLog.scala:62)
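To confirm that the legacy class is missing from the UDF classpath, a small probe like the following can be run in the same environment. Only the class name comes from the stack trace; the surrounding harness is illustrative:

// Minimal classpath probe: checks whether the legacy Jackson 1.x ObjectMapper
// required by AbfsHttpOperation.parseListFilesResponse is resolvable.
object JacksonProbe {
  def main(args: Array[String]): Unit =
    try {
      Class.forName("org.codehaus.jackson.map.ObjectMapper")
      println("legacy Jackson 1.x mapper is on the classpath")
    } catch {
      case _: ClassNotFoundException =>
        println("org.codehaus.jackson.map.ObjectMapper is missing; ABFS list calls will fail")
    }
}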

There is an effort to replace the older Jackson versions in HADOOP-16908 (corresponding pull request PR 3789), but that will only be included in the upcoming 3.4.0 release.

For now, we are going to include org.codehaus.jackson:jackson-mapper-asl:1.9.13 again and suppress its reported vulnerabilities.
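A minimal sketch of this workaround, assuming an sbt build (the coordinates come from this issue; the build file is illustrative). The vulnerability suppression itself would live in the dependency checker's suppression file and is not shown here:

// Hypothetical build.sbt fragment: pinning the legacy Jackson 1.x mapper back
// onto the classpath so hadoop-azure can parse ABFS list responses.
libraryDependencies += "org.codehaus.jackson" % "jackson-mapper-asl" % "1.9.13"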


Import query to reproduce the above exception. With DATA_FORMAT = 'DELTA', the metadata reader resolves input paths through the Delta log (Bucket.getPathsFromDeltaLog above), which is where the ABFS listing fails:

IMPORT INTO TEST.TEST1
FROM SCRIPT CLOUD_STORAGE_EXTENSION.IMPORT_PATH WITH
  BUCKET_PATH     = 'abfss://container@storageaccount.dfs.core.windows.net/2m5/*'
  DATA_FORMAT     = 'DELTA'
  CONNECTION_NAME = 'AZURE_ABFS_CONNECTION'
  TRUNCATE_STRING = 'true'
  PARALLELISM     = 'nproc()*2';