exasol / cloud-storage-extension

Exasol Cloud Storage Extension for accessing formatted data (Avro, Orc, and Parquet) on public cloud storage systems
MIT License

Check that S3 export works with buckets that contain dots #190

Closed morazow closed 2 years ago

morazow commented 2 years ago

Situation

EXPORT test.t1
INTO SCRIPT CLOUD_STORAGE_EXTENSION.EXPORT_PATH WITH
  BUCKET_PATH     = 's3a://exa.test.aws.s3.bucket.etl.01/'
  DATA_FORMAT     = 'PARQUET'
  S3_ENDPOINT     = 's3.eu-west-1.amazonaws.com'
  CONNECTION_NAME = 'S3_CONNECTION'
  PARALLELISM     = 'iproc()'
  OVERWRITE = 'TRUE'
;

Exception:

EXA: EXPORT test.t1...
Error: [22002] VM error: F-UDF-CL-LIB-1126: F-UDF-CL-SL-JAVA-1006: F-UDF-CL-SL-JAVA-1026: 
com.exasol.ExaUDFException: F-UDF-CL-SL-JAVA-1068: Exception during singleCall generateSqlForExportSpec 
java.lang.IllegalArgumentException: bucket
org.apache.hadoop.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:144)
org.apache.hadoop.fs.s3a.S3AUtils.propagateBucketOptions(S3AUtils.java:1152)
org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:374)
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
com.exasol.cloudetl.bucket.Bucket.fileSystem$lzycompute(Bucket.scala:70)
com.exasol.cloudetl.bucket.Bucket.fileSystem(Bucket.scala:69)
com.exasol.cloudetl.scriptclasses.TableExportQueryGenerator$.deleteBucketPathIfRequired(TableExportQueryGenerator.scala:50)
com.exasol.cloudetl.scriptclasses.TableExportQueryGenerator$.generateSqlForExportSpec(TableExportQueryGenerator.scala:28)
com.exasol.cloudetl.scriptclasses.DockerTableExportQueryGenerator$.generateSqlForExportSpec(DockerTableExportQueryGenerator.scala:17)
com.exasol.cloudetl.scriptclasses.DockerTableExportQueryGenerator.generateSqlForExportSpec(DockerTableExportQueryGenerator.scala)
com.exasol.ExaWrapper.runSingleCall(ExaWrapper.java:100)
redcatbear commented 2 years ago

@morazow, I remember that @jakobbraun recently solved the dot-in-filenames issue in the VS thanks to the new Connection definitions with split bucket path components. Is the same fix applicable here?

morazow commented 2 years ago

The same fix is also applied here: #120. The above issue was reported when exporting, and the only strange thing was the bucket name. At the moment I am not sure what causes this issue, but the dots in the name may be the reason.

The exception line, S3AUtils.java#L1152, just checks that the bucket name is not an empty string.
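That check also explains the cryptic error message: when the bucket name resolved from the URI is null or empty, the precondition throws an `IllegalArgumentException` whose message is just the literal string `"bucket"`. A minimal plain-Java stand-in (simplified; the real method is Hadoop's shaded Guava `Preconditions.checkArgument`):

```java
public class CheckDemo {
    // Simplified stand-in for the shaded Guava Preconditions.checkArgument
    // used by Hadoop in
    // org.apache.hadoop.thirdparty.com.google.common.base.Preconditions.
    static void checkArgument(boolean condition, Object errorMessage) {
        if (!condition) {
            throw new IllegalArgumentException(String.valueOf(errorMessage));
        }
    }

    public static void main(String[] args) {
        // URI.getHost() returns null for the failing bucket path, so the
        // "not empty" check fails with the bare message "bucket".
        String bucket = null;
        try {
            checkArgument(bucket != null && !bucket.isEmpty(), "bucket");
        } catch (IllegalArgumentException exception) {
            System.out.println(exception.getMessage()); // prints "bucket"
        }
    }
}
```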

So splitting and reassembling might still help; I am going to check it.

morazow commented 2 years ago

Hey all, I have looked into this issue. The main reason for the failure is that the java.net.URI getHost method does not work for bucket names that end in numbers.

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.equalTo;
import static org.hamcrest.Matchers.is;
import static org.hamcrest.Matchers.notNullValue;

import java.net.URI;
import java.net.URISyntaxException;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;

    @ParameterizedTest
    @CsvSource({ //
            "s3a://exa.test.aws.s3.bucket.01.etl/", //
            "s3a://bucket.name.dots.007.s3.amazonaws.com/", //
            "s3a://007.s3.amazonaws.com/", //
            "s3a://007/", //
            "s3a://007L/", //
    })
    void testS3BucketURIValid(final String bucketPath) throws URISyntaxException {
        final URI uri = new URI(bucketPath);
        assertThat(uri.getHost(), is(notNullValue()));
    }

    @ParameterizedTest
    @CsvSource({ //
            "s3a://exa.test.aws.s3.bucket.etl.01/", //
            "s3a://exa.test.aws.s3.bucket.etl.01/key", //
            "s3a://bucket.name.dots.007/", //
            "s3a://pre.007/", //
    })
    void testS3BucketURIInvalid(final String bucketPath) throws URISyntaxException {
        final URI uri = new URI(bucketPath);
        assertThat(uri.getHost(), equalTo(null));
    }

The AWS SDK should also fail for all schemes other than s3. For the s3 scheme, it uses the URI authority.
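The difference is easy to reproduce with java.net.URI directly: for a bucket name whose last dot-separated label starts with a digit, getHost() returns null, while getAuthority() still carries the full bucket name. A small demonstration using the same bucket paths as the tests above:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriHostDemo {
    public static void main(String[] args) throws URISyntaxException {
        // Last label is alphabetic: parsed as a server-based authority.
        URI valid = new URI("s3a://exa.test.aws.s3.bucket.01.etl/");
        System.out.println(valid.getHost()); // exa.test.aws.s3.bucket.01.etl

        // Last label starts with a digit: java.net.URI falls back to a
        // registry-based authority, so getHost() is null ...
        URI invalid = new URI("s3a://exa.test.aws.s3.bucket.etl.01/");
        System.out.println(invalid.getHost()); // null

        // ... but getAuthority() still returns the full bucket name.
        System.out.println(invalid.getAuthority()); // exa.test.aws.s3.bucket.etl.01
    }
}
```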

From S3 bucket naming rules:

So maybe it is not allowed to end a bucket name with a number.

I am going to add an early check for this to the project, with a user-friendly exception.
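A possible shape for such a check (a hedged sketch only: the class name, method name, and message wording are hypothetical, not the actual project API):

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical early validation: reject bucket paths whose authority cannot
// be parsed as a URI host, before the path ever reaches the Hadoop S3A
// filesystem initialization.
public class BucketPathValidator {
    public static void validate(String bucketPath) {
        final URI uri;
        try {
            uri = new URI(bucketPath);
        } catch (URISyntaxException exception) {
            throw new IllegalArgumentException(
                    "Invalid bucket path '" + bucketPath + "'.", exception);
        }
        if (uri.getHost() == null) {
            throw new IllegalArgumentException("Invalid bucket path '" + bucketPath
                    + "': the bucket name could not be parsed as a URI host. Bucket names"
                    + " whose last dot-separated segment starts with a digit are known to"
                    + " fail here.");
        }
    }
}
```

Run early in the export path, a check like this would surface the offending bucket path instead of the opaque `IllegalArgumentException: bucket` raised deep inside Hadoop.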

jakobbraun commented 2 years ago

You could also take the chance and switch to the unified API...

morazow commented 2 years ago

That would be really good, but I am not aware of any JVM library that can unify them.

The only one is hadoop-tools, but even there GCS is provided separately.