CODAIT / stocator

Stocator is a high-performing connector to object storage for Apache Spark, achieving performance by leveraging object storage semantics.
Apache License 2.0

Cannot use append mode when writing spark dataframe on Watson Studio #197

Open charles2588 opened 6 years ago

charles2588 commented 6 years ago

Write the file once

              .option("codec", "")\
              .save(cos.url('TESTAPPEND/CARS', 'catalogdsxreproduce4a77ab6a4f2f47b3b6bedc7174a64c4a'))

The first append-mode write succeeds.

Writing again in append mode fails:

              .option("codec", "")\
              .save(cos.url('TESTAPPEND/CARS', 'catalogdsxreproduce4a77ab6a4f2f47b3b6bedc7174a64c4a'))

Py4JJavaError: An error occurred while calling :
org.apache.hadoop.fs.FileAlreadyExistsException: mkdir on existing directory cos://catalogdsxreproduce4a77ab6a4f2f47b3b6bedc7174a64c4a.os_a9bbfb9f99684afe9ec11076b75f1831_configs/TESTAPPEND/CARS
  at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.setupJob(
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp

Full notebook:-

Looking at the append method in the connector code, I see that append is not supported:

  public FSDataOutputStream append(Path f, int bufferSize,
      Progressable progress) throws IOException {
    throw new IOException("Append is not supported in the object storage");
  }

If append is not supported, is there a workaround? Or maybe the connector should throw an error stating that append is not supported, rather than the FileAlreadyExistsException above.
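Until the connector reports this more clearly, a caller-side guard can fail fast with a readable message. The following is only a minimal sketch: `check_write_mode`, `AppendNotSupportedError`, and the `APPEND_UNSUPPORTED_SCHEMES` set are hypothetical names made up for illustration and are not part of Stocator or Spark. It also assumes that every `cos://` URL is served by a connector that rejects append.

```python
from urllib.parse import urlparse

# Assumption: URL schemes whose connector does not support append.
# Only "cos" appears in this issue; extend at your own discretion.
APPEND_UNSUPPORTED_SCHEMES = {"cos"}


class AppendNotSupportedError(ValueError):
    """Hypothetical error raised before Spark is asked to append."""


def check_write_mode(url, mode):
    """Return url unchanged, or raise if mode is 'append' on an unsupported scheme."""
    scheme = urlparse(url).scheme.lower()
    if mode.lower() == "append" and scheme in APPEND_UNSUPPORTED_SCHEMES:
        raise AppendNotSupportedError(
            "Append is not supported for '%s://' targets: %s" % (scheme, url)
        )
    return url
```

The returned URL can be passed straight to `.save(...)`, so the check runs before any Spark job is set up.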

gilv commented 6 years ago

@charles2588 thanks for reporting this. In general, append + object storage is usually a bad idea, no matter which connector you use. I will review the issue you observed to better understand the root cause and propose the best solution.
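One pattern sometimes used to avoid append on object storage is to write each batch under its own prefix and read all batches back with a glob. This is not something proposed in this thread. It is a sketch under assumptions: `batch_url` and `all_batches_glob` are hypothetical helpers, the `batch-<id>` naming is arbitrary, and whether glob reads behave well with a given connector would need to be verified.

```python
import uuid


def batch_url(base_url, batch_id=None):
    """Build a unique per-batch target so no write touches an existing path."""
    # A random hex id is used when the caller does not supply one.
    batch_id = batch_id or uuid.uuid4().hex
    return "%s/batch-%s" % (base_url.rstrip("/"), batch_id)


def all_batches_glob(base_url):
    """Build a glob pattern matching every batch written under base_url."""
    return "%s/batch-*" % base_url.rstrip("/")


# Intended use with Spark (not executed here; df and spark are assumed to exist):
#   df.write.format("parquet").save(batch_url(base))
#   spark.read.format("parquet").load(all_batches_glob(base))
```

Each write then creates a new directory, so the job setup never attempts to create a path that already exists.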