`S3ObjectStorageSink` parameter `skipPartitionAttributes` does not remove partition attributes

joemirizio commented 3 years ago

As a user, I want to write partitioned data to S3 without the data containing the partition fields.

I am able to write partitioned Parquet data to S3 using the `S3ObjectStorageSink`, but the written data still contains the partition attributes. Setting `skipPartitionAttributes` seems to have no effect: the fields are written anyway.

Test Case

To reproduce, I created a simple SPL example.

Note that the sink sets `skipPartitionAttributes: true ;`.

```
namespace com.namespace ;

use com.ibm.streamsx.objectstorage.s3::S3ObjectStorageSink ;

composite PartitionTest {
  param
    expression<rstring> $s3Endpoint : getSubmissionTimeValue("s3Endpoint") ;
    expression<rstring> $s3Bucket : getSubmissionTimeValue("s3Bucket") ;
    expression<rstring> $s3AccessKeyID :
      getSubmissionTimeValue("s3AccessKeyID") ;
    expression<rstring> $s3SecretAccessKey :
      getSubmissionTimeValue("s3SecretAccessKey") ;

  graph
    stream<rstring id, int32 year, int32 month, int32 day> PartitionedData = Beacon() {
      param
        iterations: 10u ;
      output
        PartitionedData:
          id = (rstring)(random() * 10.0),
          year = (int32)2000,
          month = (int32)1,
          day = (int32)1;
    }

    () as ParquetData = S3ObjectStorageSink(PartitionedData)
    {
      param
        storageFormat : "parquet" ;
        objectName : "streams-test/obs-%OBJECTNUM.parquet" ;
        endpoint : $s3Endpoint ;
        bucket : $s3Bucket ;
        accessKeyID : $s3AccessKeyID ;
        secretAccessKey : $s3SecretAccessKey ;
        sslEnabled : false ;
        parquetCompression : "SNAPPY" ;
        parquetEnableDict : true ;
        tuplesPerObject : (int64)10 ;
        partitionValueAttributes: "year", "month", "day" ;
        skipPartitionAttributes: true ;
    }
}
```

This correctly creates a file at s3://mybucket/streams-test/year=2000/month=1/day=1/obs-0.parquet.
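The object key above follows the Hive-style `name=value` partition layout. As a rough illustration of how such a key relates to the tuple and the `objectName` pattern, here is a minimal Python sketch. It assumes the partition directories are inserted between the directory part of `objectName` and the file name; `partitioned_object_name` is a hypothetical helper, not part of the toolkit.

```python
import posixpath


def partitioned_object_name(object_name, tuple_values, partition_attrs, object_num):
    """Build a Hive-style key: <dir>/<attr>=<value>/.../<file> (illustrative only)."""
    directory, filename = posixpath.split(object_name)
    # Substitute the object counter placeholder used in the objectName pattern.
    filename = filename.replace("%OBJECTNUM", str(object_num))
    # One "name=value" directory per partition attribute, in declaration order.
    partitions = [f"{name}={tuple_values[name]}" for name in partition_attrs]
    return posixpath.join(directory, *partitions, filename)


row = {"id": "0.344703700393438", "year": 2000, "month": 1, "day": 1}
key = partitioned_object_name(
    "streams-test/obs-%OBJECTNUM.parquet", row, ["year", "month", "day"], 0
)
print(key)
```

Because the partition values are already encoded in the key, repeating them inside the file is redundant, which is what `skipPartitionAttributes` is expected to avoid.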

Viewing the parquet file with parquet-tools, the partition attributes (year, month, day) are still present in the data.

```
{"id":"0.344703700393438","year":2000,"month":1,"day":1}
{"id":"4.51050510630012","year":2000,"month":1,"day":1}
{"id":"5.76032183133066","year":2000,"month":1,"day":1}
{"id":"1.7222639080137","year":2000,"month":1,"day":1}
{"id":"0.329583613201976","year":2000,"month":1,"day":1}
{"id":"4.96657963376492","year":2000,"month":1,"day":1}
{"id":"9.96496303007007","year":2000,"month":1,"day":1}
{"id":"9.75459394976497","year":2000,"month":1,"day":1}
{"id":"5.66759054549038","year":2000,"month":1,"day":1}
{"id":"5.80386402085423","year":2000,"month":1,"day":1}
```
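For anyone re-checking this after an upgrade, the inspection can be automated. The following is a small Python sketch that scans JSON-lines output like the dump above and reports which partition attributes still appear in the records. It assumes the dump has already been captured as text; `leaked_partition_attrs` is a hypothetical helper name.

```python
import json


def leaked_partition_attrs(json_lines, partition_attrs):
    """Return the partition attribute names that still occur in any record."""
    wanted = set(partition_attrs)
    leaked = set()
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the dump
        record = json.loads(line)
        # Iterating a dict yields its keys, so this intersects with the field names.
        leaked |= wanted.intersection(record)
    return leaked


dump = """
{"id":"0.344703700393438","year":2000,"month":1,"day":1}
{"id":"4.51050510630012","year":2000,"month":1,"day":1}
"""
leaked = leaked_partition_attrs(dump, ["year", "month", "day"])
print(sorted(leaked))
```

An empty result would indicate that the partition fields were dropped from the data as expected.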

Is this argument not working as intended or am I misunderstanding its purpose?

markheger commented 3 years ago

The parameter is ignored during schema generation. When `skipPartitionAttributes` is true, Parquet schema generation should skip the attributes listed in `partitionValueAttributes`.
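The intended behavior can be sketched in a few lines of Python. This is only an illustration of the rule stated above, not the toolkit's actual Java implementation; the function name `parquet_schema_fields` and the `(name, type)` tuple representation are assumptions made for the example.

```python
def parquet_schema_fields(tuple_schema, partition_attrs, skip_partition_attrs):
    """Select the fields to write, given a tuple schema as (name, type) pairs."""
    if not skip_partition_attrs:
        # Parameter off: every attribute of the input tuple is written.
        return list(tuple_schema)
    excluded = set(partition_attrs)
    # Parameter on: drop the attributes already encoded in the object path.
    return [(name, type_) for name, type_ in tuple_schema if name not in excluded]


schema = [("id", "rstring"), ("year", "int32"), ("month", "int32"), ("day", "int32")]
partition_attrs = ["year", "month", "day"]

print(parquet_schema_fields(schema, partition_attrs, skip_partition_attrs=True))
print(parquet_schema_fields(schema, partition_attrs, skip_partition_attrs=False))
```

With the test case from this issue, only `id` would remain in the written schema once the flag is honored.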

markheger commented 3 years ago

Resolved in v2.2.4: https://github.com/IBMStreams/streamsx.objectstorage/releases/tag/v2.2.4

IBMStreams / streamsx.objectstorage, issue #229