IBMStreams / streamsx.objectstorage

The com.ibm.streamsx.objectstorage toolkit supports Object Storage services with S3 API like IBM Cloud Object Storage service.
https://ibmstreams.github.io/streamsx.objectstorage

`S3ObjectStorageSink` parameter `skipPartitionAttributes` does not remove partition attributes #229

Closed. joemirizio closed this issue 3 years ago

joemirizio commented 3 years ago

As a user I want to write partitioned data to S3 without the data containing the partition fields.

I am able to write partitioned Parquet data to S3 using the S3ObjectStorageSink operator, but the written data still contains the partition attributes. Setting skipPartitionAttributes to true seems to have no effect: the fields are still written.

Test Case

To reproduce, I created a simple SPL example.

Note the setting skipPartitionAttributes: true ; in the sink's parameters.

namespace com.namespace ;

use com.ibm.streamsx.objectstorage.s3::S3ObjectStorageSink ;

composite PartitionTest {
  param
    expression<rstring> $s3Endpoint : getSubmissionTimeValue("s3Endpoint") ;
    expression<rstring> $s3Bucket : getSubmissionTimeValue("s3Bucket") ;
    expression<rstring> $s3AccessKeyID :
      getSubmissionTimeValue("s3AccessKeyID") ;
    expression<rstring> $s3SecretAccessKey :
      getSubmissionTimeValue("s3SecretAccessKey") ;

  graph
    stream<rstring id, int32 year, int32 month, int32 day> PartitionedData = Beacon() {
      param
        iterations: 10u ;
      output
        PartitionedData:
          id = (rstring)(random() * 10.0),
          year = (int32)2000,
          month = (int32)1,
          day = (int32)1;
    }

    () as ParquetData = S3ObjectStorageSink(PartitionedData)
    {
      param
        storageFormat : "parquet" ;
        objectName : "streams-test/obs-%OBJECTNUM.parquet" ;
        endpoint : $s3Endpoint ;
        bucket : $s3Bucket ;
        accessKeyID : $s3AccessKeyID ;
        secretAccessKey : $s3SecretAccessKey ;
        sslEnabled : false ;
        parquetCompression : "SNAPPY" ;
        parquetEnableDict : true ;
        tuplesPerObject : (int64)10 ;
        partitionValueAttributes: "year", "month", "day" ;
        skipPartitionAttributes: true ;
    }
}

This correctly creates a file at s3://mybucket/streams-test/year=2000/month=1/day=1/obs-0.parquet.
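
The object name above follows a Hive-style partition layout: one name=value directory per partition attribute, placed before the final component of objectName. As a rough illustration only, the mapping seen in this example can be sketched in Python. The function name partitioned_object_name is hypothetical and is not part of the toolkit, and the toolkit's real path handling may differ.

```python
def partitioned_object_name(object_name, partition_values):
    """Insert name=value directories before the last path component.

    Hypothetical helper that mirrors the layout observed in this issue;
    it is not the toolkit's implementation.
    """
    prefix, sep, leaf = object_name.rpartition("/")
    parts = [f"{name}={value}" for name, value in partition_values]
    # Keep the prefix only when the object name actually contained a "/".
    return "/".join(([prefix] if sep else []) + parts + [leaf])


print(partitioned_object_name(
    "streams-test/obs-0.parquet",
    [("year", 2000), ("month", 1), ("day", 1)],
))
# prints streams-test/year=2000/month=1/day=1/obs-0.parquet
```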

Viewing the Parquet file with parquet-tools shows that the partition attributes (year, month, day) are still present in the data:

{"id":"0.344703700393438","year":2000,"month":1,"day":1}
{"id":"4.51050510630012","year":2000,"month":1,"day":1}
{"id":"5.76032183133066","year":2000,"month":1,"day":1}
{"id":"1.7222639080137","year":2000,"month":1,"day":1}
{"id":"0.329583613201976","year":2000,"month":1,"day":1}
{"id":"4.96657963376492","year":2000,"month":1,"day":1}
{"id":"9.96496303007007","year":2000,"month":1,"day":1}
{"id":"9.75459394976497","year":2000,"month":1,"day":1}
{"id":"5.66759054549038","year":2000,"month":1,"day":1}
{"id":"5.80386402085423","year":2000,"month":1,"day":1}

Is this parameter not working as intended, or am I misunderstanding its purpose?

markheger commented 3 years ago

The parameter is currently ignored during Parquet schema generation. When skipPartitionAttributes is true, schema generation should skip the attributes listed in partitionValueAttributes.
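
The intended behavior can be sketched as follows. This is a minimal Python illustration, not the toolkit's actual (Java) code: the function name build_parquet_schema, the message name tuple, and the SPL-to-Parquet type mapping are all assumptions made for the example.

```python
# Assumed mapping from a few SPL types to Parquet primitive types.
SPL_TO_PARQUET = {
    "rstring": "binary",
    "int32": "int32",
    "int64": "int64",
    "float64": "double",
    "boolean": "boolean",
}


def build_parquet_schema(attributes, partition_value_attributes,
                         skip_partition_attributes):
    """Build a Parquet message-type string from (name, spl_type) pairs.

    Partition attributes are left out of the schema only when
    skip_partition_attributes is true.
    """
    skipped = set(partition_value_attributes) if skip_partition_attributes else set()
    fields = []
    for name, spl_type in attributes:
        if name in skipped:
            continue  # the value is already encoded in the object path
        fields.append(f"  optional {SPL_TO_PARQUET[spl_type]} {name};")
    return "message tuple {\n" + "\n".join(fields) + "\n}"


attrs = [("id", "rstring"), ("year", "int32"), ("month", "int32"), ("day", "int32")]
print(build_parquet_schema(attrs, ["year", "month", "day"], True))
```

With the stream schema from the test case, only the id field remains in the generated schema when the flag is true. With the flag set to false, all four attributes are kept.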

markheger commented 3 years ago

Resolved in v2.2.4: https://github.com/IBMStreams/streamsx.objectstorage/releases/tag/v2.2.4