awslabs / kinesis-aggregation

AWS libraries/modules for working with Kinesis aggregated record data
Apache License 2.0
378 stars 153 forks source link

Feature Request: Support for non-MD5 hashing #180

Closed DylanKroft closed 10 months ago

DylanKroft commented 1 year ago

As part of some work we are doing, we are not able to use MD5 hashing at all. However, this package currently only supports MD5 hashing in python/aws_kinesis_agg:

`def _serialize_to_bytes(self): """Serialize this record to bytes. Has no side effects (i.e. does not affect the contents of this record object).

    Returns: 
        A byte array containing a aggregated Kinesis record. (binary str)"""

    message_body = self.agg_record.SerializeToString()

    md5_calc = hashlib.md5()
    md5_calc.update(message_body)
    calculated_digest = md5_calc.digest()

    return aws_kinesis_agg.MAGIC + message_body + calculated_digest`

and

    if aws_kinesis_agg.MAGIC != data_magic or len(decoded_data_no_magic) <= aws_kinesis_agg.DIGEST_SIZE:
        is_aggregated = False

    if is_aggregated:            

        # verify the MD5 digest
        message_digest = decoded_data_no_magic[-aws_kinesis_agg.DIGEST_SIZE:]
        message_data = decoded_data_no_magic[:-aws_kinesis_agg.DIGEST_SIZE]

        md5_calc = hashlib.md5()
        md5_calc.update(message_data)
        calculated_digest = md5_calc.digest()

        if message_digest != calculated_digest:            
            is_aggregated = False            

Would it be possible to provide support for alternative forms of hashing?

IanMeyers commented 1 year ago

Unfortunately the KCL pollers use MD5 for explicit hash key to shard mapping. If an alternate hashing mechanism would use, these wouldn't align and messages would very likely be dropped. Not really possible to workaround at this time.

IanMeyers commented 1 year ago

Having said that, if no KCL is used in your application (and no AWS Lambda), then this could be modified within the aggregation and deaggregation libraries alone. Happy to take a PR that provides this as a non-default configuration.

IanMeyers commented 10 months ago

Closing due to inactivity