awslabs / kinesis-aggregation

AWS libraries/modules for working with Kinesis aggregated record data
Apache License 2.0
378 stars 153 forks source link

python lib is not maintained (no python3 support) #22

Closed itamarla closed 6 years ago

itamarla commented 7 years ago

Is there a reason for the python part not being available for python3.x? There is lambda support for python 3.6 yet we will not be able to use it if this lib is not updated.

Thanks, Itamar

ghost commented 7 years ago

The short answer is that when this project was developed, AWS Lambda didn't actually support Python3. That's a fairly recent release (https://aws.amazon.com/about-aws/whats-new/2017/04/aws-lambda-supports-python-3-6/) and you're the first person to ask about it.

We'll work on getting support for that in.

jessmchung commented 6 years ago

Any updates on adding support?

ghost commented 6 years ago

No, sorry @jessmchung , had lost track of this. I'll try to start looking into it in the next week or so. What level of urgency do you have?

jessmchung commented 6 years ago

Definitely within the next two months pls!

ghost commented 6 years ago

@jessmchung my current plan is to add concurrent Python 2.7.x and Python 3.x support to the existing Python code by adding a dependency on the six library (https://pypi.python.org/pypi/six) as well as get some unit tests added in. Does that sound reasonable?

jessmchung commented 6 years ago

@brentnash That sounds excellent!

ghost commented 6 years ago

Ongoing work happening here:

https://github.com/awslabs/kinesis-aggregation/tree/py3-compat

ghost commented 6 years ago

First round of updates is complete. There's a working unit test that runs an end-to-end aggregate & deaggregate successfully on both Python 2.7.x and Python 3.6.x. (Have not merged back to master yet)

I'm working on adding some more unit testing and will try to do some end-to-end testing.

I will probably bump the version number to 1.1 when I push to PyPi.

quiver commented 6 years ago

I really appreciate Py3 support.

Tested py3-compat branch and worked like a charm except for one.

For some py3 environments, pip is named pip3(or the like). In that case, make_lambda_build.py fails with 'You do not have "pip" installed or it is not on your PATH.` error.

For example, on upcoming AL2, if you install py3 with $ sudo amazon-linux-extras install python3, pips are installed as follows:

$ ls /usr/bin/ | grep pip
pip3
pip-3
pip-3.6
pip3.6
ghost commented 6 years ago

Thank you for taking a look at it @quiver ! Much appreciated.

Nice catch on the pip3 thing.

I checked in a change to invoke the pip module directly which I think should get around the issue. i.e.:

python -m pip install -r requirements.txt

Give it a shot if you get a chance. I tested with python 2 & python 3 on my system successfully.

quiver commented 6 years ago

@brentnash

python -m pip install -r requirements.txt

This hardcoded approach does not work if python is linked to Py2 and its target runtime is Py3.

What about using sys.executable instead?

This should work as long as make_lambda_build.py is invoked from the corresponding interpreter.

$ git diff -u
diff --git a/python/make_lambda_build.py b/python/make_lambda_build.py
index 41aa05a..97a2857 100644
--- a/python/make_lambda_build.py
+++ b/python/make_lambda_build.py
@@ -97,7 +97,7 @@ def install_dependencies():
     print('')
     print('Installing necessary modules from pip...')
     requirements_file = os.path.join(proj_dir, REQUIREMENTS_FILE_NAME)
-    pip_install_cmd = 'python -m pip install -r {} -t "{}"'.format(requirements_file, build_dir)
+    pip_install_cmd = '{} -m pip install -r {} -t "{}"'.format(sys.executable, requirements_file, build_dir)
     print(pip_install_cmd)
     pip_install_cmd_line_result = subprocess.call(pip_install_cmd, shell=True)
     if pip_install_cmd_line_result != 0:

Tested with

ghost commented 6 years ago

@quiver I forgot sys.executable existed. That is a much better answer. I'll add that instead.

ghost commented 6 years ago

@jessmchung I'm still going to add some more tests, but did you want to try out the py3-compat branch before I merge to master and upload to PyPi?

jessmchung commented 6 years ago

@brentnash Yes! Will let you know asap!

ghost commented 6 years ago

Not urgent, but if you have a chance, I always appreciate extra data/validation points. Thanks!

jessmchung commented 6 years ago

Whoops - meant to respond! We tested this. It looks good for our usecases.

ghost commented 6 years ago

Great! My Mon. & Tues. this week are madness, but I'll try to wrap things up, get some more test coverage and get this pushed to PyPi by the end of the week. Thanks!

jsullivanLI commented 6 years ago

This is working for us as well! Any update on when it will be released?

ghost commented 6 years ago

Apologies, got swamped. Likely early next week at this point.

jarrellmark commented 6 years ago

Great to hear @brentnash looking forward to the Python 3 support!! Will it be available on pip?

ghost commented 6 years ago

That's the plan!

ghost commented 6 years ago

As an update to anyone watching, I'm working on this today, so I'll update with progress later on.

jessmchung commented 6 years ago

YAY!!! bitmoji

ghost commented 6 years ago

Added a bunch more unit tests:

https://github.com/awslabs/kinesis-aggregation/commits/py3-compat

I generated a bunch of data with the actual Kinesis Producer Library (KPL) to compare against.

Working on merging to master now and then I'll start on the PyPi distribution.

ghost commented 6 years ago

Live on Test PyPi at the moment:

https://testpypi.python.org/pypi/aws_kinesis_agg

Running some last tests.

ghost commented 6 years ago

Alright, new version should be live:

https://pypi.python.org/pypi/aws_kinesis_agg/1.1.0

Note the new version number is v1.1.0 (was previously 1.0.1). I've done some basic verification via both Python 2.7.14 and Python 3.6.4. Give it a shot and let me know how it goes for all of you (@jessmchung @jarrellmark @jsullivanLI)

I'll leave this issue open until we get some independent confirmation. I may also make a few documentation updates as well.

jarrellmark commented 6 years ago

Thanks for the port, @brentnash !

I had to make a slight modification to my code to get it working in python 3.6. I'm not sure if this is a bug just pointing it out. A sample is below demonstrating the change:

import json
import uuid

import boto3
import aws_kinesis_agg.aggregator

kinesis_client = None

def send_record(agg_record):
    global kinesis_client

    partition_key, explicit_hash_key, data = agg_record.get_contents()

    print("partition_key: {partition_key}".format(partition_key=partition_key))
    print("explicit_hash_key: {explicit_hash_key}".format(explicit_hash_key=explicit_hash_key))
    print("data: {data}".format(data=data))

    """
    The commented out code below worked in python 2.7
    """

    # kinesis_client.put_record(
    #     StreamName='STREAM_NAME',
    #     Data=data,
    #     PartitionKey=partition_key,
    #     ExplicitHashKey=explicit_hash_key)

    kinesis_client.put_record(
        StreamName='STREAM_NAME',
        Data=data,
        PartitionKey=partition_key)

if __name__ == '__main__':
    print(json.dumps({
        'id': str(uuid.uuid4()),
        'message': "Sample Message"}))

    session = boto3.Session(profile_name='default')
    kinesis_client = session.client('kinesis')

    kinesis_agg = aws_kinesis_agg.aggregator.RecordAggregator()
    kinesis_agg.on_record_complete(send_record)

    for message_number in range(5):
        partition_key = str(uuid.uuid4())

        kinesis_agg.add_user_record(
            partition_key=partition_key,
            data=json.dumps({
                    'id': partition_key,
                    'message': "Message #{message_number}".format(message_number=message_number)}) \
                .encode('utf-8'))

    send_record(kinesis_agg.clear_and_get())

It looks like the older add_user_record method calculated its own explicit_hash_key value if None was given.

ghost commented 6 years ago

@jarrellmark The good news is that this was indeed intentional, but definitely needs to be better documented in this project. Part of the reason I bumped the version from 1.0 to 1.1 was that I wanted to imply there might be some changes in functionality.

Let me try to run down the rationale briefly (in case you're interested)...

The Kinesis Producer Library (KPL) has code (https://github.com/awslabs/amazon-kinesis-producer/blob/master/aws/kinesis/core/user_record.cc#L22) that appears to generate an explicit hash key (EHK) from a partition key (PK) when one isn't present. That's what this project was doing previously (although the calculation was actually incorrect so I fixed that too). In reality though, I ran a bunch of tests with the actual KPL and if you give it a null EHK, it doesn't auto-generate one, it literally just doesn't put an EHK into the aggregated message and you re-receive it as null on the other end via the Kinesis Client Library (KCL) when the aggregated record is unpacked. So as far as I can tell, in normal circumstances, that code isn't actually used.

Furthermore, as describe in https://github.com/awslabs/kinesis-aggregation/issues/11, specifying your own EHKs can actually cause some unintended and unpleasant side effects if you're using the actual Kinesis Client Library (KCL). If you don't specify an EHK, it will just create one for itself by hashing the PK. You can find the concerning behavior here: https://github.com/awslabs/amazon-kinesis-client/blob/master/src/main/java/com/amazonaws/services/kinesis/clientlibrary/types/UserRecord.java#L243. The fallout is essentially that if any of your PKs or EHKs for your user records inside the aggregated record don't match the shard the aggregated record itself is sent to, the KCL will drop them. So in that sense, not sending an EHK on your behalf seems like a safer bet (in addition to being more in-line with the actual KPL). FWIW, I don't believe the Python deaggregation modules in this project have that issue, but it's worth knowing about.

Sorry, that was a lot more text than I thought it was going to be. Anyway, that's the rationale; if you disagree or have other ideas, I'm happy to hear them.

jarrellmark commented 6 years ago

Sounds good, thanks for the explanation, @brentnash.

ghost commented 6 years ago

It's been a few week with no further comments, so I'm going to go ahead and resolve this. Feel free to reopen if there are any further issues.