DrGFreeman / dynamo-pandas

Make working with pandas data and AWS DynamoDB easy
https://dynamo-pandas.readthedocs.io/en/stable/
MIT License

Tables with GSI & LSI? #54

Open SecretSquirrel-origami opened 3 years ago

SecretSquirrel-origami commented 3 years ago

Hi, firstly this package looks like it could really make my life easier, so thanks for putting the time in! I'm not a DynamoDB expert, so sorry if this is a stupid error on my part. I'm receiving a client error when working with 'get_df' on DynamoDB tables that have either a GSI or LSI: "An error occurred (ValidationException) when calling the BatchGetItem operation: The provided key element does not match the schema"

Following your examples, it works for all tables that don't have a GSI or LSI. Should I be using a different "keys" / query structure for those tables?

DrGFreeman commented 3 years ago

Hi @SecretSquirrel-origami, Thanks for the feedback.

I haven't tried using tables with GSI or LSI yet. I'll try to reproduce the issue when I get a minute.

DrGFreeman commented 3 years ago

After reviewing the boto3 DynamoDB docs, it appears that the DynamoDB resource.batch_get_item function used by the get_df function does not support GSIs or LSIs.

The client.query, client.scan, table.query and table.scan functions support the IndexName parameter required to specify an alternative index.

If that can help, you can use the boto3 functions to query your GSI or LSI and still benefit from the data type conversion classes in the dynamo_pandas.serde module, as in the example below.

  1. Create a table with a Global Secondary Index and populate it with data from the election dataset included in the plotly library.

    import boto3
    
    import pandas as pd
    import plotly
    
    import dynamo_pandas as dp
    
    # Load the dataset
    df = plotly.data.election()
    
    # Create the table
    table_name = "elections"
    
    ddb_client = boto3.client("dynamodb")
    
    ddb_client.create_table(
        TableName=table_name,
        AttributeDefinitions=[
            {
                "AttributeName": "district",
                "AttributeType": "S",
            },
            {
                "AttributeName": "winner",
                "AttributeType": "S",
            },
        ],
        KeySchema=[
            {
                "AttributeName": "district",
                "KeyType": "HASH",
            }
        ],
        GlobalSecondaryIndexes=[
            {
                "IndexName": "winner",
                "KeySchema": [
                    {
                        "AttributeName": "winner",
                        "KeyType": "HASH",
                    },
                ],
                "Projection": {
                    "ProjectionType": "ALL"
                },
                "ProvisionedThroughput": {
                    'ReadCapacityUnits': 5,
                    'WriteCapacityUnits': 5
                },
            },
        ],
        ProvisionedThroughput={
            'ReadCapacityUnits': 5,
            'WriteCapacityUnits': 5
        },
    )
    
    # Wait for the table creation to complete, then put the data in the table
    ddb_client.get_waiter("table_exists").wait(TableName=table_name)
    dp.put_df(df, table=table_name)
  2. Query the Global Secondary Index

    from boto3.dynamodb.conditions import Key
    
    ddb_resource = boto3.resource("dynamodb")
    table = ddb_resource.Table(table_name)
    
    items = table.query(
        IndexName="winner",
        KeyConditionExpression=Key("winner").eq("Joly"),
    )["Items"]

    This returns items with numerical values as Decimal type which, when loaded into a pandas dataframe, have object dtype instead of integers / floats:

    [{'winner': 'Joly',
      'Joly': Decimal('1908'),
      'district_id': Decimal('43'),
      'district': '43-Fort-Rolland',
      'total': Decimal('4438'),
      'result': 'plurality',
      'Coderre': Decimal('1325'),
      'Bergeron': Decimal('1205')}]
  3. Convert the data types using the dynamo_pandas.serde serialization/deserialization classes and load into a pandas dataframe

    from dynamo_pandas.serde import TypeSerializer, TypeDeserializer
    
    ts = TypeSerializer()
    td = TypeDeserializer()
    
    items = td.deserialize(ts.serialize(items))
    
    new_df = pd.DataFrame(items)

    new_df now has int64 dtype for the numerical columns.

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 14 entries, 0 to 13
    Data columns (total 8 columns):
     #   Column       Non-Null Count  Dtype 
    ---  ------       --------------  ----- 
     0   winner       14 non-null     object
     1   Joly         14 non-null     int64 
     2   district_id  14 non-null     int64 
     3   district     14 non-null     object
     4   total        14 non-null     int64 
     5   result       14 non-null     object
     6   Coderre      14 non-null     int64 
     7   Bergeron     14 non-null     int64 
    dtypes: int64(5), object(3)
    memory usage: 1.0+ KB
DrGFreeman commented 3 years ago

I'll think about how I could implement a new query_df function that would leverage the boto3 table.query function and allow users to specify the index to use (IndexName parameter) and the query parameters (KeyConditionExpression parameter).

This function would have the advantage (over plain boto3 calls) of automatically handling paging and data type conversion of results.
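
A rough sketch of what such a function could look like, reusing the dynamo_pandas.serde classes from the example above (the query_df name, signature and implementation below are only an illustration, not existing library code):

    import boto3
    import pandas as pd
    
    from dynamo_pandas.serde import TypeSerializer, TypeDeserializer
    
    
    def query_df(*, table, key_condition_expression, index_name=None):
        """Hypothetical sketch: query a table (or one of its secondary
        indexes) and return the results as a pandas dataframe, handling
        pagination and numerical data type conversion."""
        boto3_table = boto3.resource("dynamodb").Table(table)
    
        kwargs = dict(KeyConditionExpression=key_condition_expression)
        if index_name is not None:
            kwargs["IndexName"] = index_name
    
        # DynamoDB paginates query results; follow LastEvaluatedKey until
        # all pages have been retrieved.
        items = []
        while True:
            response = boto3_table.query(**kwargs)
            items.extend(response["Items"])
            if "LastEvaluatedKey" not in response:
                break
            kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
    
        # Round-trip through the serde classes to convert Decimal values
        # to int / float before building the dataframe.
        ts, td = TypeSerializer(), TypeDeserializer()
        return pd.DataFrame(td.deserialize(ts.serialize(items)))

With this, the query from step 2 above could be written as:

    from boto3.dynamodb.conditions import Key
    
    new_df = query_df(
        table="elections",
        index_name="winner",
        key_condition_expression=Key("winner").eq("Joly"),
    )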

@SecretSquirrel-origami, let me know if you think this would be useful.

SecretSquirrel-origami commented 3 years ago

Thanks for taking the time to investigate this further. Your suggested workaround using boto3 functions and 'table.query' is what I had to do, although your use of 'dynamo_pandas.serde' is a useful snippet which would also help in the short term.

Regarding a new "query_df" function: yes, in my opinion that would be super useful. Thanks again!

chrismrutherford commented 3 years ago

I use a DynamoDB sort key for working with time series data, and being able to transparently transfer time series data between a pandas dataframe and DynamoDB would be really useful (and amazing). Is there any way you could implement features that allow get/put using partition + sort key, so that working with DynamoDB time series data would be easier? i.e. query key conditions such as eq | le | lt | ge | gt | begins_with | between.

DrGFreeman commented 3 years ago

@chrismrutherford, thanks for the feedback and suggestion. I'll consider including a key_condition parameter in the new query_df function I proposed above.

In the meantime, as I suggested to @SecretSquirrel-origami above, you can still leverage some of the library's functionality to perform the data type conversions for you.
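
For example, here is a minimal sketch of a range query on a partition + sort key combined with the same data type conversion step (the measurements table and its sensor_id / timestamp keys below are hypothetical):

    import boto3
    import pandas as pd
    
    from boto3.dynamodb.conditions import Key
    from dynamo_pandas.serde import TypeSerializer, TypeDeserializer
    
    # Hypothetical time series table with a "sensor_id" partition key and a
    # "timestamp" sort key.
    table = boto3.resource("dynamodb").Table("measurements")
    
    # Range query on the sort key (between can be replaced by eq, lt, lte,
    # gt, gte or begins_with).
    items = table.query(
        KeyConditionExpression=(
            Key("sensor_id").eq("sensor-1")
            & Key("timestamp").between("2021-01-01", "2021-01-31")
        ),
    )["Items"]
    
    # Convert the Decimal values to int / float and load into a dataframe.
    ts, td = TypeSerializer(), TypeDeserializer()
    df = pd.DataFrame(td.deserialize(ts.serialize(items)))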

DrGFreeman commented 3 years ago

In reference to the ValidationException reported by @SecretSquirrel-origami:

I'm receiving a client error when working with 'get_df' on dynamo tables that have either GSI or LSI: "An error occurred (ValidationException) when calling the BatchGetItem operation: The provided key element does not match the schema"

Beyond the addition of a query_df function as discussed above, a few improvements could be made:

  1. Clarify the documentation of the get_df, transactions.get_item and transactions.get_items functions to indicate that the keys or key parameters apply only to the primary key.
  2. Potentially intercept the ValidationException and add the information that the keys or key parameter only works with the primary key. Also refer to the query function(s) when available.
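
For item 2, a rough sketch of how the error could be intercepted and re-raised with a clearer message (this wrapper is hypothetical, not existing library code):

    import boto3
    import botocore.exceptions
    
    
    def batch_get(table_name, keys):
        """Hypothetical wrapper showing how a ValidationException raised by
        batch_get_item could be re-raised with a more explicit message."""
        client = boto3.client("dynamodb")
        try:
            return client.batch_get_item(RequestItems={table_name: {"Keys": keys}})
        except botocore.exceptions.ClientError as error:
            if error.response["Error"]["Code"] == "ValidationException":
                raise ValueError(
                    "The keys must match the table's primary key schema. To "
                    "query a GSI or LSI, use a query function instead."
                ) from error
            raise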