SecretSquirrel-origami opened this issue 3 years ago
Hi @SecretSquirrel-origami, Thanks for the feedback.
I haven't tried using tables with GSI or LSI yet. I'll try to reproduce the issue when I get a minute.
After reviewing the boto3 DynamoDB docs, it appears that the DynamoDB `resource.batch_get_item` function used by the `get_df` function does not support GSIs or LSIs. The `client.query`, `client.scan`, `table.query` and `table.scan` functions do support the `IndexName` parameter required to specify an alternative index.
If that helps, you can use the boto3 functions to query your GSI or LSI and still benefit from the data type conversion classes in the `dynamo_pandas.serde` module, as in the example below.
Generate a table with a Global Secondary Index, using data from the election dataset included in the plotly library:
```python
import boto3
import botocore
import pandas as pd
import plotly

import dynamo_pandas as dp

# Load the dataset
df = plotly.data.election()

# Create the table
table_name = "elections"
ddb_client = boto3.client("dynamodb")
ddb_client.create_table(
    TableName=table_name,
    AttributeDefinitions=[
        {
            "AttributeName": "district",
            "AttributeType": "S",
        },
        {
            "AttributeName": "winner",
            "AttributeType": "S",
        },
    ],
    KeySchema=[
        {
            "AttributeName": "district",
            "KeyType": "HASH",
        }
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "winner",
            "KeySchema": [
                {
                    "AttributeName": "winner",
                    "KeyType": "HASH",
                },
            ],
            "Projection": {"ProjectionType": "ALL"},
            "ProvisionedThroughput": {
                "ReadCapacityUnits": 5,
                "WriteCapacityUnits": 5,
            },
        },
    ],
    ProvisionedThroughput={
        "ReadCapacityUnits": 5,
        "WriteCapacityUnits": 5,
    },
)

# Put the data in the table
dp.put_df(df, table=table_name)
```
Query the Global Secondary Index:
```python
from boto3.dynamodb.conditions import Key

ddb_resource = boto3.resource("dynamodb")
table = ddb_resource.Table(table_name)

items = table.query(
    IndexName="winner",
    KeyConditionExpression=Key("winner").eq("Joly"),
)["Items"]
```
This will return items with numerical values as `Decimal` type which, when loaded into a pandas dataframe, have `object` dtype instead of integers / floats:
```python
[{'winner': 'Joly',
  'Joly': Decimal('1908'),
  'district_id': Decimal('43'),
  'district': '43-Fort-Rolland',
  'total': Decimal('4438'),
  'result': 'plurality',
  'Coderre': Decimal('1325'),
  'Bergeron': Decimal('1205')}]
```
Convert the data types using the `dynamo_pandas.serde` serialization/deserialization classes and load into a pandas dataframe:
```python
from dynamo_pandas.serde import TypeSerializer, TypeDeserializer

ts = TypeSerializer()
td = TypeDeserializer()
items = td.deserialize(ts.serialize(items))
new_df = pd.DataFrame(items)
```
`new_df` now has `int64` dtype for the numerical columns:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   winner       14 non-null     object
 1   Joly         14 non-null     int64
 2   district_id  14 non-null     int64
 3   district     14 non-null     object
 4   total        14 non-null     int64
 5   result       14 non-null     object
 6   Coderre      14 non-null     int64
 7   Bergeron     14 non-null     int64
dtypes: int64(5), object(3)
memory usage: 1.0+ KB
```
I'll think about how I could implement a new `query_df` function that would leverage the boto3 `table.query` function and allow users to specify the index to use (`IndexName` parameter) and the query parameters (`KeyConditionExpression` parameter).
This function would have the advantage (over plain boto3 calls) of automatically handling paging and data type conversion of the results.
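For illustration, here is a minimal sketch of what such a function might look like. The names `query_all_pages` and `query_df` and their signatures are assumptions for this sketch, not the library's actual API; the third-party imports are kept inside `query_df` only so the paging helper stays dependency-free.

```python
def query_all_pages(table, **query_kwargs):
    """Run table.query repeatedly, following LastEvaluatedKey, and return
    the combined list of items.

    `table` is a boto3 DynamoDB Table resource (or any object exposing a
    compatible .query method).
    """
    items = []
    kwargs = dict(query_kwargs)
    while True:
        response = table.query(**kwargs)
        items.extend(response.get("Items", []))
        last_key = response.get("LastEvaluatedKey")
        if last_key is None:
            return items
        # Resume the next page where the previous one stopped.
        kwargs["ExclusiveStartKey"] = last_key


def query_df(table, **query_kwargs):
    """Hypothetical query_df: paged query plus data type conversion into a
    pandas dataframe (a sketch, not dynamo-pandas' implementation)."""
    import pandas as pd

    from dynamo_pandas.serde import TypeSerializer, TypeDeserializer

    items = query_all_pages(table, **query_kwargs)
    # Round-tripping through the serde classes converts Decimal values to
    # int / float so the dataframe gets proper numeric dtypes.
    items = TypeDeserializer().deserialize(TypeSerializer().serialize(items))
    return pd.DataFrame(items)
```

Usage would then look like `query_df(table, IndexName="winner", KeyConditionExpression=Key("winner").eq("Joly"))`, with paging handled transparently.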
@SecretSquirrel-origami, let me know if you think this would be useful.
Thanks for taking the time to investigate this further. Your suggested workaround using the boto3 functions and `table.query` is what I had to do, although your use of `dynamo_pandas.serde` is a useful snippet which would also help in the short term.
Regarding a new `query_df` function: yes, in my opinion that would be super useful. Thanks again!
I use a DynamoDB sort key for working with time series data, and being able to transparently transfer time series data between a pandas dataframe and DynamoDB would be really useful (and amazing). Is there any way you could implement features that allow get/put using partition + sort key, so that working with DynamoDB time series data would be easier? I.e. query key conditions such as eq | le | lt | ge | gt | begins_with | between.
@chrismrutherford, thanks for the feedback and suggestion. I'll consider including a `key_condition` parameter in the new `query_df` function I proposed above.
In the meantime, as I suggested to @SecretSquirrel-origami above, you can still leverage some of the library's functionality to perform the data type conversions for you.
In reference to the `ValidationError` reported by @SecretSquirrel-origami:

> I'm receiving a client error when working with `get_df` on dynamo tables that have either GSI or LSI: "An error occurred (ValidationException) when calling the BatchGetItem operation: The provided key element does not match the schema"

Beyond the addition of a `query_df` function as discussed above, a few improvements could be made:

- Update the docstrings of the `get_df`, `transactions.get_item` and `transactions.get_items` functions to indicate that the `keys` or `key` parameters apply only to the primary key.
- Catch the `ValidationError` and add the information that the `keys` or `key` parameter only work with the primary key. Also refer to the query function(s) when available.
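One possible shape for the second improvement is sketched below. This is not the library's actual code: the wrapper name, the `KEY_HINT` text, and the string match on the error message are assumptions (a real implementation would catch `botocore.exceptions.ClientError` and inspect its error code).

```python
KEY_HINT = (
    "Note: the `keys` / `key` parameters only work with the table's "
    "primary key. To query a global or local secondary index, use boto3's "
    "table.query with the IndexName parameter."
)


def call_with_key_hint(func, *args, **kwargs):
    """Call a get_df-style function and, when DynamoDB rejects the keys with
    a ValidationException, re-raise with a hint that only primary keys are
    supported."""
    try:
        return func(*args, **kwargs)
    except Exception as error:
        # botocore surfaces schema mismatches as a ClientError whose message
        # contains "ValidationException"; matching on the message keeps this
        # sketch dependency-free.
        if "ValidationException" in str(error):
            raise ValueError(f"{error}\n{KEY_HINT}") from error
        raise
```

The re-raised error keeps the original DynamoDB message and appends the actionable hint, which addresses the confusion reported in this issue.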
Hi, firstly this package looks like it could really make my life easier, so thanks for putting the time in! I'm not a DynamoDB expert, so sorry if this is a stupid error on my part. I'm receiving a client error when working with `get_df` on DynamoDB tables that have either a GSI or LSI: "An error occurred (ValidationException) when calling the BatchGetItem operation: The provided key element does not match the schema"
Following your examples, it's working for all tables that don't have a GSI or LSI; should I be using a different `keys` / query structure for those tables?