Dynamoid / dynamoid

Ruby ORM for Amazon's DynamoDB.
MIT License

Batching with more than 1000 doesn't have any effect? #735

Closed nbulaj closed 4 months ago

nbulaj commented 4 months ago

Hey :wave:

I have 14k records in Dynamo. Loading them in batches of 1000 records takes 14 requests, which takes a really long time. I've tried increasing the batch size to 5000, but I don't see any effect. Am I doing it incorrectly?

Model.where(group_id: group.id).count
=> 14036
Model.where(group_id: group.id).batch(5000).to_a

Produces:

[Aws::DynamoDB::Client 200 0.109639 0 retries] describe_table(table_name:"dynamo_db_table")  

[Aws::DynamoDB::Client 200 0.445154 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:5000,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil)  

[Aws::DynamoDB::Client 200 0.373942 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:5000,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil,exclusive_start_key:{"id"=>{s:"b284dfaa-f089-4ea2-a416-3ccdec3cf66e"},"group_id"=>{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"},"timestamp"=>{n:"1694126735.581"}})  

[Aws::DynamoDB::Client 200 0.328144 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:5000,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil,exclusive_start_key:{"id"=>{s:"cbb7693f-1e80-401e-98b7-11723067c9ed"},"group_id"=>{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"},"timestamp"=>{n:"1694549243.57"}})  

[Aws::DynamoDB::Client 200 0.448443 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:5000,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil,exclusive_start_key:{"id"=>{s:"9e4e0d12-d095-4eae-8db7-47ce202787dc"},"group_id"=>{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"},"timestamp"=>{n:"1694793587.021"}})  

[Aws::DynamoDB::Client 200 0.506392 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:5000,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil,exclusive_start_key:{"id"=>{s:"d258247b-908b-478a-9f54-da7897307f2a"},"group_id"=>{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"},"timestamp"=>{n:"1695312611.55"}})  

[Aws::DynamoDB::Client 200 0.33047 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:5000,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil,exclusive_start_key:{"id"=>{s:"8680ac1b-c8c8-4f33-ae53-41b4539634a2"},"group_id"=>{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"},"timestamp"=>{n:"1695757354.73"}})  

[Aws::DynamoDB::Client 200 0.333183 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:5000,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil,exclusive_start_key:{"id"=>{s:"21cc1acd-e76d-40ea-8b52-daba98d4fba7"},"group_id"=>{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"},"timestamp"=>{n:"1696005175.034"}})  

[Aws::DynamoDB::Client 200 0.307483 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:5000,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil,exclusive_start_key:{"id"=>{s:"1d5a4f50-ccee-438e-922a-21a76ca75e86"},"group_id"=>{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"},"timestamp"=>{n:"1696528486.424"}})  

[Aws::DynamoDB::Client 200 0.301908 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:5000,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil,exclusive_start_key:{"id"=>{s:"1650728f-8c77-4a06-be7b-aaaddd5bdc29"},"group_id"=>{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"},"timestamp"=>{n:"1698264788.455"}})  

[Aws::DynamoDB::Client 200 0.380183 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:5000,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil,exclusive_start_key:{"id"=>{s:"a56e019c-12df-4792-b1b6-6432321838ff"},"group_id"=>{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"},"timestamp"=>{n:"1699024195.01"}})  

[Aws::DynamoDB::Client 200 0.353776 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:5000,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil,exclusive_start_key:{"id"=>{s:"3499083b-6380-451f-8bac-7efaea522e02"},"group_id"=>{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"},"timestamp"=>{n:"1699991445.023"}})  

[Aws::DynamoDB::Client 200 0.478534 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:5000,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil,exclusive_start_key:{"id"=>{s:"f0afcf38-6078-4431-9ff5-c03be5c1f544"},"group_id"=>{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"},"timestamp"=>{n:"1701362453.669"}})  

[Aws::DynamoDB::Client 200 0.281876 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:5000,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil,exclusive_start_key:{"id"=>{s:"c71f22e9-d571-4d4a-9d90-987cf0ca9fac"},"group_id"=>{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"},"timestamp"=>{n:"1701808087.111"}})  

[Aws::DynamoDB::Client 200 0.243167 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:5000,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil,exclusive_start_key:{"id"=>{s:"249b43e0-741a-4b36-a3c6-90c6a8bece3e"},"group_id"=>{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"},"timestamp"=>{n:"1702486987.805"}})  

Even batching with .record_limit(5000).batch(5000) makes more requests than I expect :thinking:

Benchmark.measure { Model.where(group_id: group.id).record_limit(5000).batch(5000).to_a; nil }

[Aws::DynamoDB::Client 200 0.39229 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:5000,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil)  

[Aws::DynamoDB::Client 200 0.409715 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:4374,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil,exclusive_start_key:{"id"=>{s:"b284dfaa-f089-4ea2-a416-3ccdec3cf66e"},"group_id"=>{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"},"timestamp"=>{n:"1694126735.581"}})  

[Aws::DynamoDB::Client 200 0.38134 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:3551,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil,exclusive_start_key:{"id"=>{s:"cbb7693f-1e80-401e-98b7-11723067c9ed"},"group_id"=>{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"},"timestamp"=>{n:"1694549243.57"}})  

[Aws::DynamoDB::Client 200 0.347046 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:2296,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil,exclusive_start_key:{"id"=>{s:"9e4e0d12-d095-4eae-8db7-47ce202787dc"},"group_id"=>{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"},"timestamp"=>{n:"1694793587.021"}})  

[Aws::DynamoDB::Client 200 0.400678 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:1124,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil,exclusive_start_key:{"id"=>{s:"d258247b-908b-478a-9f54-da7897307f2a"},"group_id"=>{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"},"timestamp"=>{n:"1695312611.55"}})  

[Aws::DynamoDB::Client 200 0.225887 0 retries] query(consistent_read:false,scan_index_forward:true,index_name:"group_id-timestamp-index",limit:704,table_name:"dynamo_db_table",key_conditions:{"group_id"=>{comparison_operator:"EQ",attribute_value_list:[{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"}]}},query_filter:{},attributes_to_get:nil,exclusive_start_key:{"id"=>{s:"8680ac1b-c8c8-4f33-ae53-41b4539634a2"},"group_id"=>{s:"dc320944-dfe3-4e67-871d-38c973e7f1d5"},"timestamp"=>{n:"1695757354.73"}})  

andrykonchin commented 4 months ago

Yeah, it seems suspicious.

The only explanation that comes to mind is that a page of 5000 items exceeds the 1 MB limit, so the 5000-item limit simply isn't reached:

Limit The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed dataset size exceeds 1 MB before DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a key in LastEvaluatedKey to apply in a subsequent operation to continue the operation. For more information, see Query and Scan in the Amazon DynamoDB Developer Guide.

https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html#API_Query_RequestSyntax
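The interaction between Limit and the 1 MB cap described above can be illustrated with a small in-memory simulation. This is only a sketch: `fake_query` and `page_counts` are hypothetical helpers, the item count and the 1,400-byte item size are made-up numbers, and no AWS calls are made. The exact point at which DynamoDB cuts a page may differ slightly from this model.

```ruby
# In-memory simulation of Query paging; nothing here talks to DynamoDB.
ONE_MB = 1024 * 1024

# Return one "page": stop after `limit` items or once the page has reached
# 1 MB, whichever comes first. `start` plays the role of ExclusiveStartKey.
def fake_query(item_sizes, start:, limit:)
  bytes = 0
  count = 0
  while start + count < item_sizes.size && count < limit && bytes < ONE_MB
    bytes += item_sizes[start + count]
    count += 1
  end
  next_start = start + count
  # A nil key means there is nothing left to read.
  { count: count, last_evaluated_key: (next_start < item_sizes.size ? next_start : nil) }
end

# Follow last_evaluated_key until the result set is exhausted and
# return the number of items delivered by each page.
def page_counts(item_sizes, limit:)
  pages = []
  start = 0
  until start.nil?
    page = fake_query(item_sizes, start: start, limit: limit)
    pages << page[:count]
    start = page[:last_evaluated_key]
  end
  pages
end

sizes = Array.new(14_000, 1_400) # hypothetical: 14,000 items of 1,400 bytes each
puts page_counts(sizes, limit: 1_000).size
puts page_counts(sizes, limit: 5_000).size
```

With these assumed sizes both calls print the same number of pages, because each page fills up to 1 MB before either limit is reached. Only a limit smaller than the number of items that fit into 1 MB changes the request count.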

Could you evaluate the size of each fetched item?

To estimate the item size, the following rule should be used: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/CapacityUnitCalculations.html. For simplicity, I assume we can just serialize the item attributes into a JSON document:

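require 'json'

# Rough estimate: sum the JSON-serialized size (in bytes) of every item's attributes
size = 0
Model.where(group_id: group.id).batch(5000).each do |model|
  size += JSON.dump(model.attributes).bytesize
end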

Another way is to rely on ConsumedCapacity in the response to see how many capacity units were used, so we can tell whether the 1 MB limit was reached (for every page/Query request).
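As a rough sketch of that second approach: a read capacity unit covers 4 KB for a strongly consistent read and 8 KB for an eventually consistent one, so the units consumed by a page can be converted back into an approximate byte count. The helper names below are hypothetical. With the AWS SDK, the input number would come from a Query response requested with `return_consumed_capacity: "TOTAL"`. How to pass that option through Dynamoid is not covered here.

```ruby
FOUR_KB = 4 * 1024
ONE_MB  = 1024 * 1024

# Approximate number of bytes read for one page, derived from the consumed
# read capacity units. DynamoDB rounds up to 4 KB blocks, so this is an
# upper bound rather than an exact size.
def approx_bytes_read(capacity_units, consistent_read: false)
  # An eventually consistent read costs half a unit per 4 KB block.
  blocks = consistent_read ? capacity_units : capacity_units * 2
  (blocks * FOUR_KB).to_i
end

# True when the page appears to have been cut off by the 1 MB cap.
def hit_size_limit?(capacity_units, consistent_read: false)
  approx_bytes_read(capacity_units, consistent_read: consistent_read) >= ONE_MB
end

puts hit_size_limit?(128.0) # eventually consistent page that read about 1 MB
puts hit_size_limit?(60.5)  # eventually consistent page that read about half of that
```

Since the queries in the log above use consistent_read:false, a page that consumes around 128 units would suggest the size cap, not the item limit, ended the page.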

nbulaj commented 4 months ago

Thanks @andrykonchin. Yeah, the overall size is 18969308 bytes (~19 MB), so 1000 items take about 1-1.5 MB. Let me look into that some more.

nbulaj commented 4 months ago

BTW, is it possible to perform a BatchGetItem somehow using Dynamoid? :crossed_fingers: I see it has a class with the same name, but I'm not sure whether it can be used at the model level. I also see that it requires IDs :thinking:

andrykonchin commented 4 months ago

Yeah, BatchGetItem requires primary keys to be specified (documentation). It's used in the .find method when several ids are passed. There is a limit of 100 items per call, so it's just a variation of GetItem.
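To make the 100-item limit concrete, here is a minimal sketch of how a list of already known ids would have to be split into BatchGetItem-sized groups. `key_batches` is a hypothetical helper and the ids are made up. Dynamoid may already do this splitting internally, so this only illustrates the limit.

```ruby
BATCH_GET_ITEM_MAX_KEYS = 100

# Split ids into groups that each fit into a single BatchGetItem call.
# Duplicates are dropped first, since a key may appear only once per request.
def key_batches(ids, per_call: BATCH_GET_ITEM_MAX_KEYS)
  ids.uniq.each_slice(per_call).to_a
end

ids = (1..250).map { |n| "id-#{n}" } # made-up ids for illustration
batches = key_batches(ids)
puts batches.map(&:size).inspect

# Assumption: in an application each group could then be passed to the
# model's .find, for example:
#   records = batches.flat_map { |batch| Model.find(*batch) }
```

Fetching 14k records this way would still take well over a hundred calls, and the ids have to be known up front, so it does not replace a Query on the index.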

nbulaj commented 4 months ago

OK, in the end I really think we're just going beyond the limits, so it has nothing to do with Dynamoid. Thanks Andrii :bow: :ukraine: