aws / aws-sdk-ruby-record

Official repository for the aws-record gem, an abstraction for Amazon DynamoDB.
Apache License 2.0

Query yielding heterogeneous results #107

Status: Closed (sereneiconoclast closed this issue 3 years ago)

sereneiconoclast commented 4 years ago

From what I can tell, currently all queries are performed through a specific model type, and must therefore return only records of that type.

I'd like to execute a single query that can return records of mixed types:

{hk: "me@here.com", rk: "PhoneNumber 123-555-1234", ... more fields specific to model type 'User'}
{hk: "me@here.com", rk: "Order 1000001", ... more fields specific to model type 'Order'}
{hk: "me@here.com", rk: "Order 1000002", ... more fields specific to model type 'Order'}
{hk: "me@here.com", rk: "Review 5000", ... more fields specific to model type 'Review'}

Instead of calling User.query(...), I would then call BaseTable.query(...) (as per issue 92) and pass a Proc whose job is to examine the Hash of raw attribute values, and return a reference to the appropriate child class to instantiate (User, Order, Review). In the example above, I'd probably do that by looking at the first word of the range key. It could do something else instead, such as attempting to match each range key against a regex ("this looks like a phone number"), or switching based on some other attribute (item_type=="User"), or going by which attributes are present and which aren't.
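A minimal sketch of such a selector Proc, assuming the range key always begins with the type name as in the example items above. The `User`, `Order`, and `Review` classes here are plain Struct stand-ins rather than real Aws::Record models, and `MODEL_BY_PREFIX` and `select_model` are hypothetical names used only for illustration:

```ruby
# Stand-ins for the real model classes; in practice these would include Aws::Record.
User   = Struct.new(:hk, :rk)
Order  = Struct.new(:hk, :rk)
Review = Struct.new(:hk, :rk)

# Hypothetical mapping from the first word of the range key to a model class.
MODEL_BY_PREFIX = {
  "PhoneNumber" => User,
  "Order"       => Order,
  "Review"      => Review
}.freeze

# Examines the raw attribute Hash and returns the class to instantiate,
# or nil when the range key is missing or its prefix is not recognized.
select_model = lambda do |raw_attributes|
  prefix = raw_attributes[:rk].to_s.split(" ", 2).first
  MODEL_BY_PREFIX[prefix]
end

raw_items = [
  { hk: "me@here.com", rk: "PhoneNumber 123-555-1234" },
  { hk: "me@here.com", rk: "Order 1000001" },
  { hk: "me@here.com", rk: "Review 5000" }
]

# One model class per raw item, in the order the items were found.
chosen = raw_items.map { |item| select_model.call(item) }
```

The regex-based or `item_type`-based variants mentioned above would only change the body of the lambda; the contract (raw Hash in, class or nil out) stays the same.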

Does this sound reasonable?

awood45 commented 4 years ago

This is definitely a use case that would be helpful to support. My thought is to provide an enumeration where, on each iteration, you signal which model class should be used, using whatever logic you wish. On mobile now, but I can sketch out an example soon.

awood45 commented 4 years ago

Here's how I imagined this. Let's pretend we have a couple of model classes sharing a single table:

class Project
  include Aws::Record
  set_table_name(ENV["TABLE_NAME"])

  string_attr :uuid, hash_key: true
  string_attr :table_name, range_key: true

  string_attr :project_name
end

class Task
  include Aws::Record
  set_table_name(ENV["TABLE_NAME"])

  string_attr :uuid, hash_key: true
  string_attr :table_name, range_key: true

  string_attr :task_name
  string_attr :parent_project_uuid
  string_attr :status
end

Fairly simple example, but we could then run this against any table class:

scan = Project.build_scan.multi_model_filter do |raw_item_attributes|
  if raw_item_attributes[:table_name] == "PROJECT"
    Project
  elsif raw_item_attributes[:table_name] == "TASK"
    Task
  else
    nil
  end
end

What I'm imagining here is that we let you pass in a block (rather than calling complete!, for example), and the block returns the model class based on any manipulation of the raw item that you like, or nil if no model applies and the item should be skipped. This could also apply to built queries, though as a limitation, you need some sort of model class to use as a starting point. That seems like a reasonable compromise, since you could have a base class for single-table query building as needed.

awood45 commented 4 years ago

I should add: when you run scan.each etc., the items in that enumeration would be instances of the appropriate class, as specified by the filter block.
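To illustrate the intended semantics only, here is a plain-Ruby sketch that does not touch aws-record or DynamoDB. `typed_items` is a hypothetical helper, and `ProjectStub` and `TaskStub` are Struct stand-ins for the models above; the eventual `multi_model_filter` may well differ in detail:

```ruby
# Plain stand-ins for the Project and Task models (no DynamoDB involved).
ProjectStub = Struct.new(:uuid, :table_name, :project_name, keyword_init: true)
TaskStub    = Struct.new(:uuid, :table_name, :task_name, keyword_init: true)

# Hypothetical helper: walks raw item Hashes, asks the block which class to
# build for each one, and skips items for which the block returns nil.
def typed_items(raw_items, &model_filter)
  Enumerator.new do |yielder|
    raw_items.each do |raw|
      model_class = model_filter.call(raw)
      next if model_class.nil? # nil means "no model applies, skip this item"
      yielder << model_class.new(**raw)
    end
  end
end

raw_items = [
  { uuid: "p1", table_name: "PROJECT", project_name: "Launch" },
  { uuid: "t1", table_name: "TASK",    task_name: "Write docs" },
  { uuid: "x1", table_name: "OTHER" }
]

# A case expression with no matching branch evaluates to nil, so "OTHER" is skipped.
items = typed_items(raw_items) do |raw|
  case raw[:table_name]
  when "PROJECT" then ProjectStub
  when "TASK"    then TaskStub
  end
end

items.each { |item| puts "#{item.class}: #{item.uuid}" }
```

Because the helper returns an Enumerator, nothing is built until the caller iterates, which mirrors the idea that the filter runs as items are enumerated.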

awood45 commented 4 years ago

So presumably, if you use this logic, you need to be prepared for heterogeneous sets, but you're opting in to that behavior anyways.

sereneiconoclast commented 4 years ago

Nice. So the build_scan or build_query is returning a builder as an intermediate result, and the multi_model_filter is augmenting it... similar to RSpec's syntax for programming mocks: expect(thing).to receive(:method_name).with(...).and_return(...)

An alternate style would be to accept a Proc as an optional argument, so you could write

BaseTable.query(...normal query terms...,
  select_model: ->(raw_attributes) { ... some logic returning Project, Task, or nil }
)

This doesn't look as clean as the style you suggested, but it's probably less work to implement. I'd be happy with either.
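For comparison, a rough sketch of the keyword-argument style. `fake_query` is a hypothetical stand-in that takes an Array of raw Hashes in place of a real DynamoDB call, and `ProjectRow` and `TaskRow` are illustrative Structs, not aws-record models:

```ruby
ProjectRow = Struct.new(:table_name, keyword_init: true)
TaskRow    = Struct.new(:table_name, keyword_init: true)

# Hypothetical stand-in for a query method that accepts the selector as an
# optional keyword argument instead of a chained builder call.
def fake_query(raw_items, select_model:)
  raw_items.filter_map do |raw|
    model_class = select_model.call(raw)
    # Safe navigation yields nil when no class was chosen; filter_map drops it.
    model_class&.new(**raw)
  end
end

picker = ->(raw) { { "PROJECT" => ProjectRow, "TASK" => TaskRow }[raw[:table_name]] }

results = fake_query(
  [{ table_name: "PROJECT" }, { table_name: "TASK" }, { table_name: "NOTE" }],
  select_model: picker
)
```

The selector has the same contract in both styles, so the choice is mostly about how the call site reads.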

> So presumably, if you use this logic, you need to be prepared for heterogeneous sets, but you're opting in to that behavior anyways.

Yes, the straightforward behavior would be to return a single array containing objects of various types, in whatever order they were found. It might be nicer for the consumer, perhaps, to return a Hash sorting them by type:

{
  Project => [project_1, project_2, project_3...],
  Task => [task_1, task_2...]
}

...since nearly everyone will, as a first step, be sorting through the results in this fashion.
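If the library returned only the flat collection, a consumer could get this shape with Ruby's built-in `Enumerable#group_by`. A small sketch, using Struct stand-ins (`ProjectItem`, `TaskItem`) in place of real models:

```ruby
ProjectItem = Struct.new(:name)
TaskItem    = Struct.new(:name)

# A mixed result set, in whatever order the items were found.
results = [ProjectItem.new("p1"), TaskItem.new("t1"), ProjectItem.new("p2")]

# Bucket the results by class; order within each bucket is preserved.
by_type = results.group_by(&:class)

by_type.each do |klass, items|
  puts "#{klass}: #{items.map(&:name).join(', ')}"
end
```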

Bonus: Allowing the block to return nil to mean "skip this" means this also functions similarly to aws dynamodb query --filter-expression.

alextwoods commented 4 years ago

I've got a draft PR (#108) that implements this - I'm still thinking through some behavior...

> It might be nicer for the consumer, perhaps, to return a Hash sorting them by type

Since results are returned page by page it would require iterating through the entire set to build a sorted Hash which in many cases isn't desirable.

awood45 commented 4 years ago

I'd say that's actually an unacceptable outcome; it should be returned one page at a time no matter what. Otherwise you can accidentally pull up millions of records.

sereneiconoclast commented 4 years ago

> Since results are returned page by page it would require iterating through the entire set to build a sorted Hash which in many cases isn't desirable.

I'm not sure I see what one has to do with the other. You can still return paginated results, it's just that each page would be a Hash with items bucketed by type.

If you don't think it desirable then you could certainly leave that part out. But I know that, as a customer, the first thing I'm going to do with each heterogeneous result set is to divide them up by item type, and I would expect most consumers would do likewise.
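A sketch of how per-page bucketing can coexist with pagination, in plain Ruby. The `pages` Enumerator is a hypothetical stand-in for a paginated response (each yielded Array plays the role of one page), and `fetched` exists only to show which pages were actually pulled:

```ruby
ProjectRec = Struct.new(:id)
TaskRec    = Struct.new(:id)

fetched = [] # records the index of every page that was actually pulled

# Hypothetical page source; a real one would issue one request per page.
pages = Enumerator.new do |yielder|
  [
    [ProjectRec.new(1), TaskRec.new(10), TaskRec.new(11)],
    [TaskRec.new(12), ProjectRec.new(2)]
  ].each_with_index do |page, index|
    fetched << index
    yielder << page
  end
end

# Each page becomes a Hash keyed by class; lazy keeps it one page at a time.
bucketed_pages = pages.lazy.map { |page| page.group_by(&:class) }

# Only the first page is pulled and bucketed here; the second is never touched.
first_page = bucketed_pages.first
```

Under these assumptions the bucketing is a per-page transformation, so it never forces the whole result set to be read.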

awood45 commented 4 years ago

Yes, true. The important part is returning one page at a time.