Single Table Design: Support Patterns

bmalinconico commented 2 years ago

Hello,

I wanted to have a conversation about whether, and how, Dynamoid supports the common single-table design for DynamoDB.

The basic TL;DR is the existing STI plus some range key magic. The hash key gives you the ability to fetch all items that belong to a common thing. I'm going to use pizza as a somewhat contrived example.

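| PizzaId (HK) | ResourceId (RK) | type (required for deserialization) | Name | Cost | Quantity | LeftHalf | RightHalf |
| --- | --- | --- | --- | --- | --- | --- | --- |
| a | pizza | Pizza | Combo | 13.99 | | | |
| a | topping-onions | PizzaTopping | Onion | | Heavy | true | true |
| a | topping-pepperoni | PizzaTopping | Pepperoni | | Light | false | true |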

This breaks down with the Dynamoid STI implementation.

I'd like to open a conversation about updating the ORM to allow for this pattern and want to toss around some options then identify what would need to be updated to work.

You can query for all toppings on a pizza by using the pizza ID and an RK prefix.
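To make that query concrete, here is a plain-Ruby, in-memory sketch. It does not talk to DynamoDB or use Dynamoid. The `ITEMS` array, the second pizza `b`, and the `query_by_prefix` helper are hypothetical and exist only to mimic what a key condition of "hash key equals X and range key begins with Y" would return.

```ruby
# In-memory stand-in for the table above; not a real DynamoDB client.
# Pizza "b" is made up to show that other partitions are left out.
ITEMS = [
  { pizza_id: "a", resource_id: "pizza",             type: "Pizza",        name: "Combo" },
  { pizza_id: "a", resource_id: "topping-onions",    type: "PizzaTopping", name: "Onion" },
  { pizza_id: "a", resource_id: "topping-pepperoni", type: "PizzaTopping", name: "Pepperoni" },
  { pizza_id: "b", resource_id: "pizza",             type: "Pizza",        name: "Margherita" }
].freeze

# Hypothetical helper: hash key equality plus a range key prefix match,
# returned in range key order, similar to how a query orders its results.
def query_by_prefix(items, pizza_id, prefix)
  items
    .select { |item| item[:pizza_id] == pizza_id && item[:resource_id].start_with?(prefix) }
    .sort_by { |item| item[:resource_id] }
end

toppings = query_by_prefix(ITEMS, "a", "topping-")
puts toppings.map { |item| item[:name] }.inspect
```

Against a real table, the same condition would presumably be written with Dynamoid's `begins_with` operator on the range key, but that is outside what this sketch exercises.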

First Idea

Allow a range prefix to be specified

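# Proposed API sketch; neither option exists in Dynamoid today.
class PizzaTopping
  include Dynamoid::Document

  range :resource_id, prefix_on_persistance: "topping-"
end

# There is also a need to "fix" the RK for single-instance objects.
class Pizza
  include Dynamoid::Document

  range :resource_id, fixed_value: "pizza"
end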

andrykonchin commented 2 years ago

Hi,

In short, Dynamoid doesn't explicitly support anything related to Single-Table design. Why? I suppose because its goal is to implement the ActiveRecord pattern.

It seems to me that the classic ActiveRecord pattern contradicts the ideas of Single-Table design, with its kind of schemaless/multi-schema items. But I can easily imagine that the Single-Table design approach could be implemented on top of an ORM like Dynamoid, or on top of any other DynamoDB client.

That is, I am not against supporting Single-Table design in Dynamoid. But I see benefits in separating the new features from the existing ActiveRecord-like approach. How strong should this separation be? I don't know right now. It probably depends on how natural specific features look from the point of view of the ActiveRecord approach.

Regarding the proposed range prefix feature: I am not familiar with the concrete patterns of Single-Table design and don't know whether such range prefixes are a common, well-known pattern. Could you please point to resources that describe such patterns?

thomaswitt commented 2 years ago

I totally second @bmalinconico's idea. Using a prefix in range keys to differentiate between types of data is THE access pattern in DynamoDB, especially when the prefix is combined with a date, like "FUEL_PRICE#2022-09-19" … I'd really appreciate it if Dynamoid supported this out of the box.

The advantages are obvious, especially in terms of pricing/capacity. Having lots of small tables, each with its own throttle settings, instead of one big table hinders application performance and simply costs money unnecessarily…

thomaswitt commented 2 years ago

> Regarding the proposed feature with a range prefix. I am not familiar with the concrete patterns of the Single-Table design and don't know whether such range prefixes are a common/well known pattern. Could you please point at resources that describe such patterns?

@andrykonchin - Here's the official DynamoDB doc: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-sort-keys.html

andrykonchin commented 2 years ago

@thomaswitt Thank you for the link. I think I now understand what you are talking about.

@bmalinconico Yep, it seems to be a common approach to have a synthetic, structured sort key. And it would be useful to support some predefined schemas.

On the other hand, options like prefix_on_persistance and fixed_value can be emulated with handwritten before_save hooks. So it would be a tiny enhancement.
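A minimal sketch of that emulation, in plain Ruby so it runs without Dynamoid or a table. The `RangeKeyRules` module and the `Topping` struct are hypothetical stand-ins. In a real model the body of `before_save` would be registered as a callback rather than called by hand.

```ruby
# Hypothetical helper module; nothing here is part of Dynamoid.
module RangeKeyRules
  # Emulates the proposed `prefix_on_persistance` option: add the prefix
  # once and leave already-prefixed values alone, so repeated saves are safe.
  def self.with_prefix(value, prefix)
    value.to_s.start_with?(prefix) ? value.to_s : "#{prefix}#{value}"
  end

  # Emulates the proposed `fixed_value` option: ignore whatever was set.
  def self.fixed(_value, fixed_value)
    fixed_value
  end
end

# Plain Struct standing in for a model; no persistence happens here.
Topping = Struct.new(:pizza_id, :resource_id) do
  # Stand-in for a before_save hook.
  def before_save
    self.resource_id = RangeKeyRules.with_prefix(resource_id, "topping-")
  end
end

topping = Topping.new("a", "onions")
topping.before_save
topping.before_save # a second save must not double the prefix
puts topping.resource_id
```

Making the hook idempotent is one possible design choice. It lets the same callback run on both create and update, at the cost of not being able to store a raw value that happens to start with the prefix.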

thomaswitt commented 2 years ago

@andrykonchin Yes, basically you could already write this with the current version, but I'd say convention over configuration is a very old tried-and-true mantra of Rails.

Apart from that, the way @bmalinconico described it is the way DynamoDB was originally intended to be used. I have run several huge DynamoDB-based applications with tons of data and gazillions of rows. That's the only way to keep it scaling, and basically every AWS engineer at an AWS summit will agree.

Dynamoid's current multi-table default has a lot of drawbacks when you put it into production. That's why I think the project should offer more defaults along the lines of @bmalinconico's idea.

The way STI is currently implemented, with the Type field, is basically a band-aid. It should be a prefix in the range key. That would also make lots of other things easier.

When I started using Dynamoid I ran into, for example, this problem: https://github.com/Dynamoid/dynamoid/issues/501. It would be much easier if, in that example, Company and Report had the same primary key, and you could then filter via the range key for what you really want to get: either the company data or the report data.

I started using Dynamoid because it was very convenient and, against my better knowledge, used the multi-table approach in the beginning. Later I had to painfully rewrite the whole app to use STI in a single table, with all model classes inheriting from a base class that defines the table. There is still a lot of code in my application like Model.where(id: id_to_search_for, 'metadata.begins_with' => 'REPORT#'), etc. With smart range keys like COMPANY# and COMPANY#REPORT you can easily get a company and all its reports with a single query.
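As a rough illustration of using the range key prefix instead of a Type field, the sketch below takes the rows a single query might return and builds typed objects from them by prefix. The row contents, the exact key layout (`COMPANY#REPORT#<date>`), and the names `BUILDERS` and `build` are assumptions made up for this example.

```ruby
# Hypothetical result of ONE query on id = "acme" with a range key that
# begins with "COMPANY#"; key layout and attributes are assumptions.
rows = [
  { id: "acme", metadata: "COMPANY#",                  name: "Acme Corp" },
  { id: "acme", metadata: "COMPANY#REPORT#2022-09-19", title: "Q3 forecast" },
  { id: "acme", metadata: "COMPANY#REPORT#2022-06-30", title: "Q2 results" }
]

Company = Struct.new(:id, :name)
Report  = Struct.new(:company_id, :date, :title)

# Longest prefix first, so "COMPANY#REPORT#" wins over "COMPANY#".
BUILDERS = {
  "COMPANY#REPORT#" => ->(row, rest) { Report.new(row[:id], rest, row[:title]) },
  "COMPANY#"        => ->(row, _rest) { Company.new(row[:id], row[:name]) }
}.freeze

# Picks the builder by range key prefix instead of by a separate type field.
def build(row)
  prefix, builder = BUILDERS.find { |candidate, _| row[:metadata].start_with?(candidate) }
  raise ArgumentError, "unknown range key: #{row[:metadata]}" unless prefix

  builder.call(row, row[:metadata].delete_prefix(prefix))
end

objects = rows.map { |row| build(row) }
company = objects.grep(Company).first
reports = objects.grep(Report).sort_by(&:date)
puts company.name
puts reports.map(&:title).inspect
```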

All in all - just my two cents, I'd really appreciate if the single table approach would be promoted more as at least one of two default approaches and easily supported within the software, without having to write hooks, etc.

andrykonchin commented 2 years ago

Could you elaborate a bit more on how the prefix is supposed to work? What value should it be added to? And at what moment: on creation only, or on every model update?

thomaswitt commented 2 years ago

@andrykonchin I would need to think longer about a full design, but here is how I would approach it if you configure Dynamoid that way (let's say via config.single_table_design = true):

- In this mode Dynamoid requires a range key to be present. I would go for defaults like key: :id and range: :metadata, as these are the defaults used by AWS, e.g. in the NoSQL Workbench modeler. You can of course still override them if you want other hash key and range key names.
- :id is generated by Dynamoid as a UUID when not specified, but can be overridden (id: 'Berlin').
- :metadata is based on created_at by default and is automatically expanded to "#{type.upcase}##{created_at}". So the prefix_on_persistance is by default the model name (type), but it can be overridden the way @bmalinconico proposed (or alternatively fixed, as he proposed as well), e.g. range :metadata, prefix_on_persistance: "CUSTOMPREFIX#". Also, when defining your model, there should be a reserved keyword like range_id, which defaults to created_at but can be overridden in case you want a range key like "USER#". You could then define your own range ID, like an email address, or change the range key from created_at to updated_at. It could even be dynamically expanded and chained, like 'COMMENT#poster@dynamoid.com#2022-03-12T00:22:33.144Z' when you set the range key to "{user_id}#{created_at}", etc.
- The table definition (name, capacity_mode) should live in a base class like DynamoidBase, and all models should inherit from this base class by convention (Employee < DynamoidBase).
- When I do a where search, I can either look for just the id and get multiple results, or use a helper function to look directly for hash key plus range key, like Comment.find('excellent-post-1234', 'poster@dynamoid.com'), which would expand to Comment.where(id: 'excellent-post-1234', metadata: 'COMMENT#poster@dynamoid.com').
- I would also potentially include intuitive helpers for querying range keys with time series data etc., for begins_with, gt, and so on. For example (not yet a well-thought-out API, just an idea):
  - Document.find('docset1234', '2022-03-12T00:22:33.144Z') -> Document.where(id: 'docset1234').where(metadata: '2022-03-12T00:22:33.144Z')
  - Document.find('docset1234', '2022-03-*') -> Document.where(id: 'docset1234').where('metadata.begins_with' => '2022-03-')
  - Address.find('Berlin', '10115', '10178') -> Address.where(id: 'Berlin').where('metadata.between' => [10115, 10178])
  - Address.find('Berlin', '>10115') -> Address.where(id: 'Berlin').where('metadata.gt' => 10115)
  - Address.find('Berlin', '>=10115') -> Address.where(id: 'Berlin').where('metadata.gte' => 10115)
  - Address.find('Berlin', '<10115') -> Address.where(id: 'Berlin').where('metadata.lt' => 10115)
  - Address.find('Berlin', '<=10115') -> Address.where(id: 'Berlin').where('metadata.lte' => 10115)
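The shorthand in the list above could be translated into `where` conditions roughly as follows. This is only a sketch of the idea, not an existing Dynamoid API: `range_condition`, `COMPARISONS`, and the `prefix:` option are hypothetical names, and values are kept as strings because how to type them is left open in the proposal.

```ruby
# Hypothetical mapping from the shorthand operators to condition suffixes.
# Two-character operators come first so ">=" is not read as ">".
COMPARISONS = { ">=" => "gte", "<=" => "lte", ">" => "gt", "<" => "lt" }.freeze

# Turns the range arguments of the proposed find(id, *range_args) into a
# condition hash. `prefix` models the type prefix such as "COMMENT#".
def range_condition(field, *args, prefix: "")
  return { "#{field}.between" => args.map { |arg| "#{prefix}#{arg}" } } if args.size == 2
  raise ArgumentError, "expected one or two range arguments" unless args.size == 1

  arg = args.first.to_s
  # A trailing "*" means "begins with everything before the star".
  return { "#{field}.begins_with" => "#{prefix}#{arg.chomp('*')}" } if arg.end_with?("*")

  operator = COMPARISONS.keys.find { |symbol| arg.start_with?(symbol) }
  return { "#{field}.#{COMPARISONS[operator]}" => "#{prefix}#{arg.delete_prefix(operator)}" } if operator

  # No wildcard and no operator: plain equality on the range key.
  { field.to_s => "#{prefix}#{arg}" }
end

puts range_condition(:metadata, "2022-03-*").inspect
puts range_condition(:metadata, "poster@dynamoid.com", prefix: "COMMENT#").inspect
```

One consequence of encoding operators in the string is that a range value which legitimately starts with `>` or `<`, or ends with `*`, cannot be looked up by equality through this shorthand. An explicit `where` call would still be needed for those.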

thomaswitt commented 1 year ago

@andrykonchin Hey Andrii, just checking in on whether you had time to think about those suggestions …

thomaswitt commented 10 months ago

@andrykonchin Just a little ping. Have you given those ideas some thought?

ckhsponge commented 9 months ago

I use STI with Dynamoid. My approach is to not have a range key and to use shared GSI columns with redundant data. GSI columns are set with before actions and can include values from multiple columns as needed. Much of this logic can be abstracted into a parent class, so using it in the models isn't excessive. I have 5 GSIs that are string-string and 5 that are string-number. If I needed pizzas created by user 1 sorted by timestamp, I would use a GSI with hash key and range key values like Pizza#User#1,2024-02-01. To further filter to store 2, you could have a GSI with Pizza#Store#2#User#1,2024-02-01.

| HK | Type | Name | Code | GSI_HK1 | GSI_RK1 |
| --- | --- | --- | --- | --- | --- |
| a#Pizza#pizza | Pizza | | | Pizza | 2024-02-01 |
| a#PizzaTopping#onions | PizzaTopping | Onions | onions | PizzaTopping | onions |
| a#PizzaTopping#pepperoni | PizzaTopping | Pepperoni | pepperoni | PizzaTopping | pepperoni |
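The composite GSI values described above could be assembled along these lines. This is a plain-Ruby sketch with a hash standing in for the model. The helper names `gsi_hash_key` and `assign_gsi!` and the attribute names `store_id`, `user_id`, and `created_on` are hypothetical.

```ruby
# Hypothetical helper for the shared-GSI approach: build the GSI hash key
# from the type followed by alternating attribute names and values.
def gsi_hash_key(type, scope = {})
  [type, *scope.flat_map { |name, value| [name.to_s.capitalize, value] }].join("#")
end

# Plain hash standing in for a model; a before action would run this
# just before the item is written.
def assign_gsi!(item)
  item[:gsi_hk1] = gsi_hash_key(item[:type], store: item[:store_id], user: item[:user_id])
  item[:gsi_rk1] = item[:created_on]
  item
end

pizza = assign_gsi!({ type: "Pizza", store_id: 2, user_id: 1, created_on: "2024-02-01" })
puts pizza[:gsi_hk1]
puts pizza[:gsi_rk1]
```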

If you don't like the redundant data and want to keep using range keys I suppose you could set the range key with a before action. You'd need to create new or override existing finders, however.

thomaswitt commented 9 months ago

@ckhsponge I understand your approach, but the range key was invented for a reason (also in terms of data distribution). The range key comes in very handy, especially since the idea in DynamoDB is that by default you don't delete, but rather insert new data to get a built-in history. In that sense Dynamoid is written in a way that tries to emulate ActiveRecord, but does not embrace the ideas of DynamoDB.

Unfortunately @andrykonchin doesn't seem to be open to or interested in building the alternative I described above, which is closer to how DynamoDB wants to be used, so I am considering writing my own lightweight gem adapter to embrace these concepts.

This also makes sense in combination with OpenSearch/Elasticsearch, which is currently a PITA as well, since most gems (like Searchkick) won't work out of the box. DynamoDB + OpenSearch is a very powerful combination, and it deserves to be supported for Rails out of the box, the way it's meant to be.

ckhsponge commented 9 months ago

@thomaswitt I like your motivation! I am also a fan of Opensearch.

If I understand correctly, using a GSI with all attributes projected could be wasting space but it does keep the items distributed as desired since it maintains a complete copy.

I do think you could accomplish what you need with an extension on top of Dynamoid, e.g. shared before-create actions and custom finders. Reworking the innards of Dynamoid to handle that would probably be possible as well, although it would be more involved.

bholzer commented 7 months ago

I want to use Dynamo the way it's designed and intended to be used, and I would love a Ruby library that makes these patterns easier to use. Unfortunately, Dynamoid is not the answer today. @thomaswitt please let me know if you make any efforts on your own adapter/gem. I would be happy to contribute.

thomaswitt commented 7 months ago

@bholzer I agree. Would you be open to throwing some ideas together and drafting an outline of what should be in scope for a Ruby library?

Dynamoid / dynamoid, issue #568