Open bprashanth opened 2 months ago
Our data is in the following pattern (relational)
State
|
----- District
|
-------- SubDistrict
|
---------- Village
Strong possibility of remaining relational
This is not strictly relational, it is hierarchical. But sql is good at handling hierarchical data too. List out some common queries. If joins and aggregates are common, it might further support your decision.
The main argument for nosql is flexibility at this point. Since you don't know which data sets you need to ingest and how to work that into the model. However you need to write code to do more complex joins in nosql. More effort than a simple select statement, and slower.
If we're choosing sql, MySQL or mariadb are probably the best options. You should be able to scan docs / read forums and figure this out.
Whatever prototype you're going to try, make sure it has the right kind of data and the right kind of queries so you can make an informed decision. In this regard, a full fledged django app with the wrong data/queries will be misleading.
Also think about scale. How will you scale the db to your data set. Meaning is the entire data set you will ever need to store small enough to always fit on 1 machine? What's the cost of that machine? What's the cost with SSD, and how does that line up with the managed service cost (eg cost of rds).
Finalize your comparison criterion. It sounds like you will have a very read heavy workload. Slow writes/ingestion are ok. That means caching will help a lot. Look at that as another dimension, ease of integration of a read cache into django for a given db backend.
Dimension | SQL (MySQL/MariaDB) | NoSQL (MongoDB/Other) |
---|---|---|
Data Structure Fit | Strongly relational, fits well with hierarchical data via foreign keys and joins. | NoSQL can store hierarchical data, but complex queries may require additional coding and slower performance. |
Scalability | Vertical scaling (on a single machine) is feasible; horizontal scaling needs sharding. | Better suited for horizontal scaling but may require more complex sharding strategies. |
Cost | Self-hosting on a VM can be cost-effective; managed services (RDS) add reliability but at a cost. | NoSQL managed services can be expensive, but self-hosting can reduce costs depending on infrastructure. |
Ease of Use/Django Integration | Well-integrated with Django, straightforward migrations, ORM support, and existing community tools. | Requires additional effort for complex queries; Django support exists but is less mature. |
Caching | SQL-based caching (e.g., Memcached, Redis) integrates well with Django. | NoSQL can also benefit from caching, but the implementation may vary depending on the data structure. |
Analytics | SQL databases offer strong support for analytics through complex queries and reporting tools. | NoSQL databases might offer built-in analytics but often require additional tools or custom implementations. |
Backup and Reliability | Managed SQL services (e.g., RDS) offer built-in backups and high reliability. | NoSQL managed services also offer backups, but self-hosting may require manual setup. |
Flexibility | Less flexible; schema changes can be challenging. | More flexible with schema changes, accommodating evolving data structures. |
Scale and flexibility seem to be the main pointers for nosql, given the above table and the fact that we already have a django app.
Can you put an upper limit on your scale? Max data size in 2 years, max reads or writes per second, making educated assumptions around your user base and data ingestion.
Using this estimate you can compare the cost of self hosting vs RDS.
Regarding flexibility, can you list out the data sets you might want to invest in the next 2 years? Are they different from the ones you already ingest?
A. If you minimize the use of foreign keys it will keep your migrations easy.
B. If you keep your foreign keys within a certain group, it will make sharding easier. Eg all your fks are only within a given state, then you can place each states data on a new vm or db. Then depending on where you get free credits, can shard your data even across different clouds if necessary.
C. For small datasets you can even add blob fields to certain tables and store json in those blob fields. It's terrible for performance in a SQL db, but helps by not requiring you to commit to a schema before you're sure about the foreign keys.
On Wed, Aug 28, 2024, 1:03 AM anmolsingh0219 @.***> wrote:
Dimension SQL (MySQL/MariaDB) NoSQL (MongoDB/Other) Data Structure Fit Strongly relational, fits well with hierarchical data via foreign keys and joins. NoSQL can store hierarchical data, but complex queries may require additional coding and slower performance. Scalability Vertical scaling (on a single machine) is feasible; horizontal scaling needs sharding. Better suited for horizontal scaling but may require more complex sharding strategies. Cost Self-hosting on a VM can be cost-effective; managed services (RDS) add reliability but at a cost. NoSQL managed services can be expensive, but self-hosting can reduce costs depending on infrastructure. Ease of Use/Django Integration Well-integrated with Django, straightforward migrations, ORM support, and existing community tools. Requires additional effort for complex queries; Django support exists but is less mature. Caching SQL-based caching (e.g., Memcached, Redis) integrates well with Django. NoSQL can also benefit from caching, but the implementation may vary depending on the data structure. Analytics SQL databases offer strong support for analytics through complex queries and reporting tools. NoSQL databases might offer built-in analytics but often require additional tools or custom implementations. Backup and Reliability Managed SQL services (e.g., RDS) offer built-in backups and high reliability. NoSQL managed services also offer backups, but self-hosting may require manual setup. Flexibility Less flexible; schema changes can be challenging. More flexible with schema changes, accommodating evolving data structures.
— Reply to this email directly, view it on GitHub https://github.com/WELLlabs/JaltolAPI/issues/8#issuecomment-2313355571, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACTL5TA3SUPR2FI5GGIRSGLZTTIAXAVCNFSM6AAAAABNDNFCBOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJTGM2TKNJXGE . You are receiving this because you authored the thread.Message ID: @.***>
RDS works within cost budget (free tier). Does RDS require an "rds instance" that costs money?
AI: File a bug to reevaluate RDS usage.
Something to reevaluate in ~6m because we think the free tier is for 12m. For now we can close this issue as well.
Free Tier AWS offers a free tier for Amazon RDS for the first 12 months after you sign up for AWS. With the free tier, you get the following benefits:
750 hours of RDS Single-AZ db.t3.micro instance usage per month (for MySQL, PostgreSQL, MariaDB). 20 GB of storage (general purpose SSD). 20 GB of backup storage.
Data Transfer: The first 1 GB of data transfer out to the internet is free each month. Afterward, it's around $0.09 per GB (varies by region).
Can we afford the cost of RDS forever? (i.e after the 12 months?) If not, let's start to investigate EBS/EFS. A simple cost/benefit would help here too, just like in the ELB issue.
Cost of RDS at current read/write/gb capacity: ? Cost of EBS disk for same capacity + 0.1 human + ec2 t3.micro: ?
(let's assume you will need 0.15 human (i.e 10% of your time) or about 2 days a month to manage mysql on EBS. That could be an over estimate)
The cost difference might justify moving to ebs or staying on rds.
At the end of the exercise we want to choose between sql and no sql, and further pick between different options, hosted and self hosting/maria vs mysql/mongo vs xyz.
Ideally (not a hard requirement, just helpful) a tabulated comparison with different dimensions. Eg: