WELLlabs / JaltolAPI

MIT License
0 stars 0 forks source link

Figure out data model and choose backend db #8

Open bprashanth opened 2 months ago

bprashanth commented 2 months ago

At the end of the exercise we want to choose between sql and no sql, and further pick between different options, hosted and self hosting/maria vs mysql/mongo vs xyz.

Ideally (not a hard requirement, just helpful) a tabulated comparison with different dimensions. Eg:

type scale cost ease of use / django integration?
mongo
maria
balakumaran247 commented 2 months ago

Relational

Our data is in the following pattern (relational)

State
  |
  ----- District
            |
            -------- SubDistrict
                             |
                             ---------- Village

Tables

  1. State (primary key)
  2. District (primary key, state foreign key)
  3. Sub District (primary key, district foreign key)
  4. Village (primary key, sub district foreign key)

Addition

  1. Right now population info for village is acquired from GEE
  2. we may need to include all the Census data in DB

In Future:

Strong possibility of remaining relational

Scale

Analytics

In Future:

To-Do

bprashanth commented 2 months ago

This is not strictly relational, it is hierarchical. But sql is good at handling hierarchical data too. List out some common queries. If joins and aggregates are common, it might further support your decision.

The main argument for nosql is flexibility at this point. Since you don't know which data sets you need to ingest and how to work that into the model. However you need to write code to do more complex joins in nosql. More effort than a simple select statement, and slower.

If we're choosing sql, MySQL or mariadb are probably the best options. You should be able to scan docs / read forums and figure this out.

Whatever prototype you're going to try, make sure it has the right kind of data and the right kind of queries so you can make an informed decision. In this regard, a full fledged django app with the wrong data/queries will be misleading.

Also think about scale. How will you scale the db to your data set. Meaning is the entire data set you will ever need to store small enough to always fit on 1 machine? What's the cost of that machine? What's the cost with SSD, and how does that line up with the managed service cost (eg cost of rds).

Finalize your comparison criterion. It sounds like you will have a very read heavy workload. Slow writes/ingestion are ok. That means caching will help a lot. Look at that as another dimension, ease of integration of a read cache into django for a given db backend.

anmolsingh0219 commented 2 months ago
Dimension SQL (MySQL/MariaDB) NoSQL (MongoDB/Other)
Data Structure Fit Strongly relational, fits well with hierarchical data via foreign keys and joins. NoSQL can store hierarchical data, but complex queries may require additional coding and slower performance.
Scalability Vertical scaling (on a single machine) is feasible; horizontal scaling needs sharding. Better suited for horizontal scaling but may require more complex sharding strategies.
Cost Self-hosting on a VM can be cost-effective; managed services (RDS) add reliability but at a cost. NoSQL managed services can be expensive, but self-hosting can reduce costs depending on infrastructure.
Ease of Use/Django Integration Well-integrated with Django, straightforward migrations, ORM support, and existing community tools. Requires additional effort for complex queries; Django support exists but is less mature.
Caching SQL-based caching (e.g., Memcached, Redis) integrates well with Django. NoSQL can also benefit from caching, but the implementation may vary depending on the data structure.
Analytics SQL databases offer strong support for analytics through complex queries and reporting tools. NoSQL databases might offer built-in analytics but often require additional tools or custom implementations.
Backup and Reliability Managed SQL services (e.g., RDS) offer built-in backups and high reliability. NoSQL managed services also offer backups, but self-hosting may require manual setup.
Flexibility Less flexible; schema changes can be challenging. More flexible with schema changes, accommodating evolving data structures.

Important

bprashanth commented 2 months ago

Scale and flexibility seem to be the main pointers for nosql, given the above table and the fact that we already have a django app.

  1. Can you put an upper limit on your scale? Max data size in 2 years, max reads or writes per second, making educated assumptions around your user base and data ingestion.

  2. Using this estimate you can compare the cost of self hosting vs RDS.

  3. Regarding flexibility, can you list out the data sets you might want to invest in the next 2 years? Are they different from the ones you already ingest?

A. If you minimize the use of foreign keys it will keep your migrations easy.

B. If you keep your foreign keys within a certain group, it will make sharding easier. Eg all your fks are only within a given state, then you can place each states data on a new vm or db. Then depending on where you get free credits, can shard your data even across different clouds if necessary.

C. For small datasets you can even add blob fields to certain tables and store json in those blob fields. It's terrible for performance in a SQL db, but helps by not requiring you to commit to a schema before you're sure about the foreign keys.

On Wed, Aug 28, 2024, 1:03 AM anmolsingh0219 @.***> wrote:

Dimension SQL (MySQL/MariaDB) NoSQL (MongoDB/Other) Data Structure Fit Strongly relational, fits well with hierarchical data via foreign keys and joins. NoSQL can store hierarchical data, but complex queries may require additional coding and slower performance. Scalability Vertical scaling (on a single machine) is feasible; horizontal scaling needs sharding. Better suited for horizontal scaling but may require more complex sharding strategies. Cost Self-hosting on a VM can be cost-effective; managed services (RDS) add reliability but at a cost. NoSQL managed services can be expensive, but self-hosting can reduce costs depending on infrastructure. Ease of Use/Django Integration Well-integrated with Django, straightforward migrations, ORM support, and existing community tools. Requires additional effort for complex queries; Django support exists but is less mature. Caching SQL-based caching (e.g., Memcached, Redis) integrates well with Django. NoSQL can also benefit from caching, but the implementation may vary depending on the data structure. Analytics SQL databases offer strong support for analytics through complex queries and reporting tools. NoSQL databases might offer built-in analytics but often require additional tools or custom implementations. Backup and Reliability Managed SQL services (e.g., RDS) offer built-in backups and high reliability. NoSQL managed services also offer backups, but self-hosting may require manual setup. Flexibility Less flexible; schema changes can be challenging. More flexible with schema changes, accommodating evolving data structures.

— Reply to this email directly, view it on GitHub https://github.com/WELLlabs/JaltolAPI/issues/8#issuecomment-2313355571, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACTL5TA3SUPR2FI5GGIRSGLZTTIAXAVCNFSM6AAAAABNDNFCBOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJTGM2TKNJXGE . You are receiving this because you authored the thread.Message ID: @.***>

bprashanth commented 2 months ago

RDS works within cost budget (free tier). Does RDS require an "rds instance" that costs money?

bprashanth commented 2 months ago

AI: File a bug to reevaluate RDS usage.

  1. Are we charged for the micro instance?
  2. What about after 12m?
  3. Are the planned datasets going to always fit in the free tier storage/bandwidth limits, and is the free tier forever?

Something to reevaluate in ~6m because we think the free tier is for 12m. For now we can close this issue as well.

anmolsingh0219 commented 1 month ago

Free Tier AWS offers a free tier for Amazon RDS for the first 12 months after you sign up for AWS. With the free tier, you get the following benefits:

750 hours of RDS Single-AZ db.t3.micro instance usage per month (for MySQL, PostgreSQL, MariaDB). 20 GB of storage (general purpose SSD). 20 GB of backup storage.

Data Transfer: The first 1 GB of data transfer out to the internet is free each month. Afterward, it's around $0.09 per GB (varies by region).

bprashanth commented 1 month ago

Can we afford the cost of RDS forever? (i.e after the 12 months?) If not, let's start to investigate EBS/EFS. A simple cost/benefit would help here too, just like in the ELB issue.

Cost of RDS at current read/write/gb capacity: ? Cost of EBS disk for same capacity + 0.1 human + ec2 t3.micro: ?

(let's assume you will need 0.15 human (i.e 10% of your time) or about 2 days a month to manage mysql on EBS. That could be an over estimate)

The cost difference might justify moving to ebs or staying on rds.