andrii-itdev / InsightFlow


(Research) Determine what databases & architectures are used by global companies #19

Open andrii-itdev opened 1 month ago

andrii-itdev commented 1 month ago

What Databases do THEY use?

andrii-itdev commented 1 month ago

Netflix

Sources: how-netflix-adopted-nosql, netflixtechblog, netflix-media-database, netflix-at-cockroachdb

Director of Cloud and Systems Infrastructure Yury Izrailevsky explains how and why Netflix migrated some of its systems to NoSQL. “In the distributed world governed by Eric Brewer’s CAP theorem, high availability (a.k.a. better customer experience) usually trumps strong consistency,” he writes. “There is little room for vertical scalability or single points of failure.”

Netflix uses three NoSQL tools: SimpleDB, HBase, and Cassandra. “The reason why we use multiple NoSQL solutions is because each one is best suited for a specific set of use cases,” Izrailevsky writes. He writes that the learning curve has been steep and re-architecting the company’s systems has been difficult but “the scalability, availability, and performance advantages of the NoSQL persistence model are evident and are paying for themselves already, and will be central to our long-term cloud strategy.”

SimpleDB is highly durable, with writes automatically replicated across availability zones within a region. It also offers query and data-format features beyond a simple key/value interface, such as multiple attributes per row key, batch operations, and consistent reads.

Netflix uses HBase because it’s deeply integrated with Hadoop. Izrailevsky writes that the biggest advantage in using HBase is the ability to “combine real-time HBase queries with batch map-reduce Hadoop jobs, using HDFS as a shared storage platform.” He notes, however, that with HBase the company does have to sacrifice some availability for consistency.

Netflix uses Cassandra for its scalability, its lack of single points of failure, and its support for cross-regional deployments. “In effect, a single global Cassandra cluster can simultaneously service applications and asynchronously replicate data across multiple geographic locations.”

Users Service

Users Service would be mainly responsible for user authentication and profiles. This service would persist its data in a relational database like MySQL or PostgreSQL. This data needs strong ACID guarantees, so an RDBMS is a suitable choice.

Subscriptions Service

Subscriptions Service would manage users' subscriptions. Since the data processed by this service is highly transactional in nature, an RDBMS is a suitable choice.
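To make the ACID argument for the Users and Subscriptions services concrete, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for MySQL or PostgreSQL. The table layout and the `create_schema` and `sign_up` helpers are assumptions made for illustration, not anything Netflix has published.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create hypothetical users/subscriptions tables (illustrative only)."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE users (
            id    INTEGER PRIMARY KEY,
            email TEXT NOT NULL UNIQUE
        );
        CREATE TABLE subscriptions (
            id      INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL UNIQUE REFERENCES users(id),
            plan    TEXT NOT NULL
        );
        """
    )


def sign_up(conn: sqlite3.Connection, email: str, plan: str) -> int:
    """Insert a user and their subscription atomically.

    The `with conn:` block commits if both inserts succeed and rolls
    back if either raises, so a user never exists without a subscription.
    """
    with conn:
        cur = conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        user_id = cur.lastrowid
        conn.execute(
            "INSERT INTO subscriptions (user_id, plan) VALUES (?, ?)",
            (user_id, plan),
        )
    return user_id
```

If the second insert fails (for example because `plan` is `None` and violates `NOT NULL`), the user row is rolled back as well. That all-or-nothing behavior is what the notes mean by "highly transactional" data.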

andrii-itdev commented 1 month ago

GitHub

andrii-itdev commented 1 month ago

Dropbox

Sources: Medium - Unraveling Dropbox Magic Pocket; StackShare; Medium - System Design of Dropbox; System Design of Dropbox

Dropbox’s Databases: Where the Real Magic Happens

Storing files is only one part of the puzzle. Dropbox also needs to know who owns which file, who has permission to view or edit it, and how to track versions of the file. This is where databases come in.

Dropbox uses a range of databases, including MySQL for structured data and Edgestore, its internal key-value store, for handling file metadata. These databases work like massive spreadsheets, keeping tabs on every piece of information that passes through the system.

Dropbox uses a variety of technologies in its stack. At the core of its infrastructure, Dropbox uses Python for server-side code, along with other technologies such as Go, Rust, and Swift. For storage, Dropbox initially used Amazon S3 but later built its own distributed storage system called Magic Pocket. On the client side, Dropbox uses a mix of programming languages and technologies to support various platforms, including C++, Objective-C, and Java for Android. Additionally, Dropbox utilizes a range of open-source tools and frameworks to support its infrastructure and services.

Dropbox has verified its tech stack on StackShare. Under applications and data, it lists Nginx, MySQL, Python, Memcached, Amazon S3, Rust, and Hadoop.

Found an interesting talk by Kevin Modzelewski from Dropbox outlining the initial technology stack and the tradeoffs that Dropbox made for fast growth. It outlines the initial high-level architecture, covering their

For those that are curious about what we chose and why, the software we used was:

System Design highlights

The Client and Queue: Orchestrating File Uploads and Updates

On the client side, we’d have a ‘chunker’ that breaks large files into smaller chunks for efficient upload, an ‘indexer’ that monitors changes in the file system, and a local database storing metadata about each file.
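A rough sketch of what the client-side chunker could look like, assuming fixed-size chunks identified by their SHA-256 hash. The 4 MiB chunk size, the `Chunk` type, and the `chunk_bytes` and `changed_chunks` helpers are all hypothetical; the source does not say how Dropbox's real client chunks files.

```python
import hashlib
from typing import Iterator, NamedTuple

CHUNK_SIZE = 4 * 1024 * 1024  # assumed size, not taken from the source


class Chunk(NamedTuple):
    index: int    # position of the chunk within the file
    sha256: str   # content hash, usable as a chunk identifier
    data: bytes


def chunk_bytes(data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[Chunk]:
    """Split data into fixed-size chunks and hash each one."""
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        piece = data[offset:offset + chunk_size]
        yield Chunk(index, hashlib.sha256(piece).hexdigest(), piece)


def changed_chunks(old_hashes: list[str], new_chunks: list[Chunk]) -> list[Chunk]:
    """Return chunks that are new or whose hash differs at the same index."""
    return [
        c for c in new_chunks
        if c.index >= len(old_hashes) or old_hashes[c.index] != c.sha256
    ]
```

Comparing the hashes kept in the local metadata database against the freshly computed ones means only modified chunks need to be uploaded again, which is the efficiency gain the chunker is meant to provide.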

Synchronization Server: Ensuring Consistency Across Devices

The Synchronization Server fetches metadata updates from the queue and updates the server’s database, providing a consistent view of the file system across all devices.

The App Server: Interfacing with the Clients

An Application Server (App Server) plays a crucial role in handling client requests and responses. It could expose APIs for various functionalities such as user authentication, file upload and download, file metadata retrieval, sharing files or folders, and more.

Load Balancer: Distributing the Load

With millions of users worldwide, the system must efficiently distribute the incoming network traffic.

Edge Server: A Powerful Facade for Enhanced Performance

An Edge Server plays a vital role in the system by serving as a robust facade for database interactions. It is essentially a wrapper around MySQL databases that provides APIs for various database operations. Internally, the Edge Server leverages an Object-Relational Mapping (ORM) tool. An ORM tool helps in interacting with the database in an object-oriented manner. The Edge Server is also equipped with a caching mechanism. It stores frequently accessed data, which significantly reduces the load on the database and decreases latency, leading to improved performance.

CDN: Speeding Up Content Delivery

To further enhance the user experience, especially for geographically dispersed users, we could employ a Content Delivery Network (CDN). CDNs store cached versions of content in edge locations close to the user, resulting in reduced latency and faster content delivery. For instance, using Amazon CloudFront as a CDN would seamlessly integrate with our S3 storage.

Conclusion

The Choice of Database: MySQL over NoSQL

While NoSQL databases are known for their scalability, they aren’t the best fit for our use case because of their lack of strong consistency guarantees. A relational database like MySQL, combined with a database abstraction layer like Dropbox’s Edgestore, allows us to overcome scalability limitations while benefiting from the robustness and flexibility of SQL databases.

Meta Service is backed by Metadata DB.

This database contains the metadata of each file, such as its name, type (file or folder), sharing permissions, and chunk information. It should have strong ACID (atomicity, consistency, isolation, durability) properties, so a relational database like MySQL or PostgreSQL would be a good choice.

Since querying the database for every synchronization request is costly, an in-memory cache is put in front of the Metadata DB. Frequently queried data is served from this cache, which avoids a database query. The cache can be implemented using Redis or Memcached, with a write-around strategy: writes go straight to the database and bypass the cache, and the cache is filled on reads.
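The write-around strategy mentioned above can be sketched as follows. This is only an illustration: plain dictionaries stand in for the Metadata DB and for Redis or Memcached, and the `WriteAroundCache` class and its `db_reads` counter are hypothetical names.

```python
class WriteAroundCache:
    """Write-around caching: writes bypass the cache, reads fill it on a miss."""

    def __init__(self, store: dict):
        self.store = store   # stand-in for the Metadata DB
        self.cache = {}      # stand-in for Redis/Memcached
        self.db_reads = 0    # counts reads that had to hit the database

    def read(self, key):
        # Cache hit: serve from memory without touching the database.
        if key in self.cache:
            return self.cache[key]
        # Cache miss: read from the database and remember the result.
        self.db_reads += 1
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key, value):
        # Write goes directly to the database, not to the cache.
        self.store[key] = value
        # Drop any stale cached copy so the next read sees the new value.
        self.cache.pop(key, None)
```

The trade-off is that the first read after every write is a cache miss. In return, metadata that is written once and rarely read never occupies cache memory.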

Block Storage

Block Storage can be implemented using a distributed file system like GlusterFS or an object store like Amazon S3. Such systems provide high reliability and durability, making sure uploaded files are never lost. When Dropbox started, it used S3 as block storage. As it grew, however, it developed an in-house multi-exabyte storage system known as Magic Pocket. In Magic Pocket, files are split up into blocks, replicated for durability, and distributed across data centers in multiple geographic regions.
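As a toy illustration of "replicated and distributed across regions", the sketch below assigns each block to several distinct regions based on its content hash. The `place_block` function, the region names, and the hash-modulo placement rule are assumptions for illustration only; this is not how Magic Pocket actually places data.

```python
import hashlib


def place_block(block_hash: str, regions: list[str], replicas: int = 2) -> list[str]:
    """Pick `replicas` distinct regions for a block, deterministically.

    block_hash is a hex digest (e.g. SHA-256 of the block's bytes).
    """
    if replicas > len(regions):
        raise ValueError("not enough regions for the requested replica count")
    # Use the hash to choose a starting region, then take the next ones in order.
    start = int(block_hash, 16) % len(regions)
    return [regions[(start + i) % len(regions)] for i in range(replicas)]


# Hypothetical region names and block contents.
regions = ["us-east", "us-west", "eu-central"]
digest = hashlib.sha256(b"example block contents").hexdigest()
placement = place_block(digest, regions, replicas=2)
```

Because the placement depends only on the block's hash, any server can recompute where a block lives without a lookup table. A production system would also need to handle region failures and rebalancing.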

andrii-itdev commented 1 month ago

Slack

andrii-itdev commented 1 month ago

Asana

andrii-itdev commented 1 month ago

YouTube

andrii-itdev commented 1 month ago

GitLab

andrii-itdev commented 1 month ago

Twitter (X.com) Threads

andrii-itdev commented 1 month ago

Discord

andrii-itdev commented 1 month ago

Instagram

andrii-itdev commented 1 month ago

LinkedIn