abhishek-ch/around-dataengineering
A Data Engineering & Machine Learning Knowledge Hub
A very long, never-ending learning journey around Data Engineering & Machine Learning
New Tech
Dragonfly is a faster alternative to Redis or Memcached that I recently tried.
Interesting Reads
How to choose a Distributed Database
Cockroach DB Architecture
Amundsen Review
Deep Dive - Foundation DB
The What, Why, and When of Single-Table Design with DynamoDB
How To Manage And Monitor Apache Spark On Kubernetes
Git is hard: screwing up is easy, and figuring out how to fix your mistakes is fucking impossible
8 Practical Use Cases of Change Data Capture
Apache Iceberg - Links
Kubernetes Port Forwarding Manager
Querying Parquet with Precision using DuckDB - Much faster compared to Pandas
What is Apache Pinot - Usecases & Architecture
Change Data Streaming Patterns in Distributed Systems
Cuckoo Hashing - An alternative to chaining and linear probing for collision handling
Riak Database
Database Indexing
Parallel Databases using Map Reduce
REST vs GraphQL
Linux Namespace & Control Group (cgroup)
SQL Lexical Structure
Everything about the Linux kernel
Weekly Digest
How #dataengineering gets complicated over time
What is eBPF - Sandboxing Programs inside #linux Kernel
Absolute Basic Explanation of SSTable & Log Structured Merge Trees - Sorted String Table & Faster Random Writes
The Data Engineering
Level 0
Getting started with #dataengineering Volume 6
Getting started with Dataengineering Volume 5
Getting started with Data Engineering, volume 4
Getting started with Data Engineering, volume 3
Getting started with Data Engineering, volume 2
Getting started with Data Engineering, volume 1
Getting started with #dataengineering from basics
Apache Airflow 2.0
Some Interesting essentials while learning Apache Airflow
Dagster Release 0.10.0 - Everything about Exactly-once, Fault-Tolerant Scheduling - Extremely Important Release
#getdbt or Data Build Tools interface across all major Data Workflow Management Platform
Apache Superset - An #opensource Fully Featured Business Intelligence Application
The Hop Orchestration Platform, or Apache #Hop (Incubating), aims to facilitate all aspects of data and metadata orchestration
Apache Iceberg Partitioning is way better than Hive! Hidden Partitioning makes everything easier!
Trino aka #prestosql is different from Apache Spark SQL - Exclusively designed for Distributed SQL
Apache Spark is NOT a Map but an MPP/MPI Engine
Apache Hudi - Design Principles
OpenTelemetry specification V1.0
Everything Around PySpark Pandas UDF
Important skill-set of a Dataengineer - Reduce Cost
Everything on PyFlink - Python with Apache Flink
Delta Lake Cheat Sheet
Dataengineering schedule breakdown, a very flexible estimate
Parquet - Introduction & Design, An OpenSource File Format
SQL - Avoiding Antipatterns
Explaining Apache Kafka - In children's book format
The Perfect #dataengineering: Top INVALID Reasons behind #datapipelines failures
What is ETL
What is Proxy & Reverse Proxy
Level 1
DataEngg Skills to work with DataScience
Data Quality, A necessity for Data Driven Projects
Essential Cloud Skills for Data Engineering
Open Source Technologies in Data Engineering
Kubernetes Fundamentals Required as a Data Engineer
Apache Superset, OSS Business Intelligence for 2021
#apachekafka as a Database - Summary of both sides, Arguments, Trade-offs & exceptional quotes
Processing Guarantees in #apachekafka - The best resource
Change Data Analysis with Debezium and Apache Pinot
Optimizing Apache Kafka Producers & Consumers
Redpanda - A non-JVM Streaming Platform for mission critical workloads
Apache Hudi - Turn Batch Jobs to Incremental Model | Complete file management on a Data Lake
Apache Iceberg - an open table format for huge analytic datasets
Ballista - Distributed computing platform built primarily on Rust and powered by Apache Arrow
ZooKeeper, a distributed, open-source coordination service for distributed applications
Apache Iceberg - Partition Evolution, it's simple but it's so amazing
Apache Kafka without ZooKeeper Sneak Peek
Why Data Discovery is important for Data Engineering
Queue vs Log - Event driven Architecture
Database Indexing
Level 1.1
Multiple criteria search at scale with Apache Pinot & Theta Sketches
VM vs Containers - Similar but Different
State of Trino aka PrestoSQL
ETL is an extremely important component for any modern business
Top 5 ways to complicate a #dataengineering pipeline/application
Leader election is commonly used aka Master/Namenode/Leader/Driver
Dagster vs Airflow - A comparison
About Single Source of Truth in DataEngineering
Change Data Capture for Distributed Databases
Deep Dive on Why Apache Iceberg for Change Data Capture, using Apache Flink
OpenMetadata is an Open Standard for Metadata. A Single place to Discover, Collaborate, and Get your data right
About Lakehouse
etcd - A distributed, reliable key-value store for the most critical data of a distributed system
What is Redis
What is Hive
What is Data Warehouse - An Introduction
Fundamentals of Designing Data Warehouse
Database Relational Model - A way of looking at Data
Data Engineering Infrastructure Notes
Dataengineering Core
A Data Engineering Story - The Beginning
Data Engineering - More towards Data Science or Data Analytics or ...
Data Engineering Interview Patterns
Basic Checklists while learning Apache Spark
#apachespark for Distributed Analytics or #businessintelligence Platform - Worth it or not?
Apache Beam for Search: An Introduction & Addressing the challenge of the Time Problem
Nextflow is a Workflow Manager exclusively for #bioinformatics
#apachespark Project Zen Update - Making PySpark Better
Design - Exactly Once Delivery & Transactional Messaging in #apachekafka
Underrated but important skill of a Data Engineer
Fallacies of Distributed Systems
As a Data Engineer, some Essentials I did which really helped Data Scientists and the Team
A very normal Data Engineering work
What can go wrong in Distributed Data Systems
Architect and build a #machinelearning use case end to end using Amazon SageMaker
Around Data Discovery or Metadata Management Platforms
Amazon S3 Object Lambda - Provide Different Views of Data to Multiple Applications
Full Stack Data Engineer
Data cleaning is Hard but why
Most exciting things about #dataengineering
The real impact of Disks on #rocksdb State Backend in Apache Flink
Tips for Distributed System High Availability
Interesting way of collaboration between a Data Engineer & Data Scientist
Building DistributedLog: High-performance replicated log service
Whiz: Data Analytics Execution Framework based on Intermediate Data
Adding unlimited Nodes in a #dataengineering platform will eventually drop
A typical Data Engineering Pipeline
'Log' is a fundamental component of a Data Engineering Ecosystem
Flink CDC
Readings Around Databases
Code Review Best Practice, because developers hate code reviews
Important Performance Criteria to measure DataEngineering Systems
Database Internals - Storage
Data Integration for Databases & Data Warehousing - An Introduction
What is Protocol Buffer - An excellent important data interchange format for serialization, "Zero Copy" format
Memcached, Redis & Elasticache - To accelerate your data or databases
What is LSM-Tree
Tor aka Onion Router - How does it work?
Infrastructure
SQL Database on Kubernetes - Best Practices
Devtron - An Open Source DevOps on Kubernetes, written in Go
Most Popular #opensource BI & Data Analytics Platforms
datapipelines Dataframe API is now available with #apachebeam
Disaster Recovery for Multi-Region Apache Kafka & Data Consumption using #apacheflink
Kubernetes API Structure
Architecting a Kubernetes Infrastructure
Exploring Kubernetes Operator Pattern
Docker is an integral part of Data Engineering ML pipeline & that makes security extremely essential
Rack awareness for #apachekafka Streams Proposal
Dolt is Git for Data
Toward Better Data Culture From First Principles by Uber
Fast and Reliable Schema-Agnostic Log Analytics Platform by Uber
Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
Diving Deep on S3 Consistency - Insightful
Ray - General Purpose ML Infrastructure
Kubernetes Hardening Guide by National Security Agency
Everything around Load Balancer
Data Lakehouse - Is it really the end of Datawarehouse
Real-Time Exactly-Once Processing with Apache Flink, Kafka, and Pinot @uber
WTH is Kubernetes Operator? - An Introduction
Lessons Learned from Sharding Postgres
What is Kubernetes- An introduction
ELK Stack - Introduction of Scalable Monitoring
What is NGINX - An Introduction
What is Load Balancer - An Introduction
What is OAuth 2 - Introduction|API Based Authorization
Kubernetes & Networks - It's hard because multiple options are available
Kubernetes Reconciliation
Troubleshooting Kubernetes App
Kubernetes Best Practices - Classics
Paper: Serverless Computing: A Survey of Opportunities, Challenges, and Applications
Choosing the Kubernetes Local Cluster
Monitoring Kubernetes - Fundamentals of #kubernetes Infrastructure Monitoring
Kubernetes Controller Manager
Kubernetes: Why the Pod is still in the Pending State?
Kubernetes Liveness & Readiness Probe
Kubernetes Pod/Node Affinity
SQL
Advanced SQL - Reference CS 564 Database Management Systems
SQL and Advanced SQL - An asset
Database Indexing - Almost Everything
Tuning SQL queries - Tips for writing efficient & faster Queries
Database Schema Design - Schema Design is a Complicated Necessity
SQL Query Processing Plan - Basics
Revisiting SQL Basics - The beginning of Data Science & Data Engineering
Distributed Advanced Queries - Presto/Trino
SQL Notes For Professionals, 100+ pages
Table Partitioning
SQL complex Queries - Nested Queries & Aggregation
Gossip Protocols - Designed for Data Consistency & Fault-Tolerance
Table Partitioning - An Important Concept
Database Concurrency Control : 2 Phase Locking
Database Entity Relationship Model
SQL Join Fundamentals
Database Indexing
Database Indexing Notes
SQL Injection Introduction
SQL Constraints Fundamentals
The fundamental of writing SQL queries is different from
Building a NoSQL Database using Git
Against SQL - An article on What is not good with #sql
Using EXPLAIN for Data Problems - Things beyond SQL
10 SQL Queries to Blow Your Mind
Views, Stored Procedures, Functions & Triggers - SQL
SQL Transaction & ACID Property
How to Solve complex SQL queries
Apache Spark SQL - The Introduction from RDBMS till SparkSQL
Advanced SQL & Functions
Basic & Intermediate on Database Sharding
Complex Database Queries with PostgreSQL
Query Evaluation - Technical Details "when you execute SQL Query"
What is Materialized View & how does it work in Distributed Databases
Breaking Down NoSQL Sharding, Replication & Consistency
Database Query Optimization Technique
Intermediate SQL
SQL Stored Procedures
OLAP & OLTP - Datawarehouse Data Mining
Database Fundamentals
SQL Subqueries
NewSQL Introduction - Basic to Intermediate
SQL Intermediate & basics Deep Dive
SQL Basics - The Starting point
Data Warehousing & OLAP Technology
Snowflake Datawarehouse
RelationalAlgebra & SQL
Logical Schema Design: SQL Database
Kubernetes Pod Internals - Deep Dive
The Illustrated Children's Guide to Kubernetes
SQL Subqueries by Example
What is Write-Ahead-Logging (WAL)
SQL Transactions - a sequence of database operations
Linux Productivity Tools - This is a Data Infrastructure necessity
NoSQL
[NoSQL & MongoDB](https://www.linkedin.com/posts/iamabhishekchoudhary_nosql-mongodb-activity-6874231633654935553-Z66u)
CouchDB Introduction - Document Storage Database
Machine Learning
MLOPS
Machine Learning Workflow :100:
Dummy Notes On Machine Learning Infrastructure
Machine Learning Feature Store :100:
Deploying #machinelearning model in Production is really HARD but #MLOps can fix that.
List of #machinelearning & #dataengineering Technologies I will be following in 2021
MLOps - ZenML #machinelearning with reproducible pipelines
Why? Data Versioning is a complicated problem for Dataengineers
Explainable AI Cheat Sheet
Designing Machine Learning infrastructure
What is Log - Foundation behind Databases & Distributed Systems
How does the GIT version control work?
Project
Streamlit Healthcare Machine Learning Data App
Dstack AI - An open-source tool to develop data applications with Python
Adversarial Robustness Toolbox - a Python library for #machinelearning Security
Biopython is a set of freely available tools for biological computation written in #Python
Insightful
Time to Know More about DASK
DataEngineering vs Machine Learning
A good #machinelearning Model is only possible with a good quality of #data.
Statistics for #softwareengineer
Monitoring #machinelearning Applications
Dagster is a data orchestrator for machine learning, analytics, and ETL - Officially #machinelearning driven
Short Notes on - Open source #machinelearning Tracking System
The best example of Randomness is - #machinelearning model in Production.
Flyte is declarative, structured, and highly scalable cloud-native workflow orchestration platform for Distributed Machine Learning
Tips for Distributed System High Availability
Building DistributedLog: High-performance replicated log service
How to scale Kubernetes with Assurance
Apache Calcite - Building Sql Query Processor from Scratch over Lucene
Database Storage
ACID is the foundation of Database, BASE is for NoSQL Databases
Some common elements behind many Distributed Databases
Failure Recovery in TrinoDB
What is LLVM
What is Garbage Collection
What is Canary Deployment
Paper
Distributed System
Crazy
The Snowflake Paper - Core idea is to build an enterprise-ready #datawarehouse solution for the #cloud
Most important points around Distributed #dataengineering Platform
Fundamental of #distributedsystems Scaling - Avoiding Co-ordination
Technical Debt in #dataengineering #softwareengineering
Paper on Wander Join: Online Aggregation via Random Walks - Join problem
The Delta Lake Paper - High-Performance ACID Table Storage
Dynamo - AWS Highly Available Key-value Store #distributedsystem
An Efficient and Syntactically Idiomatic Approach to Management of Streams and Tables, A Single SQL for all
Secure & Robust Machine Learning in #healthcare
Progress in Medical Science using #deeplearning
The Amazon Redshift Paper - A fast, fully managed, petabyte-scale data warehouse solution that makes it simple and cost-effective to efficiently analyze large volumes of data using existing #businessintelligence tools
Advancing #drugdiscovery via Artificial Intelligence
Apache Calcite is a dynamic data management framework
Lakehouse - A Paper on new Generation of #datawarehouse technology
Calvin: Fast Distributed Transactions for Partitioned Database Systems
Presto or Trino - #SQL on Everything (The Design, Motivation & Performance) #presto
Design - Exactly Once Delivery & Transactional Messaging in Apache Kafka
Apache Kafka Paper : Distributed Messaging System for Log Processing
Paper: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size
Paper: Ground is an open-source data context service, a system to manage all the information that informs the use of data
Azure Data Lake Store (ADLS) is a fully-managed, elastic, scalable, and secure file system that supports #hadoop distributed file system (HDFS) and Cosmos semantics
An LFU (Least Frequently Used) Cache eviction algorithm of O(1) Runtime complexity
The Berkeley View on Cloud Computing - Paper
The Google File System - The Paper
Paper: Report on Distributed Deep Learning on Data Systems
Crystal: A Unified Cache Storage System for Analytical Databases
VoltDB
Magnet - Apache Spark Shuffle mechanism to handle petabytes of daily shuffled data and clusters with thousands of nodes
Set 2
Paper: Real-time Data Infrastructure @ Uber
Paper: DBLog, A Watermark Based Change-Data-Capture Framework by Netflix
Paper: Large Scale Distributed Systems Tracing Infrastructure
Paper: Paxos vs Raft: Distributed Consensus
Paper: Sorting in a #distributedsystem
Paper: A large scale analysis of hundreds of in-memory cache clusters
Design & Architecture of Amazon Timestream - Streaming at Scale
Distributed System Synchronization
Paper: Consistent hashing - Resizing cluster or Load in a #distributedsystems with a simple concept
Deep Dive - Foundation DB (unbundled database, OLTP, strict serializability, multi-version concurrency control, optimistic concurrency control, simulation testing)
Distributed Database - ZippyDB is the largest strongly consistent, geographically distributed key-value store at Facebook Database
BigData Metadata Management System
Machine Learning for Database Optimizations
SingleStore - A Distributed Database Management System. It's really more than a Database
ArrowSAM, in-memory genomics SAM format based on Apache Arrow
Realtime Data Processing FB - Deep Dive on #streamprocessing
ArangoDB - Native multi-model NoSQL Distributed #database, From #sql to NoSQL
To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem
How to bring robustness while Designing Large Scale Complex Systems
Facebook Datawarehouse
Building a performant OLTP system on an open-source columnar format, and supporting near-zero overhead data export to external tools
Towards Demystifying Serverless Machine Learning Training
Paper: Scalable Linear Algebra on top of Distributed Databases, this will simplify Machine Learning on Databases
Paper: Are You Sure You Want to Use MMAP in Your Database Management System
What is RBAC or Role-Based Access Control
Vectorization vs. Compilation in Query Execution
SQLite vs DuckDB
Advanced
Glow is an open-source toolkit for working with genomic data at biobank-scale and beyond using #apachespark & #deltalake
ExPASy - Databases and software tools in proteomics, #genomics, phylogeny, systems biology, evolution, population genetics, and transcriptomics
What is Metadata - A Data Engineering necessity
What is Distributed Database
To Partition, or Not to Partition, That is the Join Question in a Real System
Paper: Solana - A new architecture for a high-performance blockchain, inspired by Distributed Systems
Scaling Large Production Clusters with Partitioned Synchronization
Paper: Volcano Operator Model is based on relational algebra
Paper: Faster and Cheaper Serverless Computing on Harvested Resources
DBOS: A Paper on DBMS-oriented Operating System
SSD Storage - Scale Caching without increasing too much cost & Smart Indexing for faster data query
Paper: Lineage Tracing for General Data Warehouse Transformations
What Every Programmer Should Know About Memory
Deployment Archetypes for Cloud Applications
PolarDB Serverless: A Cloud Native Database for Disaggregated Data Centers
Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask
Dual use of artificial-intelligence-powered drug discovery
Discussions
Should you pick Managed Service or build self Managed Open Source Infrastructure
What is Sigstore
Security Threat Model
Kubernetes Security & Secrets
2 ways of Data/ML Product Development
Basics
What is Compiler - Programming language Processor
Understanding Raft Consensus
How does SSH (Secure Shell) Work
Operating System Memory Management - Why Do you even need Virtual Memory