What Goes Around Comes Around (Stonebraker, 2006)

This is reading from the first chapter of Redbook

Stonebraker summarizes 35 years of data model proposals, grouped into 9 eras.

Hierarchical (IMS) : strict single parent
- tree structured data models -> limited data models can be represented.
  - Limited logical data independence.
  - Suffer from existence problems: representing entity that has no parent
- Tuple-at-a-time query language -> manual query optimization (hard)
  - Limited physical data independence (or high complexity to enable)
- DBs with main focus on XML and JSON has parallels to IMS.
Network (CODASYL) : allows multiple parents
- Suffer from complexity
- Not flexible enough: suffer to represent three-way relationships.
- Tuple-at-a-time query language.
- Poorer data independence compared to IMS.
Relational
- Rooted in relational algebra, motivated by better data independence.
- High level (Set-at-a-time) query language -> improved physical data independence
  - Query optimizers can beat all but the best tuple-at-a-time DBMS application programmers.
- Simple data model -> easier logical data independence. KISS instantiation.
- QUEL, which many has pointed out to be a better (follows Codd's relational algebra proposal more closely, hence is more composable) query language, loses out to SQL because IBM, the elephant of the market at the time, decided to use SQL and other vendors followed.
- Early systems include IBM's System R and Berkeley's Ingres.
Entity-Relationship
- Never took off as underlying data model for DBMS.
- Very useful for database design, i.e. determining the first set of tables, and easily achieve 3NF.
  - This is because functional dependencies are too difficult. KISS instantiation.
Extended Relational (R++): extend relational to specific application.
- Proposed query language extensions, but didn't achieve any big performance / functionality advantage to take off.
  - aggregation, but this can be easily achieved via JOIN.
  - generalization, but this can be achieved via PK-FK constraint
Semantic Data Model:
- Proposed classes and multiple inheritance to generalization query language extensions proposed by R++. Faced similar problems as R++.
Object-Oriented
- Motivated to remove data loading/unloading from the database. However, it returns to essentially tuple-at-a-time query language.
- Never took off as its focus is on engineering databases, a niche market compared to business data processing.
Object-Relational (OR)
- Successfully proposed user-defined {data types, operators, functions, access methods} (UDTs and UDFs) to extend database capability for other markets
  - It is motivated by optimizing 2D access methods in GIS applications.
  - This puts code in the database (blurring the distinction between code and data)
- Postgres is the major OR research prototype.
Semi-structure (XML)
- schema-last is useful for semi-structured data, which are often entered as a text document and parsed to find information of interest.
  - Pure text data is handled by IR systems, and rigidly structured data is better handled by relational "schema-first" systems.
- XML data model suffers from complexity.

Lessons:

Tuple-at-a-time query language is bad: hard to optimize and makes physical data independence harder.
Unless there is a big performance or functionality advantage, new constructs will go nowhere.
Simple systems last longer. KISS
Widespread adoption requires either standards / elephants of the marketplace pushing hard. Similarly, technical debates are often settled by elephants, and often not for technical reasons
There are only a few noticeably new data model ideas over the years, namely UDFs and schema-last. Other proposals either fail or is absorbed into "SQL over tables" relational data model.

benclmnt / papers

What Goes Around Comes Around (Stonebraker, 2006) #9