What Goes Around Comes Around (Stonebraker, 2006) #9

Open benclmnt opened 1 year ago

benclmnt commented 1 year ago

These are notes from reading the first chapter of the Redbook.

Stonebraker summarizes 35 years of data model proposals, grouped into 9 eras.

  1. Hierarchical (IMS) : strict single parent
    • Tree-structured data model -> only a limited range of data can be represented naturally.
      • Limited logical data independence.
      • Suffers from existence problems: an entity that has no parent cannot be represented.
    • Tuple-at-a-time query language -> manual query optimization (hard)
      • Limited physical data independence (or high complexity to enable)
    • DBs whose main focus is XML or JSON have parallels to IMS.
  2. Network (CODASYL) : allows multiple parents
    • Suffers from complexity.
    • Not flexible enough: struggles to represent three-way relationships.
    • Tuple-at-a-time query language.
    • Poorer data independence compared to IMS.
  3. Relational
    • Rooted in relational algebra, motivated by better data independence.
    • High level (Set-at-a-time) query language -> improved physical data independence
      • Query optimizers can beat all but the best tuple-at-a-time DBMS application programmers.
    • Simple data model -> easier logical data independence. KISS instantiation.
    • QUEL, which many have pointed out to be the better query language (it follows Codd's relational algebra proposal more closely and is hence more composable), lost out to SQL because IBM, the elephant of the market at the time, decided to use SQL and other vendors followed.
    • Early systems include IBM's System R and Berkeley's Ingres.
  4. Entity-Relationship
    • Never took off as the underlying data model of a DBMS.
    • Very useful for database design, i.e. determining the first set of tables and easily reaching 3NF.
      • This is because normalization via functional dependencies is too difficult for most practitioners. KISS instantiation.
  5. Extended Relational (R++): extends the relational model for specific applications.
    • Proposed query language extensions, but none offered a big enough performance / functionality advantage to take off.
      • aggregation, but this can be easily achieved via JOIN.
      • generalization, but this can be achieved via PK-FK constraint
  6. Semantic Data Model:
    • Proposed classes and multiple inheritance to generalize the query language extensions proposed by R++. Faced similar problems as R++.
  7. Object-Oriented
    • Motivated by removing the need to load/unload data between the database and the application. However, it essentially returns to a tuple-at-a-time query language.
    • Never took off as its focus is on engineering databases, a niche market compared to business data processing.
  8. Object-Relational (OR)
    • Successfully proposed user-defined {data types, operators, functions, access methods} (UDTs and UDFs) to extend database capability for other markets
      • It is motivated by optimizing 2D access methods in GIS applications.
      • This puts code in the database (blurring the distinction between code and data)
    • Postgres is the major OR research prototype.
  9. Semi-structured (XML)
    • Schema-last is useful for semi-structured data, which is often entered as a text document and parsed to find information of interest.
      • Pure text data is handled by IR systems, and rigidly structured data is better handled by relational "schema-first" systems.
    • XML data model suffers from complexity.
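
To make the tuple-at-a-time vs set-at-a-time distinction from eras 1 to 3 concrete, here is a minimal Python sketch. The supplier/part data, the table names, and the function names are all made up for illustration (loosely modelled on a supplier/part schema), and plain dicts stand in for navigational records. The first function walks records one at a time along a hand-picked access path; the second states what it wants in SQL and lets the engine (here SQLite, via the standard `sqlite3` module) choose the plan.

```python
import sqlite3

# Hypothetical data: suppliers, parts, and which supplier supplies which part.
suppliers = {16: "General Supply", 24: "Special Supply"}
parts = {27: ("Power saw", "red"), 42: ("Bolts", "black")}
supply = [(16, 27, 7), (16, 42, 5), (24, 42, 3)]  # (sno, pno, qty)


def red_part_suppliers_tuple_at_a_time():
    """Navigate record by record; the programmer hard-codes the access path."""
    result = set()
    for sno, pno, _qty in supply:        # visit each Supply record in turn
        _pname, color = parts[pno]       # follow the link to its Part
        if color == "red":
            result.add(suppliers[sno])   # follow the link to its Supplier
    return sorted(result)


def red_part_suppliers_set_at_a_time():
    """Declare the result set; the query optimizer picks the access path."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE supplier (sno INTEGER PRIMARY KEY, sname TEXT);
        CREATE TABLE part (pno INTEGER PRIMARY KEY, pname TEXT, color TEXT);
        CREATE TABLE supply (sno INTEGER, pno INTEGER, qty INTEGER);
        """
    )
    conn.executemany("INSERT INTO supplier VALUES (?, ?)", list(suppliers.items()))
    conn.executemany(
        "INSERT INTO part VALUES (?, ?, ?)",
        [(pno, pname, color) for pno, (pname, color) in parts.items()],
    )
    conn.executemany("INSERT INTO supply VALUES (?, ?, ?)", supply)
    rows = conn.execute(
        """
        SELECT DISTINCT s.sname
        FROM supplier AS s
        JOIN supply AS sp ON sp.sno = s.sno
        JOIN part AS p ON p.pno = sp.pno
        WHERE p.color = 'red'
        ORDER BY s.sname
        """
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(red_part_suppliers_tuple_at_a_time())
    print(red_part_suppliers_set_at_a_time())
```

Both functions return the same answer. The difference is that changing the physical layout (say, adding an index on `part.color`) would leave the SQL version untouched, while the loop would have to be rewritten to take advantage of it.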

Lessons:

benclmnt commented 5 months ago

From Kleppmann's Designing Data-Intensive Applications, page 37:

With both the hierarchical and the network model, if you didn’t have a path to the data you wanted, you were in a difficult situation. You could change the access paths, but then you had to go through a lot of handwritten database query code and rewrite it to handle the new access paths.

... joins on foreign keys are performed at query time, whereas in CODASYL, the join was effectively done at insert time.
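
A rough Python sketch of the "join at insert time" vs "join at query time" point, under heavy simplification: `NetworkStore` and `RelationalStore` are hypothetical toy classes, not how CODASYL or any real DBMS is implemented. The network-style store links each supply record into its owner's set when it is inserted, so only the owner-to-member path is cheap. The relational-style store just records the foreign key and resolves the relationship when asked.

```python
class NetworkStore:
    """Toy CODASYL-like store: members are linked to their owner on insert."""

    def __init__(self):
        self.owners = {}  # sno -> list of (pno, qty) member records

    def add_supplier(self, sno):
        self.owners[sno] = []

    def insert_supply(self, sno, pno, qty):
        # The "join" happens here: the record is attached to its owner's set.
        self.owners[sno].append((pno, qty))

    def parts_supplied_by(self, sno):
        # Predefined access path: just follow the owner's set.
        return [pno for pno, _qty in self.owners[sno]]

    def suppliers_of(self, pno):
        # No predefined path from part to supplier: hand-written navigation
        # over every owner and every member record.
        return [
            sno
            for sno, members in self.owners.items()
            if any(member_pno == pno for member_pno, _qty in members)
        ]


class RelationalStore:
    """Toy relational store: rows carry foreign keys, joins run at query time."""

    def __init__(self):
        self.supply = []  # rows of (sno, pno, qty)

    def insert_supply(self, sno, pno, qty):
        self.supply.append((sno, pno, qty))  # no linking, just store the row

    def parts_supplied_by(self, sno):
        return [pno for s, pno, _qty in self.supply if s == sno]

    def suppliers_of(self, pno):
        # Same table, different predicate: no new access path is needed.
        return [s for s, p, _qty in self.supply if p == pno]


if __name__ == "__main__":
    rows = [(16, 27, 7), (16, 42, 5), (24, 42, 3)]  # made-up (sno, pno, qty)
    net, rel = NetworkStore(), RelationalStore()
    for sno in (16, 24):
        net.add_supplier(sno)
    for row in rows:
        net.insert_supply(*row)
        rel.insert_supply(*row)
    print(net.parts_supplied_by(16), rel.parts_supplied_by(16))
    print(net.suppliers_of(42), rel.suppliers_of(42))
```

In the toy network store, every new question that does not match the stored owner/member structure needs its own navigation code, which is the rewrite cost the quote describes. In the relational version both questions are symmetric filters over the same rows.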