@Cleop opened this issue 6 years ago
Hi @Cleop, thank you for opening this issue and summarising your knowledge quest!
Stoked that you found the Example created by @Danwhy clear and followed it on your localhost.
[ ] "Physical vs Logical?" >> Logical. We are storing all data ("CRUD" operations) as new rows in a Postgres Database table and then (much later in the lifecycle of the product) computing a "View" https://en.wikipedia.org/wiki/View_(SQL) on that data to optimise SELECT queries.
This is way too much detail (complexity) for a beginner to worry (or even know/think) about, but the curious can read:
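For the curious, the "View" idea can be sketched in a few lines. This is illustrative Python only, with made-up field names; in the app itself it would be a SQL `VIEW` or an Ecto query:

```python
# Sketch: computing a "view" of current state from append-only rows.
# Every "CRUD" operation is a brand-new row; nothing is updated in place.
from operator import itemgetter

rows = [
    {"id": 1, "entry_id": "a", "name": "Thor", "tel": "0800123123"},
    {"id": 2, "entry_id": "b", "name": "Loki", "tel": None},
    {"id": 3, "entry_id": "a", "name": "Thor Odinson", "tel": "0800123123"},
]

def current_view(rows):
    """Return the latest row per entry_id: the 'view' of current state."""
    latest = {}
    for row in sorted(rows, key=itemgetter("id")):
        latest[row["entry_id"]] = row  # later rows overwrite earlier ones
    return list(latest.values())

print(current_view(rows))
```

The underlying table keeps every version; only the view "collapses" it to the latest state for fast reads.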
[ ] All logs are time-series by definition: if a record is stored with a timestamp, it's a time series. Perhaps a useful clarification/distinction from the Wikipedia article https://en.wikipedia.org/wiki/Time_series_database (which is kinda useless/confusing) ... Any data can be stored as time series and most ("big") data is!
A "beginner-friendly" (+funny!) intro to "Big Data": https://www.bbc.co.uk/programmes/b0b9wbf8
In the Apps we're building, we're using a UUID as the Primary Key (PK) to avoid write conflicts. Even with Erlang's timestamp precision being microseconds, anything more than 50 DB writes per second would result in an unacceptably high chance of PK collision.
Timestamps in Elixir: https://michal.muskala.eu/2017/02/02/unix-timestamps-in-elixir-1-4.html
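The "50 writes per second" figure above is birthday-problem maths. A quick sketch (illustrative Python; the numbers are assumptions, not benchmarks):

```python
# Sketch: why timestamps make poor Primary Keys (the "birthday problem").
def collision_probability(writes, slots):
    """Probability that at least two of `writes` land in the same
    timestamp 'slot' (e.g. one microsecond within a single second)."""
    p_unique = 1.0
    for i in range(writes):
        p_unique *= (slots - i) / slots
    return 1 - p_unique

# 50 inserts within one second, microsecond-precision timestamps:
p = collision_probability(50, 1_000_000)
print(f"{p:.4%}")  # ~0.12% per second; compounded over a day, a collision
                   # becomes a near-certainty
```

A random UUID draws from 2^122 "slots" instead of 10^6, which is why the collision risk is negligible in practice.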
[ ] "Are all logs append only?" >> Yes, all logs worth knowing/using are append-only. If logs are anything other than append-only, they are a source of chaos/confusion.
Good "further reading" on this topic:
[ ] "What is a record in the context of a log? Is it the equivalent of a row in a table?" >> Yes.
All records are rows. All rows have column headings. Most fields beyond the essential ones are optional.
For example, in an Address Book there's no point having an empty row
(i.e. if all fields were optional, blank records would be "OK", which is obviously undesirable),
so the minimum data acceptable for a new Address Book entry is the person's Name;
everything else is optional and can be added on subsequent writes/updates.
[ ] "Do all logs run horizontally?" >> Don't confuse yourself with the direction of "travel". All Time is linear and uni-directional. If in any doubt, logs run Vertically Top-to-bottom like reading a "page".
Think about how you like to scroll a screen/phone using your keyboard, mouse or finger: how often do you scroll horizontally? (Almost never; it's a UX anti-pattern.) By contrast, scrolling vertically is a natural UX. There might be the occasional timeline in a history book that is horizontal for aesthetic reasons, but that almost always means they have "truncated" the data points to fit on the page. Almost no computer systems are horizontal-scrolling. Note: horizontal scrolling is not the same as "swiping" on a mobile, the go-to UX for all hipster apps: https://uxplanet.org/horizontal-scrolling-in-mobile-643c81901af3
[ ] "Is a timestamp value mandatory in a record in a log?" >> Yes, by definition.
And thankfully, Phoenix/Ecto already stores timestamps: inserted_at.
Timestamps should never be used as a unique identifier in any serious system. The Primary Key (PK) should be Universally Unique (_hence using a UUID_), i.e. virtually zero chance of conflict. Using a Timestamp would almost always guarantee conflict where writes per second are greater than 1k. In a modest "Chat" application with 100k concurrent users, 1k writes/sec are "normal" (each person sends one message once every couple of minutes...). See: http://www.internetlivestats.com
[ ] "horizontal scaling partitions useful?" >> (What will you do when you win the lottery?) (It's good that you are keen! But you are over-thinking this ... come back to this question in 2020!) Thinking about Horizontal Scaling is way beyond the "scope" of this example/tutorial. To be clear: we are using an Append-only Log for reliability, accountability and record rollback-ability, not "scalability" (i.e. "premature optimisation"). Scalability is a distant future benefit, not a focus.
If you (or anyone else) are personally interested in Scalability you should read: https://stackoverflow.com/questions/11707879/difference-between-scaling-horizontally-and-vertically-for-databases
Using an Append-only Log with UUIDs as PKs is all the "ground work" we need to ensure that anything we build is prepared to scale both Vertically and Horizontally. When any of our apps reaches 10k writes/sec we will be insanely "successful".
An AWS RDS db.m4.16xlarge instance has 256GB of RAM and 10 Gbps of throughput. It's been benchmarked at 200k writes/second ... if we ever need to use one of these instances, we'll all be sipping coconut water on the shore of the @dwyl island!
[ ] 'distributed data systems' - "if a log is a move away from them" ...? >> Huh...? Logs are the basis for all distributed/fault-tolerant systems. That's all we need to know/say.
Please avoid adding more terms/complexity to the readme than are strictly necessary; anything more than "append-only log" will automatically "confuse" people who are new to this. If you are curious about understanding this in way more detail, read: https://raft.github.io and watch @substack's talk on "What you can build with a log": https://youtu.be/RPFjN1N148U For the purposes of all developers using Phoenix, just trust PostgreSQL to handle your data. You're in "good company": https://github.com/dwyl/learn-postgresql/issues/31 And if our app is lucky enough to be "successful", look into CitusDB.
There is no need to confuse beginners with the term "write-ahead log" https://en.wikipedia.org/wiki/Write-ahead_logging ("TMI"); it's just an append-only log. Hopefully people will understand the words: "append", "only" and "log" ... (if not, they probably aren't "ready" for this example/tutorial ...)
In general/practice, log compaction is never needed for most companies/products these days. Data storage costs are so cheap (see detail below) that every App developer can easily afford to store all data generated by their App - provided the app does not store arbitrary data for no reason. In most companies, data is the "life blood" of the strategic decision making, the entire field of "BI" https://en.wikipedia.org/wiki/Business_intelligence relies on having as much data as possible.
Aside: Never Over-write or Delete Data: it is your Most Valuable Asset!
I feel "qualified" to assert that using Append-only Logs for everything in a company is the single best technology decision that can be made. Logs are the foundation for the building. I helped set up "BI" at Groupon; no data was ever deleted or compacted, only periodically "warehoused" to save on storage cost, but still accessible via Hadoop. Per employee, we were the single biggest source of additional revenue, cost savings and value creation in the company: a "department" of 8 people generated $20M of annual value for the company by spotting trends in markets, products & buying behaviour. Hedge funds (_like Bridgewater or Citadel_) use data analysis to multiply cash. It's a fascinating world/activity that generates no (NET) value to society but makes a handful of people phenomenally wealthy! https://en.wikipedia.org/wiki/The_rich_get_richer_and_the_poor_get_poorer
If you're ever interested in learning how some Mathematicians use their skills to print money, read: https://en.wikipedia.org/wiki/The_Quants (I think we still have a copy in the @dwyl library...) Or if you're short on time or prefer to watch the dramatised version, "The Big Short" is good. If you haven't seen it, add it to your list: https://www.imdb.com/title/tt1596363 Or [Spoiler Alert] just watch the ending: https://youtu.be/Bu2wNKlVRzE The most insightful part of the film is in the end credits: Michael Burry is investing in Water. If you (or anyone else) have "spare capital", do the same: invest in water (recovery/treatment). (_Not to profit from gouging the poor! Water treatment startups/ideas/tech are "win-win"._)
Many developers have an "observation bias" toward data changing (mutability): because we often change the data we interact with, we assume mutability is the "norm", but it really is not. Most data gets created/written once and is never updated. Append-only logs are (unsurprisingly) common in virtually every area of computing.
Even when data appears to be mutable (from the User's perspective) it is often stored in an immutable log so the underlying store is immutable but the UI/UX appears that data is being mutated because the UI only displays the latest version of the data.
In many popular web frameworks, preserving "history" of a record (or piece of content) is often done with a "history table" which stores copies of the "previous" versions of data whenever a record is updated.
Examples of use-cases where Append-only Logs are useful were included in the original "Why? What? How?" issue: https://github.com/dwyl/phoenix-ecto-append-only-log-example/issues/1
To those examples we can add:
You may recall at the End of your F&C course you were applying to work for a company and they asked you to do a coding challenge/exercise that involved computing the values of a bank account. We paired on solving that challenge using a Log (debit and credit ledger). Data was never mutated. The account balance was always calculated "just in time" by "traversing" the Log.
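That ledger approach can be sketched in a few lines. A minimal illustration (in Python, with made-up entries), not the code we paired on:

```python
# Sketch: an append-only debit/credit ledger. The balance is never
# stored or mutated; it is computed "just in time" by traversing the log.
ledger = [
    {"type": "credit", "amount": 1000},  # deposit
    {"type": "debit",  "amount": 250},   # withdrawal
    {"type": "credit", "amount": 75},
]

def balance(entries):
    """Traverse the whole log and compute the current balance."""
    total = 0
    for entry in entries:
        total += entry["amount"] if entry["type"] == "credit" else -entry["amount"]
    return total

print(balance(ledger))  # 825
```

Because entries are only ever appended, the balance at any past point in time can be recomputed by traversing a prefix of the log.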
The fact is that append-only logs, while not often mentioned by name, are the norm. Think of any application that stores data, and you can see how it either benefits from (or is totally dependent on) data immutability.
Everything is an event. Everything in the universe happens once. Time is linear and uni-directional. Even in the "multiverse hypothesis", time is still uni-directional; we cannot (no theory in quantum mechanics allows us to) go back in time. Mutating data is "time travelling": it updates data that was created in the past and destroys the history/timeline. In a distributed system, mutating data means unpredictability/unreliability.
In the example given in the README.md
of an Address Book,
we illustrate how a person can "update" their address when they move home,
however (in the example) their previous address does not get over-written,
we simply insert their new address into the database
and the new address is treated as the "current" version.
That way the "Address Book" has a complete record of the history.
This is a good use of an append-only log
because address data will almost always change over the lifetime of a person,
but knowing previous addresses is useful (and in some cases essential).
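That "update by insert" pattern can be sketched as follows (illustrative Python; the field names are made up, not the README's schema):

```python
# Sketch: address "updates" are new inserts; the full history is preserved.
addresses = [
    {"person": "Thor", "address": "The Hall, Valhalla, Asgard", "inserted_at": 1},
    {"person": "Thor", "address": "177A Bleecker Street, New York", "inserted_at": 2},
]

def current_address(rows, person):
    """The 'current' address is simply the most recently inserted row."""
    history = [r for r in rows if r["person"] == person]
    return max(history, key=lambda r: r["inserted_at"])["address"]

def address_history(rows, person):
    """All previous addresses remain available, oldest first."""
    return [r["address"] for r in rows if r["person"] == person]

print(current_address(addresses, "Thor"))  # 177A Bleecker Street, New York
print(address_history(addresses, "Thor"))  # both addresses
```

Nothing is ever over-written: "moving home" is just one more `INSERT`.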
If Thor decides to leave his parents' house (planet), his "old" address, for reference, is:
Thor
The Hall, Valhalla, Asgard
AS1 3DG
0800123123
and crashes on his buddy Stephen's couch till he gets his own place in NY, his _new (temporary) address_ will be:
Thor Odinson c/o Dr. Strange
177A Bleecker Street,
New York, NY 10012, USA
Thanks @mathiasbynens for this handy character/byte counter: https://mothereff.in/byte-counter
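That byte count is easy to reproduce locally (a Python sketch; the address string below is just the example above joined onto one line):

```python
# Sketch: counting characters vs UTF-8 bytes in an address.
# For plain-ASCII text the two counts are equal; accented characters
# and emoji take more than one byte each in UTF-8.
address = "Thor Odinson c/o Dr. Strange, 177A Bleecker Street, New York, NY 10012, USA"
chars = len(address)
utf8_bytes = len(address.encode("utf-8"))
print(chars, utf8_bytes)  # equal here, because this address is all ASCII
```

This is also a sanity check on the "50 bytes per address" estimate used in the storage-cost arithmetic below: real addresses are often a bit longer.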
The average person in the United States is expected to move 11.4 times in their lifetime: https://fivethirtyeight.com/features/how-many-times-the-average-person-moves
To understand why many Web Application frameworks (still) over-write data when an update
is made,
we need to look back to when data storage was expensive. At the dawn of digital computing,
storing data was prohibitively expensive. In 1956, storing a Megabyte of data cost $9,200,
i.e. $9.2 Million per Gigabyte. IBM charged $850 a month (rent) for 3.75 Megabytes of storage.
With a Megabyte of storage, and assuming that the average (home) address requires 50 bytes, we could store 20K addresses. This is barely enough to store the addresses for a town.
When the cost of storage is this high, storing all previous versions is unfeasible.
This is also why early mobile phones had a limited contact list (100 contacts in most cases).
Today we can store an incredible amount of data on a MicroSD card the size of a postage stamp: Data storage (cost) is no longer a limiting factor, so we can afford to design all applications to be immutable.
For complete history, see: http://www.computerhistory.org/timeline/memory-storage For relative prices of data storage see: https://jcmit.net/diskprice.htm For a good intro to Bits and Bytes, see: https://web.stanford.edu/class/cs101/bits-bytes.html
As someone who is new to logs I could follow the example. However whilst discussing the subject area with others it brought up key terms and subject areas that were new to me. I think it would be useful to include some of this context in the readme for those who may stumble upon this repo without knowing what it is first.
What is a log?
AKA write-ahead log, commit log, transaction log. In this repo it will not refer to 'application logging', the kind of logging you might see for error messages.
A log is one of the simplest possible storage abstractions: an append-only, totally-ordered sequence of records ordered by time. They are visualised horizontally, from left to right.
They're not all that different from a file or a table: if we consider a file as an array of bytes and a table as an array of records, then a log can be thought of as a kind of table where the records are sorted by time.
Logs are event driven. They record what happened and when, continuously. Because the records are stored in the order the changes occurred, at any point you can revert to a given point in time by finding it in your records. They can do this in near real-time, making them ideal for analytics. They are also helpful in the event of crashes or errors: their record of the state of the data at all times means data can easily be restored. By keeping an immutable log of the history of your data, your data is kept clean and is never lost or changed. The log is added to by publishers of data and used / acted upon by subscribers, but the records themselves cannot be mutated.
Keywords
Time series database: a database system optimised for handling time series data (arrays of numbers indexed by time). They handle queries for historical data/ time zones better than relational dbs.
Data integration: making all the data an organisation has available in all its services and systems.
Log compaction: methods to tidy up a log by deleting no longer needed data.
Questions
These notes and questions came from reading: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying