dwyl / phoenix-ecto-append-only-log-example

πŸ“ A step-by-step example/tutorial showing how to build a Phoenix (Elixir) App where all data is immutable (append only). Precursor to Blockchain, IPFS or Solid!
GNU General Public License v2.0

Notes and questions on logs from reading #5

Cleop opened this issue 6 years ago (status: Open)

Cleop commented 6 years ago

As someone who is new to logs I could follow the example. However, discussing the subject with others brought up key terms and subject areas that were new to me. I think it would be useful to include some of this context in the README for those who stumble upon this repo without knowing what it is first.

What is a log?

AKA write-ahead log, commit log or transaction log. In this repo, "log" does not refer to 'application logging', the kind of logging you might see for error messages.

A log is one of the simplest possible storage abstractions: an append-only, totally-ordered sequence of records, ordered by time. It is typically visualised horizontally, from left to right.

A log is not all that different from a file or a table. If we consider a file as an array of bytes and a table as an array of records, then a log can be thought of as a kind of table whose records are sorted by time.

Logs are event driven: they continuously record what happened and when. Because records are stored in the order the changes occurred, you can revert to any given point in time by finding it in your records. Logs can do this in near real-time, making them ideal for analytics. They are also helpful in the event of crashes or errors: because the log records the state of the data at all times, data can easily be restored. Keeping an immutable log of the history of your data means your data stays clean and is never lost or changed. The log is appended to by publishers of data and read / acted upon by subscribers, but the records themselves cannot be mutated.
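The description above can be sketched in a few lines of code. (The repo itself uses Elixir/Ecto; this Python `AppendOnlyLog` class is purely illustrative and not part of any library.)

```python
import time

class AppendOnlyLog:
    """A minimal append-only log: records can be added, never changed or removed."""

    def __init__(self):
        self._records = []  # insertion order == time order

    def append(self, event):
        """Publishers add a record with a timestamp; returns its offset in the log."""
        self._records.append({"ts": time.time(), "event": event})
        return len(self._records) - 1

    def read(self, offset=0):
        """Subscribers read records from a given offset onward; nothing is mutated."""
        return list(self._records[offset:])

log = AppendOnlyLog()
log.append("address created")
log.append("address updated")
print([r["event"] for r in log.read()])  # ['address created', 'address updated']
```

Note there is no `update` or `delete` method at all: replaying `read(0)` up to any offset reconstructs the state at that point in time.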

Keywords

Time series database: a database system optimised for handling time series data (arrays of numbers indexed by time). They handle queries for historical data and time zones better than relational databases.

Data integration: making all the data an organisation has available in all its services and systems.

Log compaction: methods to tidy up a log by deleting data that is no longer needed.

Questions

These notes and questions came from reading: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

nelsonic commented 6 years ago

Hi @Cleop, thank you for opening this issue and summarising your knowledge quest! πŸŽ‰

Stoked that you found the Example created by @Danwhy clear and followed it on your localhost. πŸ₯‡

Answers to your Questions (above)

There is no need to confuse beginners with the term "write-ahead log" https://en.wikipedia.org/wiki/Write-ahead_logging ("TMI"); it's just an append-only log. Hopefully people will understand the words: "append", "only" and "log" ... πŸ’­ (if not, they probably aren't "ready" for this example/tutorial ...)

In practice, log compaction is rarely needed these days. Data storage costs are so cheap (see detail below) that every App developer can easily afford to store all data generated by their App, provided the app does not store arbitrary data for no reason. In most companies, data is the "lifeblood" of strategic decision making; the entire field of "BI" https://en.wikipedia.org/wiki/Business_intelligence relies on having as much data as possible.

Aside: Never Over-write or Delete Data; it is your Most Valuable Asset!

I feel "qualified" to assert that using Append-only Logs for everything in a company is the single best technology decision that can be made; logs are the foundation of the building. I helped set up "BI" at Groupon: no data was ever deleted or compacted, only periodically "warehoused" to save on storage cost, but still accessible via Hadoop. Per employee, we were the single biggest source of additional revenue, cost savings and value creation in the company: a "department" of 8 people generated $20M of annual value by spotting trends in markets, products & buying behaviour. 📈 Hedge funds (_like Bridgewater or Citadel_) use data analysis to multiply cash. It's a fascinating world/activity that generates no (NET) value to society but makes a handful of people phenomenally wealthy! https://en.wikipedia.org/wiki/The_rich_get_richer_and_the_poor_get_poorer

If you're ever interested in learning how some Mathematicians use their skills to print money, read: https://en.wikipedia.org/wiki/The_Quants (I think we still have a copy in the @dwyl library...) Or if you're short on time or prefer the dramatised version, "The Big Short" is good. 📺 If you haven't seen it, add it to your list: https://www.imdb.com/title/tt1596363 Or [Spoiler Alert] just watch the ending: https://youtu.be/Bu2wNKlVRzE The most insightful part of the film is in the end credits: Michael Burry is investing in Water. If you or anyone else has "spare capital" to do the same, invest in water (recovery/treatment). (_Not to profit from gouging the poor! Water treatment startups/ideas/tech are "win-win"._)

tl;dr

Some Data gets Updated, Most Never Does

Many developers have an "observation bias" toward data changing (mutability): because we often change the data we interact with, we assume that mutability is the "norm", but it really is not. In fact, most data gets created/written once and never updated. Append-only logs are (unsurprisingly) common in virtually every area of computing.

Even when data appears to be mutable (from the User's perspective), it is often stored in an immutable log: the underlying store is immutable, but the UI gives the appearance of mutation because it only displays the latest version of the data.

In many popular web frameworks, preserving "history" of a record (or piece of content) is often done with a "history table" which stores copies of the "previous" versions of data whenever a record is updated.
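A minimal sketch of this "immutable store, mutable-looking UI" idea (the `put`/`get`/`history` helpers below are hypothetical, not part of any framework):

```python
# Each key maps to a list of versions; an "update" appends, never overwrites.
store = {}

def put(key, value):
    """Record a new version of the value; previous versions are untouched."""
    store.setdefault(key, []).append(value)

def get(key):
    """What the UI shows: only the latest version."""
    return store[key][-1]

def history(key):
    """The full audit trail is always available."""
    return list(store[key])

put("bio", "v1: hello")
put("bio", "v2: hello, world")
print(get("bio"))      # v2: hello, world
print(history("bio"))  # ['v1: hello', 'v2: hello, world']
```

From the user's perspective, `put` followed by `get` looks exactly like a mutation; only `history` reveals that nothing was ever overwritten.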

Examples in Every Industry

Examples of use-cases where Append-only Logs are useful were included in the original "Why? What? How?" issue: https://github.com/dwyl/phoenix-ecto-append-only-log-example/issues/1

To those examples we can add:

The fact is that append-only logs, while not often mentioned by name, are the norm. Think of any application that stores data, and you can see how it either benefits from (or is totally dependent on) data immutability.

Everything is an event; everything in the universe happens once. Time is linear and uni-directional. Even in the "multiverse hypothesis", time is still uni-directional: we cannot go back in time (no theory in quantum mechanics allows it). Mutating data is "time travelling": it updates data that was created in the past and destroys the history/timeline. In a distributed system, mutating data means unpredictability/unreliability.

Address Book Example

In the example given in the README.md of an Address Book, we illustrate how a person can "update" their address when they move home, however (in the example) their previous address does not get over-written, we simply insert their new address into the database and the new address is treated as the "current" version. That way the "Address Book" has a complete record of the history. This is a good use of an append-only log because address data will almost always change over the lifetime of a person, but knowing previous addresses is useful (and in some cases essential).

If Thor decides to leave his parents' house (planet), his "old" address, kept for reference, is:

Thor
The Hall, Valhalla, Asgard
AS1 3DG
0800123123

He crashes on his buddy Stephen's couch until he gets his own place in NY. His new (temporary) address will be:

Thor Odinson c/o Dr. Strange
177A Bleecker Street,
New York, NY 10012, USA

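Here is how the "insert, never update" pattern for the Address Book might look in plain SQL (a Python/SQLite sketch for illustration; the repo itself implements this with Ecto, and the table/column names here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE addresses (
  id      INTEGER PRIMARY KEY AUTOINCREMENT,  -- insertion order == time order
  name    TEXT NOT NULL,
  address TEXT NOT NULL
)
""")

# "Updating" Thor's address is just inserting a new row; old rows are never touched.
conn.execute("INSERT INTO addresses (name, address) VALUES (?, ?)",
             ("Thor", "The Hall, Valhalla, Asgard"))
conn.execute("INSERT INTO addresses (name, address) VALUES (?, ?)",
             ("Thor", "177A Bleecker Street, New York"))

# The "current" address is simply the most recently inserted row for that person:
row = conn.execute("""
SELECT address FROM addresses
WHERE name = 'Thor'
ORDER BY id DESC LIMIT 1
""").fetchone()
print(row[0])  # 177A Bleecker Street, New York
```

Dropping the `ORDER BY id DESC LIMIT 1` gives the complete address history, so nothing is ever lost when Thor moves again.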

Thanks @mathiasbynens for this handy character/byte counter: https://mothereff.in/byte-counter πŸ₯‡

The average person in the United States is expected to move 11.4 times in their lifetime: https://fivethirtyeight.com/features/how-many-times-the-average-person-moves

Historical Context of Mutable Data: It Was Too Expensive to Store Everything

To understand why many Web Application frameworks (still) over-write data when an update is made, we need to look back to when data storage was expensive. At the dawn of digital computing, storing data was prohibitively expensive: in 1956, storing a Megabyte of data cost $9,200, i.e. $9.2 Million per Gigabyte. IBM charged $850 a month (rent) for 3.75 Megabytes of storage. With a Megabyte of storage, and assuming that the average (home) address requires 50 bytes, we could store 20K addresses; barely enough for a single town. When storage is that expensive, keeping all previous versions of data is unfeasible. This is also why early mobile phones had a limited contact list (100 contacts in most cases).
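The arithmetic above, spelled out (the 50-bytes-per-address figure is the assumption from the text, and the 1956 price is the figure cited above):

```python
MB = 1_000_000            # bytes in a Megabyte
cost_per_mb_1956 = 9_200  # USD per Megabyte in 1956, per the figure above
address_size = 50         # assumed average size of one home address, in bytes

addresses_per_mb = MB // address_size
print(addresses_per_mb)                      # 20000 addresses per Megabyte
print(cost_per_mb_1956 / addresses_per_mb)   # 0.46 -> 46 cents to store ONE address
```

At 46 cents (1956 dollars!) per address, storing every historical version of every record was simply not an option.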

Today we can store an incredible amount of data on a MicroSD card the size of a postage stamp. Data storage (cost) is no longer a limiting factor, so we can afford to design all applications to be immutable.

For the complete history, see: http://www.computerhistory.org/timeline/memory-storage
For relative prices of data storage, see: https://jcmit.net/diskprice.htm
For a good intro to Bits and Bytes, see: https://web.stanford.edu/class/cs101/bits-bytes.html