@Cleop opened this issue 6 years ago
Hi @Cleop, thank you for opening this issue and summarising your knowledge quest!
Stoked that you found the Example created by @Danwhy clear and followed it on your localhost.
[ ] "Physical vs Logical?" >> Logical. We are storing all data ("CRUD" operations) as new rows in a Postgres Database table and then (much later in the lifecycle of the product) computing a "View" https://en.wikipedia.org/wiki/View_(SQL) on that data to optimise SELECT queries.
This is way too much detail (complexity) for a beginner to worry (or even know/think) about, but the curious can read:
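For the curious, the "View" idea can be sketched in a few lines. This is illustrative Python only, with made-up field names; in the app itself it would be a SQL `VIEW` or an Ecto query:

```python
# Sketch: computing a "view" of current state from append-only rows.
# Every "CRUD" operation is a brand-new row; nothing is updated in place.
from operator import itemgetter

rows = [
    {"id": 1, "entry_id": "a", "name": "Thor", "tel": "0800123123"},
    {"id": 2, "entry_id": "b", "name": "Loki", "tel": None},
    {"id": 3, "entry_id": "a", "name": "Thor Odinson", "tel": "0800123123"},
]

def current_view(rows):
    """Return the latest row per entry_id: the 'view' of current state."""
    latest = {}
    for row in sorted(rows, key=itemgetter("id")):
        latest[row["entry_id"]] = row  # later rows overwrite earlier ones
    return list(latest.values())

print(current_view(rows))
```

The underlying table keeps every version; only the view "collapses" it to the latest state for fast reads.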
[ ] All logs are time-series by definition: if a record is stored with a timestamp, it's a time series. Perhaps a useful clarification/distinction from the Wikipedia article https://en.wikipedia.org/wiki/Time_series_database (which is kinda useless/confusing) ... Any data can be stored as time series and most ("big") data is!
A "beginner-friendly" (+funny!) intro to "Big Data": https://www.bbc.co.uk/programmes/b0b9wbf8
In the Apps we're building, we're using a UUID as the Primary Key (PK) to avoid write conflicts. Even with Erlang's timestamp precision being microseconds, anything more than 50 DB writes per second would result in an unacceptably high chance of PK collision.
Timestamps in Elixir: https://michal.muskala.eu/2017/02/02/unix-timestamps-in-elixir-1-4.html
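The "50 writes per second" figure above is birthday-problem maths. A quick sketch (illustrative Python; the numbers are assumptions, not benchmarks):

```python
# Sketch: why timestamps make poor Primary Keys (the "birthday problem").
def collision_probability(writes, slots):
    """Probability that at least two of `writes` land in the same
    timestamp 'slot' (e.g. one microsecond within a single second)."""
    p_unique = 1.0
    for i in range(writes):
        p_unique *= (slots - i) / slots
    return 1 - p_unique

# 50 inserts within one second, microsecond-precision timestamps:
p = collision_probability(50, 1_000_000)
print(f"{p:.4%}")  # ~0.12% per second; compounded over a day, a collision
                   # becomes a near-certainty
```

A random UUID draws from 2^122 "slots" instead of 10^6, which is why the collision risk is negligible in practice.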
[ ] "Are all logs append only?" >> Yes, all logs worth knowing/using are append-only. If logs are anything other than append-only, they are a source of chaos/confusion.
Good "further reading" on this topic:
[ ] "What is a record in the context of a log? Is it the equivalent of a row in a table?" >> Yes.
All records are rows. All rows have column headings. Most fields beyond the essential ones are optional.
For example, in an Address Book there's no point having an empty row
(i.e. if all fields were optional, blank records would be "OK", which is obviously undesirable),
so the minimum data acceptable for a new Address Book entry is the person's Name;
everything else is optional and can be added on subsequent writes/updates.
[ ] "Do all logs run horizontally?" >> Don't confuse yourself with the direction of "travel". All Time is linear and uni-directional. If in any doubt, logs run Vertically Top-to-bottom like reading a "page".
Think about how you like to scroll a screen/phone using your keyboard, mouse or finger: how often do you scroll horizontally? (Almost never; it's a UX anti-pattern.) By contrast, scrolling vertically is a natural UX. There might be the occasional timeline in a history book that is horizontal for aesthetic reasons, but that almost always means they have "truncated" the data points to fit on the page. Almost no computer systems are horizontal-scrolling. Note: horizontal scrolling is not the same as "swiping" on a mobile, the go-to UX for all hipster apps: https://uxplanet.org/horizontal-scrolling-in-mobile-643c81901af3
[ ] "Is a timestamp value mandatory in a record in a log?" >> Yes, by definition.
And thankfully, Phoenix/Ecto already stores timestamps: inserted_at.
Timestamps should never be used as a unique identifier in any serious system. The Primary Key (PK) should be Universally Unique (_hence using a UUID_), i.e. virtually zero chance of conflict. Using a Timestamp would almost always guarantee conflict where writes per second are greater than 1k. In a modest "Chat" application with 100k concurrent users, 1k writes/sec are "normal" (each person sends one message once every couple of minutes...). See: http://www.internetlivestats.com
[ ] "horizontal scaling partitions useful?" >> (What will you do when you win the lottery?) (It's good that you are keen! But you are over-thinking this ... come back to this question in 2020!) Thinking about Horizontal Scaling is way beyond the "scope" of this example/tutorial. To be clear: we are using an Append-only Log for reliability, accountability and record rollback-ability, not "scalability" (i.e. "premature optimisation"). Scalability is a distant future benefit, not a focus.
If you (or anyone else) are personally interested in Scalability you should read: https://stackoverflow.com/questions/11707879/difference-between-scaling-horizontally-and-vertically-for-databases
Using an Append-only Log with UUIDs as PKs is all the "ground work" we need to ensure that anything we build is prepared to scale both Vertically and Horizontally. When any of our apps reaches 10k writes/sec we will be insanely "successful".
An AWS RDS db.m4.16xlarge instance has 256GB of RAM and 10 Gbps of throughput. It's been benchmarked at 200k writes/second ... if we ever need to use one of these instances, we'll all be sipping coconut water on the shore of the @dwyl island!
[ ] 'distributed data systems' - "if a log is a move away from them" ...? >> Huh...? Logs are the basis for all distributed/fault-tolerant systems. That's all we need to know/say.
Please avoid adding more terms/complexity to the readme than are strictly necessary; anything more than "append-only log" will automatically "confuse" people who are new to this. If you are curious about understanding this in way more detail, read: https://raft.github.io and watch @substack's talk on "What you can build with a log": https://youtu.be/RPFjN1N148U For the purposes of all developers using Phoenix, just trust PostgreSQL to handle your data. You're in "good company": https://github.com/dwyl/learn-postgresql/issues/31 And if our app is lucky enough to be "successful", look into CitusDB.
There is no need to confuse beginners with the term "write-ahead log" https://en.wikipedia.org/wiki/Write-ahead_logging ("TMI"); it's just an append-only log. Hopefully people will understand the words: "append", "only" and "log" ... (if not, they probably aren't "ready" for this example/tutorial ...)
In general/practice, log compaction is never needed for most companies/products these days. Data storage costs are so cheap (see detail below) that every App developer can easily afford to store all data generated by their App - provided the app does not store arbitrary data for no reason. In most companies, data is the "life blood" of the strategic decision making, the entire field of "BI" https://en.wikipedia.org/wiki/Business_intelligence relies on having as much data as possible.
Aside: Never Over-write or Delete Data: it is your Most Valuable Asset!
I feel "qualified" to assert that using Append-only Logs for everything in a company is the single best technology decision that can be made. Logs are the foundation for the building. I helped set up "BI" at Groupon; no data was ever deleted or compacted, only periodically "warehoused" to save on storage cost, but still accessible via Hadoop. Per employee, we were the single biggest source of additional revenue, cost savings and value creation in the company: a "department" of 8 people generated $20M of annual value for the company by spotting trends in markets, products & buying behaviour. Hedge funds (_like Bridgewater or Citadel_) use data analysis to multiply cash. It's a fascinating world/activity that generates no (NET) value to society but makes a handful of people phenomenally wealthy! https://en.wikipedia.org/wiki/The_rich_get_richer_and_the_poor_get_poorer
If you're ever interested in learning how some Mathematicians use their skills to print money, read: https://en.wikipedia.org/wiki/The_Quants (I think we still have a copy in the @dwyl library...) Or if you're short on time or prefer to watch the dramatised version, "The Big Short" is good. If you haven't seen it, add it to your list: https://www.imdb.com/title/tt1596363 Or [Spoiler Alert] just watch the ending: https://youtu.be/Bu2wNKlVRzE The most insightful part of the film is in the end credits: Michael Burry is investing in Water. If you (or anyone else) have "spare capital", do the same: invest in water (recovery/treatment). (_Not to profit from gouging the poor! Water treatment startups/ideas/tech are "win-win"._)
Many developers have an "observation bias" toward data changing (mutability): because we often change the data we interact with, we assume mutability is the "norm", but it really is not. Most data gets created/written once and is never updated. Append-only logs are (unsurprisingly) common in virtually every area of computing.
Even when data appears to be mutable (from the User's perspective) it is often stored in an immutable log so the underlying store is immutable but the UI/UX appears that data is being mutated because the UI only displays the latest version of the data.
In many popular web frameworks, preserving "history" of a record (or piece of content) is often done with a "history table" which stores copies of the "previous" versions of data whenever a record is updated.
Examples of use-cases where Append-only Logs are useful were included in the original "Why? What? How?" issue: https://github.com/dwyl/phoenix-ecto-append-only-log-example/issues/1
To those examples we can add:
You may recall at the End of your F&C course you were applying to work for a company and they asked you to do a coding challenge/exercise that involved computing the values of a bank account. We paired on solving that challenge using a Log (debit and credit ledger). Data was never mutated. The account balance was always calculated "just in time" by "traversing" the Log.
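That ledger approach can be sketched in a few lines. A minimal illustration (in Python, with made-up entries), not the code we paired on:

```python
# Sketch: an append-only debit/credit ledger. The balance is never
# stored or mutated; it is computed "just in time" by traversing the log.
ledger = [
    {"type": "credit", "amount": 1000},  # deposit
    {"type": "debit",  "amount": 250},   # withdrawal
    {"type": "credit", "amount": 75},
]

def balance(entries):
    """Traverse the whole log and compute the current balance."""
    total = 0
    for entry in entries:
        total += entry["amount"] if entry["type"] == "credit" else -entry["amount"]
    return total

print(balance(ledger))  # 825
```

Because entries are only ever appended, the balance at any past point in time can be recomputed by traversing a prefix of the log.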
The fact is that append-only logs, while not often mentioned by name, are the norm. Think of any application that stores data, and you can see how it either benefits from (or is totally dependent on) data immutability.
Everything is an event. Everything in the universe happens once. Time is linear and uni-directional. Even in the "multiverse hypothesis", time is still uni-directional; we cannot (no theory in quantum mechanics allows us to) go back in time. Mutating data is "time travelling": it updates data that was created in the past and destroys the history/timeline. In a distributed system, mutating data means unpredictability/unreliability.
In the example given in the README.md
of an Address Book,
we illustrate how a person can "update" their address when they move home,
however (in the example) their previous address does not get over-written,
we simply insert their new address into the database
and the new address is treated as the "current" version.
That way the "Address Book" has a complete record of the history.
This is a good use of an append-only log
because address data will almost always change over the lifetime of a person,
but knowing previous addresses is useful (and in some cases essential).
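That "update by insert" pattern can be sketched as follows (illustrative Python; the field names are made up, not the README's schema):

```python
# Sketch: address "updates" are new inserts; the full history is preserved.
addresses = [
    {"person": "Thor", "address": "The Hall, Valhalla, Asgard", "inserted_at": 1},
    {"person": "Thor", "address": "177A Bleecker Street, New York", "inserted_at": 2},
]

def current_address(rows, person):
    """The 'current' address is simply the most recently inserted row."""
    history = [r for r in rows if r["person"] == person]
    return max(history, key=lambda r: r["inserted_at"])["address"]

def address_history(rows, person):
    """All previous addresses remain available, oldest first."""
    return [r["address"] for r in rows if r["person"] == person]

print(current_address(addresses, "Thor"))  # 177A Bleecker Street, New York
print(address_history(addresses, "Thor"))  # both addresses
```

Nothing is ever over-written: "moving home" is just one more `INSERT`.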
If Thor decides to leave his parents' house (planet), his "old" address, for reference, is:
Thor
The Hall, Valhalla, Asgard
AS1 3DG
0800123123
and crashes on his buddy Stephen's couch till he gets his own place in NY, his _new (temporary) address_ will be:
Thor Odinson c/o Dr. Strange
177A Bleecker Street,
New York, NY 10012, USA
Thanks @mathiasbynens for this handy character/byte counter: https://mothereff.in/byte-counter
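That byte count is easy to reproduce locally (a Python sketch; the address string below is just the example above joined onto one line):

```python
# Sketch: counting characters vs UTF-8 bytes in an address.
# For plain-ASCII text the two counts are equal; accented characters
# and emoji take more than one byte each in UTF-8.
address = "Thor Odinson c/o Dr. Strange, 177A Bleecker Street, New York, NY 10012, USA"
chars = len(address)
utf8_bytes = len(address.encode("utf-8"))
print(chars, utf8_bytes)  # equal here, because this address is all ASCII
```

This is also a sanity check on the "50 bytes per address" estimate used in the storage-cost arithmetic below: real addresses are often a bit longer.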
The average person in the United States is expected to move 11.4 times in their lifetime: https://fivethirtyeight.com/features/how-many-times-the-average-person-moves
To understand why many Web Application frameworks (still) over-write data when an update
is made,
we need to look back to when data storage was expensive. At the dawn of digital computing,
storing data was prohibitively expensive. In 1956, storing a Megabyte of data cost $9,200,
i.e. $9.2 Million per Gigabyte. IBM charged $850 a month (rent) for 3.75 Megabytes of storage.
With a Megabyte of storage, and assuming that the average (home) address requires 50 bytes, we could store 20K addresses. This is barely enough to store the addresses for a town.
When the cost of storage is this high, storing all previous versions is unfeasible.
This is also why early mobile phones had a limited contact list (100 contacts in most cases).
Today we can store an incredible amount of data on a MicroSD card the size of a postage stamp: Data storage (cost) is no longer a limiting factor, so we can afford to design all applications to be immutable.
For complete history, see: http://www.computerhistory.org/timeline/memory-storage For relative prices of data storage see: https://jcmit.net/diskprice.htm For a good intro to Bits and Bytes, see: https://web.stanford.edu/class/cs101/bits-bytes.html
As someone who is new to logs I could follow the example. However whilst discussing the subject area with others it brought up key terms and subject areas that were new to me. I think it would be useful to include some of this context in the readme for those who may stumble upon this repo without knowing what it is first.
What is a log?
AKA write-ahead log, commit log, transaction log. In this repo it will not refer to 'application logging', the kind of logging you might see for error messages.
A log is one of the simplest possible storage abstractions: an append-only, totally-ordered sequence of records ordered by time. They are visualised horizontally, from left to right.
They're not all that different from a file or a table: if we consider a file as an array of bytes and a table as an array of records, then a log can be thought of as a kind of table where the records are sorted by time.
Logs are event driven. They record what happened and when, continuously. Because the records are stored in the order the changes occurred, at any point you can revert to a given point in time by finding it in your records. They can do this in near real-time, making them ideal for analytics. They are also helpful in the event of crashes or errors: their record of the state of the data at all times means data can easily be restored. By keeping an immutable log of the history of your data, your data is kept clean and is never lost or changed. The log is added to by publishers of data and used / acted upon by subscribers, but the records themselves cannot be mutated.
Keywords
Time series database: a database system optimised for handling time series data (arrays of numbers indexed by time). They handle queries for historical data/ time zones better than relational dbs.
Data integration: making all the data an organisation has available in all its services and systems.
Log compaction: methods to tidy up a log by deleting no longer needed data.
Questions
These notes and questions came from reading: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying