finos / architecture-as-code

"Architecture as Code" (AasC) aims to devise and manage software architecture via a machine readable and version-controlled codebase, fostering a robust understanding, efficient development, and seamless maintenance of complex software architectures
https://finos.github.io/architecture-as-code/
Apache License 2.0

Refine definitions of Observability Domain(s) and Introduce Architecture Observability as a Capability #235

Open mgasca opened 7 months ago

mgasca commented 7 months ago

Feature Request

Description of Problem:

The image below was shared by @rocketstack-matt at the London on-site accelerator:

[image: Screenshot 2023-11-03 at 11 12 12]

Refine definitions of Observability Domains

When discussing Observability I want to ensure that all participants in the conversation are on the same page, so that we can have clear, concise and valuable conversations about how Observability fits into AasC.

System Observability

In the diagram shared above, Observability is specifically System Observability, which is the Domain of Observability that most people are familiar with. There are other Domains of Observability that are relevant here; I mention a couple below.

As we know, System Observability is a qualitative characteristic that describes how well we can understand the internals of our Systems by virtue of the external signals they provide. The three canonical pillars of System Observability are Logs, Metrics and Distributed Traces. IMO, however, Logs are just a degenerate form of Events (although some may argue that Events should be considered a fourth pillar alongside the other three).

System Observability only begins to provide value once we can derive and convey Actionable Insights. One way Actionable Insights can be realized is in terms of Monitors. Monitors use concepts such as Thresholds/Capacity to Trigger Actions. These Actions can be as simple as Alerting via Channels and as complex as triggering downstream automation pipelines that transform code or are part of a set of Actions that implement Self-Healing features (or more simply Elasticity) for Systems.
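To make the Monitor idea concrete, here is a minimal sketch, assuming a Monitor is just a named Threshold plus an Action callback. The `Monitor` class, the `alert_channel` function and the `p99_latency_ms` metric are hypothetical names chosen for illustration, not part of any AasC schema.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Monitor:
    """Hypothetical monitor: triggers an action when a sample exceeds a threshold."""
    name: str
    threshold: float
    action: Callable[[str, float], None]

    def observe(self, value: float) -> bool:
        """Evaluate one metric sample and trigger the action on a breach."""
        breached = value > self.threshold
        if breached:
            self.action(self.name, value)
        return breached


alerts: List[str] = []


def alert_channel(monitor_name: str, value: float) -> None:
    # Stand-in for a real channel (chat, pager) or a downstream pipeline trigger.
    alerts.append(f"{monitor_name} breached with value {value}")


latency_monitor = Monitor(name="p99_latency_ms", threshold=500.0, action=alert_channel)
latency_monitor.observe(320.0)  # below the threshold, no action
latency_monitor.observe(740.0)  # above the threshold, the action fires
```

The Action is passed in as a callback so that the same Monitor could alert, start a workflow, or drive a Self-Healing step without changing the threshold logic.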

As it relates to Architecture as Code, System Observability will be one of the sources of Metrics that will drive Fitness Scoring calculations and Indicators for Threshold breach calculations in Monitors that serve as (Holistic, Continuous) Fitness Functions.

image

ref Software Architecture Metrics (Ciceri, Farley, Ford, Harmel-Law, et al.)

Data Observability

Data Observability is how well we can understand our Data by virtue of signals that are generated regarding our Data.

Various data-stores/databases have had good Logs/Events for a very long time, and they are structured much better than things like Application/Service Logs. Audit tables are one form, as are the bi-temporal aspects of bi-temporal stores. For ACID databases, the Transaction Logs are actually perfect Logs. Each Atom in a Time-Series Database is actually a Log/Event for that Database. For services/bounded contexts that leverage Event Sourcing, the Event Store is a set of perfect logs for that store.

The dual to System Observability's Distributed Tracing in the domain of Data Observability is Data Lineage, which is included under the Data Domain in the image from @rocketstack-matt above. Distributed Traces (Spans) are signals that describe the flow of logic over time; Data Lineage comprises signals that describe the flow of Data over time.

With regards to Data Metrics, what are the important Metrics and what shape do they take? It depends. What it depends on is what Insights we are trying to derive in order to fuel specific Actions. This is the same in all Observability Domains. This will likely include things like Freshness.

We will likely want a Capability that subsumes Data Quality/Freshness/Consistency (is this the existing Data Catalog Capability in the image above?). In the same way that the actual signals from the System Observability domain will fuel Fitness Score calculations and System Fitness Functions, Data Observability will fuel calculations such as Data Freshness and inform Monitors that implement Data Fitness Functions around things like Quality/Consistency.
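As one illustration of a Data Freshness calculation, here is a minimal sketch, assuming Freshness is simply the age of a dataset's most recent update and that a Data Fitness Function compares that age with a maximum allowed age. The function names, timestamps and two-hour limit are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def freshness(last_updated: datetime, now: datetime) -> timedelta:
    """Age of the data: time elapsed since the last recorded update."""
    return now - last_updated


def is_stale(last_updated: datetime, now: datetime, max_age: timedelta) -> bool:
    """Fitness check: the data is stale when its age exceeds the allowed maximum."""
    return freshness(last_updated, now) > max_age


# Hypothetical sample values for illustration only.
now = datetime(2023, 11, 3, 12, 0, tzinfo=timezone.utc)
last_load = datetime(2023, 11, 3, 9, 30, tzinfo=timezone.utc)

age = freshness(last_load, now)
stale = is_stale(last_load, now, max_age=timedelta(hours=2))
```

Passing `now` in explicitly keeps the calculation deterministic, which matters if the same function is replayed over historical signals.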

Code Observability

Code Observability is how well we can understand our Code by virtue of signals that are generated regarding our Code.

Source Control Management systems, similar to databases/stores, have been providing well-structured logs for a long time. We just need to avail ourselves of them. Considering Git, even if a commit is not technically a simple diff, we can think of it as such. A Commit and its Diffs are therefore the single Event/Log Line for Code Observability. So we have perfect logs for Code Observability in the form of Git history.
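To show how Git history can be read as an event log, here is a minimal sketch, assuming `git log` is asked for one line per commit with fields separated by the ASCII unit separator (`%x1f`). The `CommitEvent` type, the helper names and the sample commits are hypothetical; the format placeholders `%H`, `%an`, `%aI` and `%s` are standard `git log` pretty-format fields.

```python
import subprocess
from dataclasses import dataclass
from datetime import datetime
from typing import List

FIELD_SEP = "\x1f"
# Commit hash, author name, author date (strict ISO 8601), subject line.
GIT_FORMAT = "%H%x1f%an%x1f%aI%x1f%s"


@dataclass(frozen=True)
class CommitEvent:
    """Hypothetical event shape: one commit seen as one log line."""
    sha: str
    author: str
    timestamp: datetime
    subject: str


def parse_git_log(text: str) -> List[CommitEvent]:
    """Turn formatted `git log` output into a list of commit events."""
    events = []
    for line in text.split("\n"):
        if not line.strip():
            continue
        sha, author, when, subject = line.split(FIELD_SEP, 3)
        events.append(CommitEvent(sha, author, datetime.fromisoformat(when), subject))
    return events


def read_commit_events(repo_path: str) -> List[CommitEvent]:
    """Run `git log` in the given repository and parse its output."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{GIT_FORMAT}"],
        capture_output=True, text=True, check=True,
    )
    return parse_git_log(result.stdout)


# Hypothetical sample output, so the parser can be tried without a repository.
sample = (
    "abc123\x1fmgasca\x1f2023-11-03T11:12:12+00:00\x1fAdd ADR for ingress gateway\n"
    "def456\x1frocketstack-matt\x1f2023-11-04T09:00:00+00:00\x1fUpdate data domain schema\n"
)
events = parse_git_log(sample)
```

Keeping parsing separate from the `git` invocation means the same event shape could be fed from another source, such as webhook payloads.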

I would suggest that the dual to Distributed Tracing in Code Observability is Merges/Pushes. These describe the flow of Code between Branches and Repositories over time.

Again, the types and shape of Code Observability Metrics depend on what Insights we want to derive and what Actions we want the Insights to Trigger. In fact, I believe that the set of Metrics we care about may be completely different for each specialization of "X as Code". For example, the Metrics for Architecture as Code may be different to those for Infrastructure as Code.

If we get Code Observability right, then we automatically get things like Architecture as Code Observability, Infrastructure as Code Observability, etc. Defining AasC Metrics for the Architecture version of DORA metrics will give us insight into the efficiencies (and bottlenecks) in our Architecture Practices as they relate to producing Architecture Artifacts. For reference, see this point in Andrew Harmel-Law's recorded talk A Commune in the Ivory Tower. He talks there about ADRs specifically, but I believe the same or similar metrics can be derived for other Architecture Artifacts to measure our efficiencies.
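As one possible shape for such a metric, here is a minimal sketch, assuming an Architecture analogue of DORA's lead time: the time between an ADR being proposed and being accepted. The `AdrRecord` type, its fields and the sample dates are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class AdrRecord:
    """Hypothetical ADR lifecycle record, e.g. derived from commit events."""
    adr_id: str
    proposed_at: datetime
    accepted_at: Optional[datetime]  # None while the decision is still open


def decision_lead_times(adrs: List[AdrRecord]) -> List[timedelta]:
    """Lead time per accepted ADR; open ADRs are skipped."""
    return [a.accepted_at - a.proposed_at for a in adrs if a.accepted_at is not None]


def median_lead_time_days(adrs: List[AdrRecord]) -> Optional[float]:
    """Median proposal-to-acceptance time in days, or None if nothing is accepted."""
    times = decision_lead_times(adrs)
    if not times:
        return None
    return median(t.total_seconds() for t in times) / 86400


# Hypothetical sample data for illustration only.
adrs = [
    AdrRecord("ADR-1", datetime(2023, 10, 1), datetime(2023, 10, 3)),
    AdrRecord("ADR-2", datetime(2023, 10, 5), datetime(2023, 10, 9)),
    AdrRecord("ADR-3", datetime(2023, 10, 10), datetime(2023, 10, 16)),
    AdrRecord("ADR-4", datetime(2023, 10, 20), None),
]
lead_time = median_lead_time_days(adrs)
```

The median is used here rather than the mean so that a single long-running decision does not dominate the figure; that choice is itself an assumption.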

Introduce Capability for Architecture as Code Observability

I believe we want to define and implement a Capability for Observability that builds on the realization of the Domain(s) of Observability. The Domains are how we describe these Observability Domains layered on top of the base AasC Schema. The Capability is where we define the Shape of the Metrics/Events we care about in the Observability Domains. Those Metrics/Events will feed into our calculations (Data Freshness, System Fitness Scores), feed into Monitors that implement Fitness Functions, and finally serve as Triggers for entry into other Capabilities like Drift Detection.

For example, in an idealized future: A Commit Event/Log that introduces/makes a change to an Infrastructure as Code Artifact is used by a Monitor to Trigger an Action (workflow) to re-calculate Drift Detection. As part of this workflow the new IasC Artifact is fed through a Transformer, which produces an AasC Artifact following the schema we define. This Transformer potentially also pulls in other linked AasC Artifacts, like ADRs that specify what infrastructure is used in between Ingress Gateways and Service Instances, which would allow the Transformer to know which Relationships and Entities to Remove/Compress or Add/Expand depending on which direction the Transformation is going in. Then this newly produced AasC Manifest is diff'ed against the existing committed AasC Manifest to determine and calculate the Degree of Drift. The workflow may also:
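As an illustration of the diff step in the workflow above, here is a minimal sketch, assuming a Manifest can be reduced to a set of node names and a set of (source, kind, target) relationships. The `Manifest` and `Drift` types are hypothetical, and using the Jaccard distance as the Degree of Drift is an assumption made for this sketch, not something the AasC schema defines.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Relationship = Tuple[str, str, str]  # (source, kind, target)


@dataclass(frozen=True)
class Manifest:
    """Hypothetical reduced view of an AasC Manifest."""
    nodes: FrozenSet[str]
    relationships: FrozenSet[Relationship]


@dataclass(frozen=True)
class Drift:
    added_nodes: FrozenSet[str]
    removed_nodes: FrozenSet[str]
    added_relationships: FrozenSet[Relationship]
    removed_relationships: FrozenSet[Relationship]


def diff_manifests(committed: Manifest, produced: Manifest) -> Drift:
    """Set differences between the committed and the newly produced Manifest."""
    return Drift(
        added_nodes=produced.nodes - committed.nodes,
        removed_nodes=committed.nodes - produced.nodes,
        added_relationships=produced.relationships - committed.relationships,
        removed_relationships=committed.relationships - produced.relationships,
    )


def degree_of_drift(committed: Manifest, produced: Manifest) -> float:
    """Assumed metric: Jaccard distance, 0.0 for identical and 1.0 for disjoint."""
    a = set(committed.nodes) | set(committed.relationships)
    b = set(produced.nodes) | set(produced.relationships)
    union = a | b
    if not union:
        return 0.0
    return len(a ^ b) / len(union)


# Hypothetical sample manifests for illustration only.
committed = Manifest(
    nodes=frozenset({"ingress", "service-a", "service-b"}),
    relationships=frozenset({("ingress", "http", "service-a"),
                             ("service-a", "grpc", "service-b")}),
)
produced = Manifest(
    nodes=committed.nodes | {"cache"},
    relationships=committed.relationships | {("service-a", "tcp", "cache")},
)
drift = diff_manifests(committed, produced)
score = degree_of_drift(committed, produced)
```

With these samples the produced Manifest adds one node and one relationship, so two of the seven distinct elements differ.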

mgasca commented 7 months ago

@ojeb2 I'm curious to get your feedback on what I've said about Data Observability and also about Observability in general feeding into other Capabilities.

rocketstack-matt commented 7 months ago

@mgasca thanks for the detailed write up . . . a question on data observability are you describing two things here?

  1. Where you refer to database logs, are you talking about observing the change of the state of data in the running system? If so isn't that a form of system observability?
  2. Where you're talking about schema / lineage / catalogue I assume we're talking about what we have as the Data Domain in the picture above?

mgasca commented 7 months ago

Regarding Data Observability vs System Observability: the distinction is all about what we are Observing. Data Observability is a Characteristic of our Data and is about understanding our Data over time. System Observability is a Characteristic of our Systems and is about understanding our Systems over time. And so on.

It may be possible to say that everything is just System Observability, but that loses a lot of nuance, and also loses understanding of the implications and potential impact of the Observability Characteristic in different Domains.

Understanding the state and flow of our Data over time is quite a bit different to generally understanding the state and flow of our Systems over time. The easiest example to give here is the one I gave above around "movement". In System Observability movement is captured by Distributed Tracing and is a signal that describes the flow of logic over time (even though it also includes flow of data and it's possible to capture details about the data flowing, that's not the primary driver). Distributed Tracing doesn't really make sense when describing how our Data moves. Data Lineage is exactly the signal that describes how our Data moves. Although I believe some say that Data Traceability is distinct from Data Lineage, I think that Data Lineage may subsume Data Traceability. Interested to hear thoughts from @ojeb2 on this.

What I'm suggesting around a distinct domain of Data Observability vs System Observability is not novel nor is it my idea. There is plenty of work and research on this. That said, Data Observability is less settled at this point than System Observability. Different sources talk about different pillars. Many talk about 5 pillars and they can be for instance:

Personally I disagree with that. I'm of the opinion that there are really 3 Pillars that are common across Domains:

Freshness and Volume are just specific Metrics. This is related to what I said in the first post about the shape of specific Metrics within the domain being dependent on the Insights we try and derive.

The reason I separate out Schema (or Catalog) from Data Observability is the same reason I separate out AasC/IasC from System Observability. The reason is one that I brought up during the London On-Site Accelerator, and which @ojeb2 agreed to at the time. Namely, by nature Observability signals are time-series, because the point of Observability is observing things over time. I suppose we could say that Schema/AasC/IasC changes over time, and that would be accurate. If that's the case, then I would suggest the 4 common pillars of Observability across all Domains are:

In the Data Observability Domain:

  - Structure => Schema and Catalog
  - Movement => Lineage and potentially Tracing (whether that's distinct or subsumed)
  - Specific Metrics => Freshness, Quality, (Data) Volume, etc.

In the System Observability Domain:

  - Structure => AasC, IasC
  - Movement => Distributed Tracing
  - Specific Metrics => SRE Golden Signals, etc.

Whether the events/logs are the same or different between Domains is, in my opinion, irrelevant. That question is too low level. What matters is what insights we are trying to derive and what we will do with them. On the one hand we care about Data Insights, on the other System Insights.

Hope that helps clarify, happy to chat more about it.

rocketstack-matt commented 7 months ago

Lineage is interesting. I tend to think of lineage as part of the cataloging, and therefore as Structure in the pillars. This is because, as part of the metadata that defines a dataset, we are able to capture where the data that populates a specific attribute came from (e.g. did the system produce the data attribute as part of its core functionality, or has it received data from another system which it is then either passing on unchanged or using in a calculation / transformation to produce a new derived data attribute).

In effect we're able to capture the 'movement' of the data between systems and the transformations applied to them as part of the structure.

mgasca commented 7 months ago

Yes, I hear what you are saying. The people that separate Data Lineage and Data Traceability would agree with that definition of Data Lineage. There are other people that say that Data Lineage is not (just) the intent but also the run-time operational reality. As an example, take a look at https://openlineage.io (which takes inspiration from OpenTelemetry in the System Observability space). Dataset there is Structure, Jobs are Intent of Movement (like you describe) and Runs are the runtime operational reality of Data Movement (i.e. the reality of the intent). That Run in OpenLineage is, for instance, what I refer to when I say Lineage is the dual to Distributed Tracing (a Trace composed of Spans in OpenTelemetry), not the static structural side (Dataset or Job).

Comparing this to Architecture as Code and Infrastructure as Code, they both describe intent of Movement, though IasC is also realizable. However, the actual operational reality of that intended flow (as opposed to the static structural definition) would be captured by both Distributed Tracing and the runtime aspects of Lineage (whether we still call that Lineage or we call it Data Traceability).

e.g.

AasC and IasC both say Service-A should be communicating via gRPC/OAuth to Service-B. What does Distributed Tracing say is actually happening at run-time and over time?

AasC and IasC alongside Data Catalog (and potentially including structural aspects of Lineage) say that Container-X ETLs from Store-Q to Store-P. Alternatively, AasC and IasC alongside Data Catalog (and potentially including structural aspects of Lineage) say that Job-X pulls from Store-Q and feeds AI-Model-Z at given intervals. What does run-time operational data (whether we call it Lineage or Data Traceability) say is actually happening in either of these cases?
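The questions in these examples can be framed as a comparison between declared and observed edges. Here is a minimal sketch, assuming both the AasC/IasC intent and the run-time signals (Spans or Lineage Runs) have already been reduced to (source, protocol, target) tuples. The function name, the tuple shape and the extra `Service-C` edge are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str, str]  # (source, protocol, target)


def compare_intent_to_runtime(
    declared: Iterable[Edge], observed: Iterable[Edge]
) -> Dict[str, Set[Edge]]:
    """Classify edges by whether the intent and the run-time signals agree."""
    declared_set, observed_set = set(declared), set(observed)
    return {
        "confirmed": declared_set & observed_set,    # declared and seen at run-time
        "undeclared": observed_set - declared_set,   # seen at run-time, never declared
        "unobserved": declared_set - observed_set,   # declared, never seen at run-time
    }


# Declared intent, e.g. taken from AasC/IasC artifacts.
declared = [("Service-A", "grpc", "Service-B")]
# Hypothetical edges derived from run-time traces.
observed = [
    ("Service-A", "grpc", "Service-B"),
    ("Service-A", "http", "Service-C"),
]
report = compare_intent_to_runtime(declared, observed)
```

An "unobserved" edge is not necessarily a problem, since the sampling window may simply have missed it, so treating it as drift would need a time-based threshold.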

mgasca commented 7 months ago

Another set of examples from across industry supporting the idea that Data Lineage includes the runtime operational reality of Data Movement is this post on the Atlan blog (Atlan is a Catalog and Lineage Store):

https://atlan.com/open-source-data-lineage-tools/

The way they describe Lineage requires that it includes the operational Movement metadata. All five of the open source Lineage tools they mention gather operational Movement signals.

Funnily enough, it seems that Pachyderm uses a version control system as the store. So Data Lineage as Code in a sense. They are using version control as the event store, as I mention is the case for general Code Observability. I also noted that if you get Code Observability right you get X as Code Observability for free. To a certain degree it seems this is the route they are taking.