Classification of block validation types

This proposal seeks to build upon the existing block validation system to start being more proactive in reducing the cases where the final user is faced with an invalid block type dialog. This is a proposal that involves manipulating data on behalf of the user so it needs both a solid definition system (in steps) to ground the level of inference the engine is willing to do to restore content integrity.

Block Validation Flow

Block validation, as defined here, comes after the initial parsing step which yields block types and checks whether a corresponding block type handler exists (meaning, that a corresponding block type is registered in the system). That phase is still objectively part of the validation process but is not the focus of this issue since there are no further feasible optimization to do there regarding content reconstruction. What this one focuses on is what happens when the block source is run through its save function, producing a new source.

// A block source is run through the `save` function of its `blockType`
blockType.save( source ) => newSource;

// The resulting operation is classified for every block given the following outcomes
block.isValid: Number;

The proposal here focuses on classifying the outcomes of the above operation by logically grouping the possible scenarios in a decreasing order of certainty over content integrity.

ValidBlock: 0            // idempotent operation of `save(source) => source`.

MigratedBlock: 1         // source is matched sequentially with defined deprecations
                         // until it produces a match.

PreservedSource: 2       // `newSource` produces equivalent `innerHtml` even if
                         //  comment attributes differ and becomes idempotent
                         // after first reconciliation.

ReconstructedSource: 3   // `newSource` contains the same attributes as `source`
                         // (attribute integrity), including non-empty sourced
                         // attributes, while `innerHtml` is allowed to be rebuilt.

RawTransformedSource: 4  // source is passed to raw handling functions and it yields
                         // the same block type.

InvalidBlock: 5          // the block could not be safely restored, need user input

0: ValidBlock

This operation is run for all blocks and is the cheapest mechanism to ensure integrity. It has to be optimized for speed. This has already served us super well as a quick heuristic.

1: MigratedBlock

When the basic match operation in level 0 fails, the first path is to check whether deprecated shapes exist for the block. If they do, we run the source against each of those sequentially until we find a match. This is also the step in which source can be migrated to newer sources. It has been in place for a long time and has allowed transparently upgrading the shape of a blocks numerous times.

2: PreservedSource

Starting at this level the block validation mechanism has been unable to reconcile the source with the output, which means there's a problem it doesn't have explicit instructions on how to handle. From here on there's potential of some data loss and is the main point of discussion.

For this level 2 we'd suspend the integrity of comment attributes in favor of the integrity of the source, provided the innerHtml of source and the innerHtml of the output match. It's also important that we achieve idempotency immediately after reconciling the comment attributes.

// Example
// - source
<!-- wp:heading {"level":3} -->
<h2>Testing Header</h2>
<!-- /wp:heading -->

// - output
<!-- wp:heading -->
<h2>Testing Header</h2>
<!-- /wp:heading -->

// inner html matches both instances

This level can be seen as "clean up spurious comment attributes malformations". Worth pointing out that this relies on the fact the block author has made a choice for us regarding what matters for sourcing the content attributes, since it is prioritizing the h2 content tag and not the comment attribute "level".

3: ReconstructedSource

In this variant, the html comments coincide but the inner html doesn't. If there is attribute integrity in oldSource and newSource (and sourced attributes are not empty) we let the block output be overwritten as the result of computing save again. This is an indication the block author leans on the comment attribute as the source of truth rather than the html source, and we honor that decision.

// Example
// - source
<!-- wp:heading {"level":6,"textColor":"pale-pink"} -->
<h6>Testing Header</h6>
<!-- /wp:heading -->

// - output
<!-- wp:heading {"level":6,"textColor":"pale-pink"} -->
<h6 class="has-pale-pink-color has-text-color">Testing Header</h6>
<!-- /wp:heading -->

// restores generated class names

This step can be seen as an "implicit migration" for indefinite forms, in contrast with those defined in deprecations.

4: RawTransformedSource

At this level we have lost confidence in the block functions being able to reconstruct a source so we hand over the operation to the raw handling mechanism (this is currently exposed to users in the "convert to blocks" action on invalid block types).

What we seek here is to ensure the resulting blockType.name is actually a match between source and output. This is the most aggressive manipulation since it doesn't rely directly on specific instructions from a block's save method but it can employ its raw transformations (as if content was being pasted or was part of a freeform block "convert to blocks" operation).

// Example : where it succeeds restoring block type
// - source
<!-- wp:heading -->
<h2>Testing Header</p>
<!-- /wp:heading -->

// - output
<!-- wp:heading -->
<h2>Testing Header<p></p></h2>
<!-- /wp:heading -->

And here's an example where it currently fails to restore the block type.

// Example : where it fails restoring block type
// - source
<!-- wp:heading -->
<span>Testing Header</h2>
<!-- /wp:heading -->

// - output
<!-- wp:paragraph -->
<p><span>Testing Header</span></p>
<!-- /wp:paragraph -->

As noted, in the last example the wp:heading is not preserved so we would not consider it passes level 4.

User Experience

If we keep the type of invalidation that occurred stored in the block object, it would allow us to build UI and behaviour around it. For example, we might choose to show somewhere in the interface (block inspector or block toolbar) that a transformation above level 1 has taken place, so the user could review it if they see something amiss and take a different choice than what the system has done.

There's obviously also room to work upon the save( source ) => newSource operation to discard or omit certain elements we might consider less strict (like class names or other html attributes in the wrapper elements) which would cascade through all the levels and likely prevent going from one level to another for some cases.

Interested in hearing your thoughts on this. Do the classifications make sense? Is the order what you'd expect? Is there another vector we might choose to classify as well?

There might be a chance to have another classification after level 3, for example, where we attempt the reconstruction of both the comment and the inner html provided the block type name remains the same before passing on to pure raw handling mechanisms.

cc @aduth @mcsf @youknowriad @dmsnell

Interesting here :) what's unclear to me is:

would we allow the block author to define its preferred level of deprecation?
if a block author chooses 3, does that mean we run 0 -> 1 -> 2 -> 3 and the sooner we pass the validation, we stop?

My inclination would be to not expose this directly but only indirectly: adding deprecations is an indirect way of influencing what is valid / invalid; adding transformations / migrations is a way of opting in into raw handling scenarios, etc.

Thanks @mtias for sharing! I love the importance of handling "invalid blocks"

One immediate thought that sticks out is that we might want to reverse the numbering or change the names. Regardless of concepts here the proposal sets a valid block to isValid: 0 which would lead to confusing code like this and I'm sure a slurry of bugs due to JS's weak typing being happy to compare numbers with booleans…

if ( block.isValid ) {
    // block is invalid?
}

A reframing could be a more passive property of the block:

if ( block.decay ) {
    // something was "healed" in this block
}

As a classification system I like this delineation - it's helpful for understanding how to handle the different types of blocks. I believe we still have some cases missing though:

1.5. Extra attributes: The innerHTML and the comment attributes are the same except we have additional attributes in the comments. Maybe this is leftover from a block migration or maybe it was intentional. It would be nice to let the block operate as usual but remember those additional attributes and preserve them on save()
2.5. Extra styles: The innerHTML matches except for additional classes and ids. Adding extra classes to the raw HTML seems like such a common operation that it would warrant an exception. The block can't reasonably track all those extra styles, but we could present a choice.
1. The external edits that return after every save to the database. These are the worst kind of conflict because there is no resolution. Something in WordPress or another backend is mangling the content outside of our control and every time we fix it we find that it's broken on the next load of the post in the editor.

In most of these cases I feel like the bigger value is in handling this as an editor framework rather than in exposing it to the blocks, though I see not problem in exposing it.

If we could show two panels next to each other as a before/after and let the author choose which to keep then we could use these classifications to give the context necessary to help decide.

1. PreservedSource It looks like edits were made to this post outside of the block editor but it's not clear if those were intentional. Review the changes and choose if you want to keep what the editor is producing or if you want the editor to try and adopt the HTML changes.
1. ReconstructedSource Some meta information about this block is missing. If the block reconstructs the HTML from its data the output will be different than what's currently stored in the post. This could be due to a version update or some broken integration on the backend. Do you want to reconstruct the HTML the way the block is attempting to do so or preserve the glitch. If you preserve it you won't be able to edit the block.
1. RawTransformedSource The block specifies that it's of a given type but the editor cannot meaningfully interact with it as saved in the post. You may attempt to split it into its constituent parts as separate blocks, convert it into an HTML block, or preserve the glitch. If you preserve it you won't be able to edit the block.

I like the general idea here. I think it gets most interesting at what's described as Stage 3 and above, in having a systematic way to categorize the extent of invalidation in a way that can be used to present the user with options. I seem to recall at some point in the past there was some experimentation in the idea of considering block attribute values for validation, disregarding whatever the HTML might be. It clearly has the potential for being destructive to markup, which is why it wasn't ever implemented. But if it's something which can be presented as an option to the user, it might be more appetizing to consider.

At least coming from my own understanding of the background of the current validation procedure, I have some confusion or concern for what is considered as proposal of Stages 0 through 2.

Stage 0 is pretty obvious. And yet, despite that, it might actually not be implemented today! The block validation currently goes through a "equivalence" procedure to see if markup is same with some tolerances (whitespace, etc) (source). In retrospect, it seems clear that there should be a shortcut here for identical markup.
In general, I don't know how this "equivalence" logic is considered as part of this proposal, or if it's viewed as interchangeable with the "basic match" of Stage 0. It's worth noting that this equivalence logic handles quite a bit of normalization, even in cases of rather extreme differences of the markup (example).
- Related to this point, if it's considered separate from Stage 0 basic match, at what time in this process should it be considered? Would we want "MigratedBlock" to be evaluated as matching the source, or as equivalent to the source.
Stage 3 PreservedSource isn't really a consideration today. The markup of comment attributes in completely disregarded in evaluating blocks to be valid or invalid. Today, only the inner markup is considered. Even if attributes were completely different, if the markup is the same, the block is treated as valid. It's not clear to me if we'd want to introduce such invalidations based on difference in the comment syntax, or if it's something we should continue to ignore.

With regards to the original topic #7604, one of the things I've been thinking about over time is whether the framework can be doing more to be aware of differences in a way which they can be reapplied to resulting markup without intervention. This already exists to an extent with class names ("custom class name"). If a class attribute is detected in the markup of the top-level element of a block which is not expected, it is "held" and then reapplied to the resulting save. Without this, the additional class would be considered as invalidating. I've contemplated whether this could be expanded to behave the same way for additional attributes, even without exposing it to the user directly.

My hunch is that a lot of the invalidations we'll see are these sorts of cases where, for example, a plugin had decided to add a hook which modifies markup to add an attribute, or the user manually edits the markup to add an attribute.

Is this something you see as being within the scope of this issue? If so, where would it fit?

In retrospect, it seems clear that there should be a shortcut here for identical markup.

Agreed.

In general, I don't know how this "equivalence" logic is considered as part of this proposal, or if it's viewed as interchangeable with the "basic match" of Stage 0

Good point. It assumes we can do equivalency as isEqual || isEquivalent considering the first one as the fastest and the second as more expensive throughout the stages as aplicable.

It's not clear to me if we'd want to introduce such invalidations based on difference in the comment syntax, or if it's something we should continue to ignore.

Stage 3 is worthwhile to me (the example is a good illustration) in that it can possibly relax some of the requirements for duality in attributes that have expression in the markup. Perhaps worth clarifying that I don't envision it as working with the markup of the comment but with attribute consistency for non-sourced attributes. There are blocks where the inner HTML (or most of it) is just a one way reconstruction from comment attributes and it's unnecessary to attempt to normalize HTML when it can be left to be rebuilt.

Is this something you see as being within the scope of this issue? If so, where would it fit?

I think it's orthogonal to this issue in that I include it conceptually in "normalization" (being flexible about certain specific attributes like classes) but I worry about adding too many ad hoc exclusions without a solid mapping of the full flow, since it's going to be a bit more fragile to handle by definition.

In retrospect, it seems clear that there should be a shortcut here for identical markup.

Agreed.

See #22506

One immediate thought that sticks out is that we might want to reverse the numbering or change the names.

@dmsnell Yes, for sure, or just flip 0: Invalid Block and 5: Valid Block determinations. For presentation it seemed better to show the increasing progression, though.

1.5. Extra attributes

This was meant to be covered in the example in 2, where there is an extra attribute that disappears after reconciliation. Do you think it wouldn't be enough?

2.5. Extra styles

@aduth hints at it in his response but I consider attributes we are fine with preserving even if they don't originate from the edit / save functions of a block as an internal part of our equality, normalization, and serialization steps.

In most of these cases I feel like the bigger value is in handling this as an editor framework rather than in exposing it to the blocks

Indeed. I'd say that for this to approach any sense of robustness it should not be exposed as a public API. A block author should interface with more expressive APIs (deprecations, transforms, save, etc) rather than the internal heuristics.

Also for extra clarity I'm not anticipating any user facing decisions yet. I'd like to establish the system first and see what consequences it produces before burdening users with making a decision.

For presentation it seemed better to show the increasing progression, though.

Yeah I saw the tension, but practically in the code I think that having something named in the positive meaning it's okay but having a falsely value would be confusing at best. Both cases it's a presentation problem.

covered in the example in 2, where there is an extra attribute that disappears after reconciliation. Do you think it wouldn't be enough?

@aduth fleshed out my point better for me. these aren't "invalid" attributes - they're simply attributes the block is unaware of but which some other plugin might be operating on. it would be very convenient if we add these and have the editor keep them around - "no harm, no foul"

one situation that came to my mind was adding something like loadClientScript which could allow blocks to specify client-side JS they need to load so that we can build JS-only blocks. I recognize the problems inherent in this example but the same idea could apply to styles also. suppose I want to make a plugin operating over multiple blocks and block types - if you add additionalStyles to the attributes I'll mark them on page render.

the difference between this and extra CSS styles is that these would be consumed by some filtering function in the PHP/backend whereas styles are just that - CSS

1.5. Extra attributes and 2.5. Extra styles

In GB-mobile perspective these are something that we really need because the mobile version can be behind in terms of versions of blocks, or we can be connected to a server that installed extra plugins that change core blocks ( for example Coblocks) the ability to be able to parse and accept those blocks will be a great addition to the mobile apps.

I love the direction this is taking in terms of providing a clear communication mechanism for varying levels of success/fail in the block validation process. The levels make sense to me and on the surface it looks like it offers some logical presentation points for the author to take. Presumably the ui/ux presented to authors could also generate varying degrees of action needed depending on the resulting level (Example a "fyi this happened and you can adjust if you want: click link" vs "This block needs your action before it can be rendered").

I agree with @dmsnell 's feedback here about the numbering, although that likely can be mitigated somewhat by using "constant" names for the levels instead.

The only other thing I have to add to the discussion here is whether there's any value in considering if plugins can inject their own invalidation handling if needed? If this was considered, we'd probably want to limit it to when a block reaches level five, there is mechanism for plugins to hook in and offer their own validation handling at that point. An advantage of this mechanism is it could provide an escape hatch for various edgecases plugins might have where the existing validation logic simply doesn't work. Presumably, if in the future GB was to enhance validation further (additional levels, or more intelligent logic), it also would provide a way for plugins to roll out supporting that logic earlier in the case where they still need to support earlier WP versions without it.

I'm adding to this issue the problem posed by block patterns when the blocks they contain conflict with the runtime (either the running Gutenberg is outdated, or the block pattern is): https://github.com/WordPress/wordpress-develop/pull/1483#issuecomment-877982616. This might just be solved with pattern versioning, but it may inform our RFC.

Just wanted to follow up on WordPress/wordpress-development#1483. I've elected not to include the pattern updates in the bundled theme release accompanying WordPress 5.8. This will leave console.info() notices, but it will prevent block validation problems in a number of theme and WP version combinations. See my comment on the ticket for full details.

Having a way to version patterns, an API that generates the appropriate block markup for a version of WordPress based on a set of data, or validation that's more "safe" with old block syntax would solve this issue. But as it is now it's really difficult to update patterns in Core while continuing to support multiple versions of WordPress.

WordPress / gutenberg

Classification of block validation types #21703

Block Validation Flow

User Experience