ckeditor / ckeditor5

Powerful rich text editor framework with a modular architecture, modern integrations, and features like collaborative editing.
https://ckeditor.com/ckeditor-5

The Content Normalizers API #2497

Closed msamsel closed 1 year ago

msamsel commented 5 years ago

Description

During work on paste from Google Docs, the idea of Content Normalizers emerged. All further concepts relate to engine#view and to input data from the clipboard. By input data, understand the data obtained from the `inputTransformation` event.

The Content Normalizer is a class which provides an object instance. This instance is dedicated to transforming input data coming from a specific type of source: for example MS Word, Google Docs, Online Office, etc. The transformation is defined with filters, which are registered for each normalizer separately. One filter might be used in multiple normalizers. A Content Normalizer contains the logic which decides whether its filters should be applied to given input data.

Examples of filter purposes: normalizing `font-weight: 700` to `bold`, replacing a CSS class with an attribute, or rewriting the entire MS Word input (all shown in the examples below).

Content Normalizers transform the view on its own layer and might be similar to the concept of post-fixers. The idea is to have a mechanism, easy to maintain and extend, for processing pasted data from different sources. Its role is to transform quirks from different office-like applications into a standard shape, which will later be nicely upcast to the model.
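To make the shape of the proposal more concrete, here is a minimal skeleton of such a class. It only assumes the names used in the examples that follow (`activationTrigger`, `addFilter`); the `applyTo()` method and all internals are illustrative assumptions, not settled API.

```js
// A minimal skeleton of the proposed class. `applyTo()` and the internals
// are illustrative assumptions; only `activationTrigger` and `addFilter`
// come from the examples below.
class ContentNormalizer {
	constructor( { activationTrigger } ) {
		this._activationTrigger = activationTrigger;
		this._filters = new Set();
	}

	addFilter( filter ) {
		this._filters.add( filter );
	}

	// Called with the data of the `inputTransformation` event.
	applyTo( data ) {
		const htmlString = data.dataTransfer.getData( 'text/html' );

		if ( this._activationTrigger( htmlString ) ) {
			// Transform `data.content` with the registered filters here.
		}
	}
}
```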

Examples

Below you can find a few examples of how the filter definition could look. First, an `activationTrigger` is defined, which determines for what type of content the normalizer should be applied. Then some filters are registered for the given normalizer. Later, when the `inputTransformation` event is processed by PFO (paste-from-office), the input data is passed to each normalizer. If the input data fulfils the condition of the activation trigger, the normalizer is activated and the data is transformed with its filters.

1. Filters with `check` - the filter object contains 2 properties: `check` and `exec`.

```js
const normalizer = new ContentNormalizer( {
	activationTrigger: contentString => /id=("|')docs-internal-guid-[-0-9a-f]+("|')/.test( contentString )
} );

normalizer.addFilter( {
	check: 'span',
	exec: ( writer, node ) => {
		if ( node.getStyle( 'font-weight' ) === '700' ) {
			writer.setStyle( 'font-weight', 'bold', node );
		}
	}
} );

normalizer.addFilter( {
	check: node => node.is( 'element' ) && node.hasClass( 'foo' ),
	exec: ( writer, node ) => {
		writer.removeClass( 'foo', node );
		writer.setAttribute( 'foo', 'bar', node );
	}
} );
```

2. Filters without `check` - the filter object has only 1 property, `exec`, which should return early for invalid elements.
```js
const normalizer = new ContentNormalizer( {
    activationTrigger: contentString => /id=("|')docs-internal-guid-[-0-9a-f]+("|')/.test( contentString )
} );

normalizer.addFilter( {
    exec: ( writer, node ) => {
        if ( node.is( 'element:span' ) && node.getStyle( 'font-weight' ) === '700' ) {
            writer.setStyle( 'font-weight', 'bold', node );
        }
    }
} );

normalizer.addFilter( {
    exec: ( writer, node ) => {
        if ( !node.is( 'element' ) || !node.hasClass( 'foo' ) ) {
            return;
        }

        writer.removeClass( 'foo', node );
        writer.setAttribute( 'foo', 'bar', node );
    }
} );
```

3. Filters as a callback function - similar to point 2, but without the object wrapper.

```js
const normalizer = new ContentNormalizer( {
	activationTrigger: contentString => /id=("|')docs-internal-guid-[-0-9a-f]+("|')/.test( contentString )
} );

normalizer.addFilter( ( writer, node ) => {
	if ( node.is( 'element:span' ) && node.getStyle( 'font-weight' ) === '700' ) {
		writer.setStyle( 'font-weight', 'bold', node );
	}
} );

normalizer.addFilter( ( writer, node ) => {
	if ( !node.is( 'element' ) || !node.hasClass( 'foo' ) ) {
		return;
	}

	writer.removeClass( 'foo', node );
	writer.setAttribute( 'foo', 'bar', node );
} );
```

4. Filter for MS Word - which overwrites the entire `data.content`. Here the same `addFilter` method is used as in the three previous points. However, the object contains a flag indicating that the filter should be applied in a different way.
```js
const normalizer = new ContentNormalizer( {
    activationTrigger: contentString =>
        /<meta\s*name="?generator"?\s*content="?microsoft\s*word\s*\d+"?\/?>/i.test( contentString ) ||
        /xmlns:o="urn:schemas-microsoft-com/i.test( contentString )
} );

normalizer.addFilter( {
    type: 'fullContent',
    exec: data => { // <--- here is passed entire input data object
        const html = data.dataTransfer.getData( 'text/html' );

        data.content = _normalizeWordInput( html, data.dataTransfer );
    }
} );

return normalizer;
```

5. Filter for MS Word with a different method name, passed as a callback. Here a dedicated method of the normalizer API, `addFullContentFilter()`, is used.

```js
const normalizer = new ContentNormalizer( {
	activationTrigger: contentString =>
		/<meta\s*name="?generator"?\s*content="?microsoft\s*word\s*\d+"?\/?>/i.test( contentString ) ||
		/xmlns:o="urn:schemas-microsoft-com/i.test( contentString )
} );

normalizer.addFullContentFilter( data => { // <--- here is passed the entire input data object
	const html = data.dataTransfer.getData( 'text/html' );

	data.content = _normalizeWordInput( html, data.dataTransfer );
} );

return normalizer;
```
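For completeness, the dispatching step inside PFO could then look roughly like this. This is only a sketch written inside a hypothetical plugin; the `normalizers` collection and `applyTo()` come from my skeleton above, not from the actual plugin code.

```js
// A rough sketch of the dispatch step, reusing the hypothetical `applyTo()`
// from the skeleton above. PFO already listens to `inputTransformation`.
const clipboard = editor.plugins.get( 'Clipboard' );

this.listenTo( clipboard, 'inputTransformation', ( evt, data ) => {
	for ( const normalizer of normalizers ) {
		normalizer.applyTo( data ); // A no-op when the activation trigger does not match.
	}
} );
```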


Questions

As I start to implement some POC, I've encountered a few things requiring a discussion.

1. Structure of a filter.
How should the filter look? Should it be an object? If so, should there be some `check` for the filter? What to do with MS Word - right now it requires the entire `data` object, and rewriting it would be quite a big refactoring task. Should it share the `filter` API, or have a special type of filter which takes `data` as an argument? (One way to accept both object and callback shapes is sketched below.)
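For the object-vs-callback part, accepting both shapes would not require much code; `addFilter()` could normalize a bare callback into the object form. A sketch, assuming the `{ check, exec }` shape from the examples:

```js
// Hypothetical sketch: normalize a bare callback into the object form, so
// the rest of the normalizer only ever deals with `{ check, exec }` objects.
addFilter( filter ) {
	if ( typeof filter == 'function' ) {
		filter = { exec: filter };
	}

	this._filters.add( filter );
}
```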

2. Filter order and priority.
At first I tried to apply all filters at once to a single element; however, this quickly showed the drawbacks of such an approach. If a filter modifies the DocumentFragment structure, then other filters might skip those elements. For example, if there is some kind of unwrap method which moves an element's children to the element's parent, then elements moved before it won't be processed again, as those will be located in the already processed structure. That is why a safer approach could be to apply each filter separately to the entire input data recursively.
An example of how it could look:
```js
// this._filters <-- a set which stores the filters
// this.data <-- keeps a reference to the data from the inputTransformation event

_applyFilters() {
	if ( !this._filters.size ) {
		return;
	}

	const writer = new UpcastWriter();

	// Apply each filter separately over the entire input data.
	for ( const filter of this._filters ) {
		this._transformNodes( writer, filter, this.data.content );
	}
}

_transformNodes( writer, filter, node ) {
	// Assumes the object filter shape ( see question 1 above ). A filter
	// without `check` is applied to every node.
	const matches = !filter.check ||
		( typeof filter.check == 'function' ? filter.check( node ) : node.is( filter.check ) );

	if ( matches ) {
		filter.exec( writer, node );
	}

	if ( ( node.is( 'element' ) || node.is( 'documentFragment' ) ) && node.childCount ) {
		for ( const childNode of node.getChildren() ) {
			this._transformNodes( writer, filter, childNode );
		}
	}
}
```
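Regarding the priority part of this question: one option, purely a sketch and assuming `_filters` were an array rather than a set, would be an optional numeric `priority` on the filter object:

```js
// Hypothetical: keep filters ordered by an optional `priority` ( lower runs
// first, default 10 ), so "filter A must run before filter B" is explicit.
addFilter( filter ) {
	this._filters.push( filter );
	this._filters.sort( ( a, b ) => ( a.priority ?? 10 ) - ( b.priority ?? 10 ) );
}
```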

All thoughts about those Content Normalizers are welcome.

pjasiun commented 5 years ago

For me, it looks like you created a special class to write some logic using a special syntax. I am not sure if it simplifies the code or complicates it even more. check is just an "if", exec is just a "then", and addFilter does not do much more than executing one part after another. Also, I do not know how activationTrigger is supposed to work on the view structure, which is a tree.

I think that instead of creating an uber-tool to define the whole logic, you should create a bunch of helpers to simplify the parts of the logic which repeat or are not easy to read.
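My hedged reading of this alternative - plain functions composed with ordinary calls and ifs, no framework. All names here are hypothetical; the `<b style="font-weight:normal">` wrapper is the known Google Docs quirk mentioned in the linked PR.

```js
// A sketch of the "bunch of helpers" alternative: a plain function which
// gets the view structure and an UpcastWriter and normalizes in place.
function removeGoogleDocsBoldWrapper( documentFragment, writer ) {
	// Google Docs wraps the whole pasted content in <b style="font-weight:normal">.
	for ( const node of [ ...documentFragment.getChildren() ] ) {
		if ( node.is( 'element:b' ) && node.getStyle( 'font-weight' ) === 'normal' ) {
			writer.unwrapElement( node ); // Drop the wrapper, keep its children.
		}
	}
}
```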

You compare it to postFixers, but note that the only reason why we have postFixers is to restart all fixes defined by different plugins whenever one of them changes something. We would use simple events if that were not the case.

Can you link the code you want to simplify using this tool?

jodator commented 5 years ago

@pjasiun So the base idea was here: https://github.com/ckeditor/ckeditor5-paste-from-office/pull/62#pullrequestreview-263679141 - to group the logic for different office suites / programs in separate classes (Normalizers). Those pack two things:

  1. Check for whether the logic should be run for this clipboard input (so activationTrigger())
  2. The logic itself (Normalizer.exec()).

Now on top of it, if I get this right, @msamsel is proposing to rewrite the logic behind content normalization itself, as currently you cannot reuse logic from Word input processing in Google Docs processing.

Now how to re-use this logic is a topic here.

scofalik commented 5 years ago

A few thoughts from my side.

First of all, I don't feel that some sugar syntax is necessarily wrong. I am fine with defining those filters this or a similar way. But maybe having new classes is unnecessary. As PJ mentioned, this might be a helper function that hooks a callback to something.

Actually, PJ, you mentioned the original idea behind post-fixers, and this might be the case here too. After you normalize some content, it might turn out that it needs / can be further normalized. This is just an idea without an example, though.

One thing I simply don't understand is why we would need activationTrigger. This seems like an unneeded complication. When you have a view element that is incorrect in some way, it doesn't really matter whether you got it from Google Docs or MS Word or whatever, right? Can you post an example where it matters?

Some of the examples are also things that we wanted and planned to solve in a different way. For example, converting the view font-weight:700 to bold was supposed to be done through view-to-model conversion, by specifying appropriate converters. Of course, it could also be done through view normalization. I could even agree that view normalization might be clearer / more understandable to somebody than having multiple views converting to the same model values. But for those examples, it adds an additional step to something we already had a different idea for.
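For reference, the conversion-based route mentioned here would look roughly like this. A sketch only - the bold feature ships its own converters; this merely illustrates the shape of the alternative, registered e.g. in a plugin's init().

```js
// A rough sketch of handling `font-weight: 700` during upcast instead of
// during view normalization: such a span maps to the `bold` model attribute.
editor.conversion.for( 'upcast' ).elementToAttribute( {
	view: {
		name: 'span',
		styles: {
			'font-weight': '700'
		}
	},
	model: 'bold'
} );
```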

I'd maybe still see some usage for this, though. Things like expanding some CSS styles:

`margin: 10px` to `margin-left: 10px; margin-right: 10px; margin-top: 10px; margin-bottom: 10px`. This is something that MAYBE would be easier to do here than in conversion.
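As a sketch of that idea - a hypothetical filter following the `( writer, node )` signature from the examples above, simplified to the single-value shorthand:

```js
// Hypothetical filter expanding the `margin` shorthand into longhand
// properties. Simplified: handles only the single-value form ( `margin: 10px` ).
function expandMarginShorthand( writer, node ) {
	const margin = node.getStyle( 'margin' );

	if ( !margin || margin.includes( ' ' ) ) {
		return; // No shorthand or a multi-value form - out of scope here.
	}

	for ( const side of [ 'top', 'right', 'bottom', 'left' ] ) {
		writer.setStyle( `margin-${ side }`, margin, node );
	}

	writer.removeStyle( 'margin', node );
}
```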

scofalik commented 5 years ago

> Now on top of it, if I get this right, @msamsel is proposing to rewrite the logic behind content normalization itself, as currently you cannot reuse logic from Word input processing in Google Docs processing.

This normalization might make sense for changing things like `<span class="mso-list ... ">`, but still, is there any conflicting code between Google/MS pastes, so that you need to keep them separated?

pjasiun commented 5 years ago

Still, it can be just a function which gets the view structure and normalizes it.

jodator commented 5 years ago

> is there any conflicting code between Google/MS pastes, so that you need to keep them separated?

I think that we want to know which filters to use for each content type. I'd rather not run all the things for all content types. At the moment, PFO only runs for Word input (so this activationTrigger() is already there; right now it is a basic programming construct - you know, if ;) (sorry for an inside joke)).

As for having two classes, one that normalizes Word and one that normalizes Google Docs input: I think that is OK. I might be wrong - after we check which filters are used by both Word and Google Docs - but at the beginning, why not have a clear view of which helpers (filters) are used in each of them? Also, normalizing input from one tool might be less consuming than from another, so why filter for everything if we don't have to?

@msamsel: after some talks with @pjasiun about the potential problems with fixing content, it looks like it might be too early to propose a general API for filters. As each filter must operate on the whole view - considering structural changes, moving things around, checking parents/siblings - I think that the filter methods should receive the current DocumentFragment and modify it using UpcastWriter. The third parameter might be the whole dataTransfer object (as some transformations might use it).
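A filter written against that contract could look like this. Everything here is a hypothetical example of the `( documentFragment, writer, dataTransfer )` signature; the function name and the attribute it sets are made up, and only the `text/rtf` clipboard flavor is a real thing.

```js
// Hypothetical filter following the proposed contract: it receives the whole
// document fragment, modifies it with the UpcastWriter, and may consult the
// dataTransfer ( here: skip all work when there is no RTF flavor at all ).
export function markRtfImages( documentFragment, writer, dataTransfer ) {
	if ( !dataTransfer.getData( 'text/rtf' ) ) {
		return;
	}

	markIn( documentFragment );

	function markIn( node ) {
		if ( node.is( 'element:img' ) ) {
			writer.setAttribute( 'data-from-rtf', 'true', node );
		}

		if ( node.is( 'element' ) || node.is( 'documentFragment' ) ) {
			for ( const child of node.getChildren() ) {
				markIn( child );
			}
		}
	}
}
```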

At the moment we cannot tell if any future paste-from-something will not have conflicting code so I'd keep them separated.

ATM, if the conversation doesn't go in another direction, I would have two simple classes that import common filters as needed. The filters should probably have a similar (same) definition to make them reusable.

```js
import removeBold from './filters/googledocs/removebold.js';
import { saveRTFImages } from './filters/images.js';
import fixLists from './filters/fixlists.js';
import foo from './filters/foo.js';
import bar from './filters/bar.js';

/**
 * @implements module:paste-from-office/pastefromoffice~Normalizer
 */
class GoogleDocsNormalizer {
	activationTrigger( html ) {
		return html.includes( 'foo' );
	}

	execute( documentFragment, writer, dataTransfer ) {
		removeBold( documentFragment, writer );
		saveRTFImages( documentFragment, writer, dataTransfer );
		fixLists( documentFragment, writer );

		// Some filters might be so simple that you can call them in a loop (I think :) )

		// ... get nodes
		for ( const node of nodes ) {
			foo( node, writer );
			bar( node, writer );
		}
	}
}
```

So it seems that each filter might have the same code duplicated (e.g. the iteration part), but I think that we're fine with this right now, as it is safer than the post-fixer approach that reruns all post-fixers whenever some post-fixer returns true. Here each filter would iterate over the document fragment once.
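That duplicated iteration bit could itself become a shared helper. A hypothetical sketch (the name and module are made up), which individual filters could import instead of each reimplementing the tree walk:

```js
// A hypothetical shared helper: visits every node under the given root
// ( element or document fragment ) and calls the callback on each of them.
export function walkFragment( node, callback ) {
	callback( node );

	if ( ( node.is( 'element' ) || node.is( 'documentFragment' ) ) && node.childCount ) {
		// Copy the child list so callbacks may restructure the tree safely.
		for ( const child of [ ...node.getChildren() ] ) {
			walkFragment( child, callback );
		}
	}
}

// Usage inside a normalizer's execute():
// walkFragment( documentFragment, node => foo( node, writer ) );
```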

Reinmar commented 5 years ago

I agree with @pjasiun. If this tool doesn't do anything more than being a big if ( check() ) exec() runner, then it feels like unnecessary structure for something that will benefit from less structure. No structure means no limitations on how filters are applied. This means you can choose the best solution for the job.

Therefore, a set of (somehow standardised?) helper functions executed one by one, just like @jodator showed in the last comment, sounds like a better plan (at least for now).

The only reason that I see for developing such a filter executor would be performance. Having every helper function do the whole tree traversal is far from optimal. Here, a filter executor could help by optimizing the places where it applies its filters. However, @msamsel, you identified the biggest problem with such a tool:

> For example, if there is some kind of unwrap method which moves an element's children to the element's parent, then elements moved before it won't be processed again, as those will be located in the already processed structure. That is why a safer approach could be to apply each filter separately to the entire input data recursively.

That's exactly the issue with CKEditor 4's HTML processor. It comes with unclear limitations on what you can do in its filters, and when you cross those limitations, suddenly some of your filters will not be applied.

The other problem was indeed related to priorities. When you have 20+ filters, as in the old PFW implementation, this starts to play a role. At some point, limited by the filter executor's framework, you may start creating relations between those filters. If one filter expects some other filter to be executed beforehand, you start prioritising them, and it makes the whole thing even more complex.

CKEditorBot commented 1 year ago

There has been no activity on this issue for the past year. We've marked it as stale and will close it in 30 days. We understand it may be relevant, so if you're interested in the solution, leave a comment or reaction under this issue.

CKEditorBot commented 1 year ago

We've closed your issue due to inactivity over the last year. We understand that the issue may still be relevant. If so, feel free to open a new one (and link this issue to it).