Is the "<" or the ">" character always escaped in the database?

Zodiac1978 commented 5 months ago

Hey @mcsf - thanks again for our great chat at WordCamp Porto last weekend!

I've just tried to insert ">" in the alt attribute of the image block (WordPress 6.5.3) and it looks like the ">" is encoded correctly as ">", but the "<" is indeed not modified.

Even worse, in the caption both are not modified ...

This is from the database:

<!-- wp:image {"id":12,"sizeSlug":"full","linkDestination":"none"} -->
<figure class="wp-block-image size-full"><img src="http://localhost:10139/wp-content/uploads/2024/05/example.jpg" alt="Alt Attribute with &gt; or < what happens?" class="wp-image-12"/><figcaption class="wp-element-caption">Caption < test ></figcaption></figure>
<!-- /wp:image -->

It looks like that the unmodified "<" is not breaking the regex, but the second issue, the unmodified "<" is preventing the search for "test" in my case above: https://regex101.com/r/3moGTD/1

In my test case, I was also wondering that the alternative text from the media library wasn't used for the image block. Is this the intended behavior?

mcsf commented 5 months ago

Hey, @Zodiac1978! Likewise, it was a pleasure meeting you!

Character escaping

I tested this myself because I was surprised by your report that characters were not escaped in the caption's text content. Everything else was more or less expected. Here is where I misled you: I told you about the block editor's serialiser (which escapes > but not < inside values), which kicks in to process all block data into a string for post_content, but I forgot to mention that there is more code than runs on the server that processes content before it is written to the database. Historically this was useful, but nowadays it leads to strange edge cases.

You can observe the discrepancies between what the client does and what the server does by opening the block editor, adding content, then switching to code mode, and comparing that markup with what you get later from the database.

What you describe sounds a lot like https://github.com/WordPress/gutenberg/issues/15636, where the culprit appears to be wptexturize. As noted in the linked comment, no one really dares touch the server-side machinery because of all the unpredictable ways in which it can break WordPress (cf. Core ticket), and we've historically tried to cope on the editor side.

Bringing this back to the realm of your plugin: it's clear that we can't guarantee that something along the way hasn't produced malformed HTML when dealing with user content, but the best next thing you can do is assume that attribute values can contain < and > and guard against those accordingly. After all, the spec allows this.

Sourcing `alt` field from Media library

To the best of my knowledge, the Image block attempts to retrieve alt data when the user picks an image using the media modal, but has never been concerned with keeping that information synchronised with the media object. Is this what you are asking about? To recap the steps:

In the media manager, upload an image, then add alt text to it
In the block editor, insert an Image block
In the block's placeholder, click "Media library" to open the media modal
Select the image, closing the modal
Observe: The Image block's alt attribute corresponds to the value found in the media library
Go to the media library and change the alt text
Observe: The image block's alt attribute hasn't changed and thus no longer corresponds to the library value

I suppose that with the development of block bindings this becomes easily fixable: just declare the Image block's alt attribute to be "sourced" from a given media object. But I don't know if that has been discussed or is desirable, so I suggest bringing it up in Make WordPress Slack's #core-editor.

Zodiac1978 commented 5 months ago

the Image block attempts to retrieve alt data when the user picks an image using the media modal, but has never been concerned with keeping that information synchronised with the media object. Is this what you are asking about?

No, synchronized alternative text is not recommended, because one image could have different alternative texts in different places.

~~But in WordPress 6.5.3 your step number 5 is broken. The alternative text in the image block is empty.~~ Edit: My bad, looked at the wrong field ... it works as it should.

For the escaping issue this looks like a major blocker for my solution. Need to dig deeper. Thanks for the links and more context.

Zodiac1978 commented 5 months ago

I'm unsure if this long-standing problem should be solved by my little plugin. Thinking about going back to my first approach, just ignoring the HTML comments. This would make the RegEx much more robust: 

Additionally, this would leave the filename, alternative text and caption intact.

mcsf commented 5 months ago

I'm unsure if this long-standing problem should be solved by my little plugin. Thinking about going back to my first approach, just ignoring the HTML comments. This would make the RegEx much more robust: 

I think that's your best bet, yes! Actually, come to think of it, I think that the HTML spec reserves --> as a special string, so you can much more confidently make assumptions about it. I wouldn't say you can be 100% confident because who knows what could mangle HTML (as mentioned before), but it sounds pretty reasonable.

I suggest you take a look at one of our official parsers, specifically its tokenizer regular expression:

https://github.com/WordPress/gutenberg/blob/6fa347111e9714141c0a38711b518d9ef2485485/packages/block-serialization-default-parser/src/index.js#L89-L90

Please read the documentation attached to the expression. I don't know how compatible the expression is with the regex implementation of the database, but this one is definitely battle-tested. :)

Zodiac1978 commented 5 months ago

Thanks @mcsf - but the tokenizer still only affects the comment part: https://regex101.com/r/mT1bDV/1

It does not help me, getting the markup removed.

There is also this new WP HTML Tag Processor: https://developer.wordpress.org/reference/classes/wp_html_tag_processor/

But I have no idea how this would help me in this case. The only way would be to use an own table with the cleaned up content ... but that is a little bit too much for core I think.

MySQL has the ability to define functions itself. But the solution ChatGPT is suggesting is also just looking for "<" and ">", so we have the same issues if those are not reliable encoded to entities.

DELIMITER //

CREATE FUNCTION strip_html_tags(input_text TEXT) RETURNS TEXT
BEGIN
    DECLARE output_text TEXT DEFAULT '';
    DECLARE start_pos INT DEFAULT 1;
    DECLARE end_pos INT DEFAULT 0;
    DECLARE text_length INT DEFAULT CHAR_LENGTH(input_text);
    DECLARE in_tag BOOLEAN DEFAULT FALSE;

    WHILE start_pos <= text_length DO
        SET end_pos = LOCATE('<', input_text, start_pos);

        IF end_pos = 0 THEN
            SET output_text = CONCAT(output_text, SUBSTRING(input_text, start_pos));
            SET start_pos = text_length + 1;
        ELSE
            IF start_pos < end_pos THEN
                SET output_text = CONCAT(output_text, SUBSTRING(input_text, start_pos, end_pos - start_pos));
            END IF;

            SET start_pos = LOCATE('>', input_text, end_pos) + 1;
            IF start_pos = 0 THEN
                SET start_pos = text_length + 1;
            END IF;
        END IF;
    END WHILE;

    RETURN output_text;
END //

DELIMITER ;

Zodiac1978 / wp-search-ignore-block-names