Closed HolgerAusB closed 9 months ago
Hi Holger, to be clear, Full-Text RSS itself doesn't do anything magic here. It simply carries out the string replacement. So in your example, the <p> replacement on the input HTML you provided:
<h2 class="clay-subheader" ...>
<span class="ordered-list-item">
<p class="list-item-text">1.</p>
</span>
You don’t have to read everyone’s book.
</h2>
becomes the following after string replacement:
<h2 class="clay-subheader" ...>
<span class="ordered-list-item">
<anydifferenttag>1.</p>
</span>
You don’t have to read everyone’s book.
</h2>
The magic that you're seeing happens with the HTML parser when it parses the above. Full-Text RSS relies on HTML5-PHP to parse HTML, which tries to follow HTML5 parsing rules. It's kind of the same way a browser will try to make sense of a malformed HTML document.
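To make the first step concrete, here is a minimal Python sketch of the string replacement on its own, before any parser is involved. The helper name `apply_replace_string` is made up for illustration and is not Full-Text RSS code; it only assumes the rule behaves like a plain substring swap, as described above.

```python
# Hypothetical helper (not part of Full-Text RSS): treats a replace_string
# rule as a plain substring replacement on the raw HTML.
def apply_replace_string(html: str, find: str, replace: str) -> str:
    return html.replace(find, replace)


# Input HTML copied from the example above, including the elided "...".
source_html = (
    '<h2 class="clay-subheader" ...>\n'
    '<span class="ordered-list-item">\n'
    '<p class="list-item-text">1.</p>\n'
    '</span>\n'
    'You don’t have to read everyone’s book.\n'
    '</h2>'
)

replaced = apply_replace_string(
    source_html, '<p class="list-item-text">', '<anydifferenttag>'
)

# Only the opening tag was touched; the closing </p> is still in the string.
print(replaced)
```

The output still contains the stray `</p>`, so any repair of the closing tag has to come from the HTML parser that runs afterwards, not from the replacement.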
I think the ideal way to deal with such changes is with more advanced rules, some of which were supported by the original Instapaper rules these are based on. Things such as unwrap: XPath or move_into(), which would let you manipulate the DOM without string replacement. We haven't seen many cases where those are essential, so we haven't added support for them.
Whether to use the string replacement method you've highlighted and rely on the parser to figure out the correct structure, I'm not sure. It's a bit hacky. I personally wouldn't do it for minor formatting improvements, but I don't have very strong feelings about it.
There are some site config files where we've used string replacement to signal an end to the document where a suitable XPath couldn't be found or it was just much simpler than constructing one to isolate the content. Imagine something like this:
replace_string(<!--end of article-->):</body></html>
We're not creating well-formed HTML, but we're hoping the HTML parser does what we want and ends the document at the point the comment is encountered and ignores all the other elements that follow.
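As a rough Python sketch of what that rule does to the raw string (the sample page and the helper name `close_document_at_marker` are invented for illustration):

```python
END_MARKER = "<!--end of article-->"


# Hypothetical helper: swaps the marker comment for closing tags, the way the
# replace_string rule above would.
def close_document_at_marker(html: str, marker: str = END_MARKER) -> str:
    return html.replace(marker, "</body></html>")


# Invented sample page: article text, the marker, then unwanted trailing markup.
page = (
    "<html><body><p>Article text.</p>"
    "<!--end of article-->"
    "<div>Related links</div></body></html>"
)

closed = close_document_at_marker(page)

# Split at the first injected close to see what sits on either side of it.
article_part, _, trailing = closed.partition("</body></html>")
print(article_part)
print(trailing)
```

The trailing markup is still present in the string after the replacement. Whether it ends up outside the extracted content depends entirely on how the parser treats markup that follows the injected `</body></html>`.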
@fivefilters, @j0k3r: I just stumbled upon a nice feature when using replace_string to change HTML tags. If I change an HTML tag to a different tag, the matching closing tag is also changed magically in Full-Text RSS. So my question is: will this work with other projects using this site_config as well? graby, wallabag, others? Or might it cause problems if someone does not explicitly replace the closing tag too? Or might there be problems with FTR/P2K too?
Example:
Original html:
site-config:
result:
Single-line replacement also works, as well as with tag fragments (missing parameters and >):
result: