fivefilters / ftr-site-config

Site-specific article extraction rules to aid content extractors, feed readers, and 'read later' applications.
https://www.fivefilters.org/full-text-rss/

replace_string for <tag> also replaces </tag>. That is nice, but is this safe? #1048

Closed HolgerAusB closed 9 months ago

HolgerAusB commented 1 year ago

@fivefilters, @j0k3r: I just stumbled upon a nice feature when using replace_string to change HTML tags. If I change an HTML tag to a different tag, the matching closing tag is also changed, as if by magic, in Full-Text RSS.

So my question is: will this also work with other projects that use these site configs, such as graby, wallabag, and others? Or might it cause problems if someone does not explicitly replace the closing tag too? And might there be problems with FTR/P2K as well?

Example:

Original html:

<h2 class="clay-subheader" ...>
   <span class="ordered-list-item">
      <p class="list-item-text">1.</p>
   </span>
   You don’t have to read everyone’s book.
</h2>

site-config:

find_string: <p class="list-item-text">
replace_string: <anydifferenttag>

result:

<h2 class="clay-subheader" ...>
   <span class="ordered-list-item">
      <anydifferenttag>1.</anydifferenttag>
   </span>
   You don’t have to read everyone’s book.
</h2>

The single-line form works too, even with tag fragments (no attributes and no closing >):

replace_string(<h2): <someuglytag

result:

<someuglytag class="clay-subheader" ...>
   <span class="ordered-list-item">
      <p class="list-item-text">1.</p>
   </span>
   You don’t have to read everyone’s book.
</someuglytag>

fivefilters commented 1 year ago

Hi Holger, to be clear Full-Text RSS itself doesn't do anything magic here. It simply carries out the string replacement. So in your example, the <p> replacement on the input HTML you provided:

<h2 class="clay-subheader" ...>
   <span class="ordered-list-item">
      <p class="list-item-text">1.</p>
   </span>
   You don’t have to read everyone’s book.
</h2>

becomes the following after string replacement:

<h2 class="clay-subheader" ...>
   <span class="ordered-list-item">
      <anydifferenttag>1.</p>
   </span>
   You don’t have to read everyone’s book.
</h2>

The magic that you're seeing happens with the HTML parser when it parses the above. Full-Text RSS relies on HTML5-PHP to parse HTML, which tries to follow HTML5 parsing rules. It's kind of the same way a browser will try to make sense of a malformed HTML document.
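To illustrate the two steps separately, here is a minimal Python sketch using only the standard library. It is not how Full-Text RSS or HTML5-PHP is implemented: `TolerantSerializer` and `reparse` are hypothetical names, and the recovery rule (ignore a stray end tag, close any elements still open when an enclosing end tag arrives) is a deliberately simplified stand-in for the HTML5 parsing rules. A real HTML5 parser follows more detailed rules and may build a slightly different tree.

```python
from html.parser import HTMLParser


class TolerantSerializer(HTMLParser):
    """Re-serialize HTML with a simplified error-recovery rule.

    Hypothetical helper for illustration only; NOT the HTML5 algorithm.
    Void elements such as <br> are not handled specially.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # names of the elements that are currently open
        self.out = []    # pieces of the re-serialized document

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag (e.g. the leftover </p>): ignore it
        # Close every element opened since the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def close(self):
        super().close()
        # Close whatever is still open at the end of the input.
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")


def reparse(html):
    parser = TolerantSerializer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


original = '<span class="ordered-list-item"><p class="list-item-text">1.</p></span>'

# Step 1: plain string replacement only touches the opening tag.
replaced = original.replace('<p class="list-item-text">', "<anydifferenttag>")

# Step 2: the tolerant re-parse drops the stray </p> and closes the new tag.
repaired = reparse(replaced)
```

After step 1 the string still contains the old `</p>`; only after step 2 does the new element get a matching closing tag, which is the effect that looks like magic in the final output.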

I think the ideal way to deal with such changes is with more advanced rules, some of which were supported by the original Instapaper rules that these site configs are based on: things such as unwrap: XPath or move_into(), which would let you manipulate the DOM without string replacement. We haven't seen many cases where those are essential, so we haven't added support for them.
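For a sense of what DOM-level manipulation could look like, here is a rough Python sketch of an unwrap-style operation. It assumes that "unwrap" means replacing a matched element with its own children; that reading, the function name `unwrap`, and the helper `_append_text` are assumptions for illustration and do not describe an existing Full-Text RSS or Instapaper implementation. It uses `xml.etree.ElementTree`, so the input must be well-formed markup.

```python
import xml.etree.ElementTree as ET


def _append_text(parent, idx, text):
    """Attach text immediately before position idx among parent's children."""
    if not text:
        return
    if idx == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[idx - 1]
        prev.tail = (prev.tail or "") + text


def unwrap(root, path):
    """Replace each element matched by path with its children (assumed semantics).

    Nested matches are not handled; this is a sketch, not a complete tool.
    """
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for el in root.findall(path):
        parent = parent_of[el]
        idx = list(parent).index(el)
        parent.remove(el)
        # Text that preceded the first child moves up to the parent level.
        _append_text(parent, idx, el.text)
        children = list(el)
        for offset, child in enumerate(children):
            parent.insert(idx + offset, child)
        # Text that followed the removed element goes after its last child.
        _append_text(parent, idx + len(children), el.tail)


root = ET.fromstring("<h2><span><p>1.</p></span>Heading text</h2>")
unwrap(root, ".//span")
result = ET.tostring(root, encoding="unicode")
```

Because the change is made on the parsed tree, the output stays well-formed and there is no leftover closing tag for the parser to clean up.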

I'm not sure whether it's a good idea to use the string replacement method you've highlighted and rely on the parser to figure out the correct structure. It's a bit hacky. I personally wouldn't do it for minor formatting improvements, but I don't have very strong feelings about it.

There are some site config files where we've used string replacement to signal an end to the document where a suitable XPath couldn't be found or it was just much simpler than constructing one to isolate the content. Imagine something like this:

replace_string(<!--end of article-->):</body></html>

We're not creating well-formed HTML, but we're hoping the HTML parser does what we want and ends the document at the point the comment is encountered and ignores all the other elements that follow.
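A small Python sketch of that idea, under stated assumptions: the sample page and the helper name `apply_replace_string` are made up for illustration, and the final truncation step only simulates the hoped-for outcome. Whether a particular HTML parser really discards what follows the injected `</body></html>` depends on that parser.

```python
END_MARKER = "<!--end of article-->"


def apply_replace_string(html, find, replace):
    """Plain string replacement, as a replace_string rule would request."""
    return html.replace(find, replace)


# Hypothetical page: the article is followed by unwanted content.
page = (
    "<html><body><p>Article text.</p>"
    "<!--end of article-->"
    "<div>Related links</div></body></html>"
)

rewritten = apply_replace_string(page, END_MARKER, "</body></html>")

# Simulate the hoped-for result by keeping everything up to the first </html>.
# This is an assumption about the outcome, not a guaranteed parser behaviour.
cut = rewritten.find("</html>") + len("</html>")
kept = rewritten[:cut]
```

The rewritten string is not well-formed, since it now contains two closing `</body></html>` pairs; the rule only works if the parser treats the first pair as the end of the document.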