Some RSS Feeds may not validate correctly if actual content (like from articles) includes invalid HTML

eSilverStrike commented 4 years ago

According to https://validator.w3.org/feed/, some of our RSS feeds are invalid.

Not valid: https://www.geeklog.net/backend/geeklog.rss https://www.geeklog.net/backend/forum.rss

Valid: https://www.geeklog.net/backend/security.rss https://www.geeklog.net/backend/comments.rss

eSilverStrike commented 4 years ago

Okay, the RSS feeds on Geeklog.net had two different validation errors. I have fixed both manually.

The first was a configuration error: rdf_path did not contain path_html exactly. The value of rdf_path was technically a correct path, but SYND_getFeedUrl does a comparison check, and because the two were not exactly the same, the URL path was built incorrectly, which produced the path error in the actual RSS file. I made a note of this in the config docs for rdf_path.
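To illustrate why a path that is technically correct can still fail such a check, here is a minimal sketch. It is not the real SYND_getFeedUrl; the function name, parameters, and example paths are hypothetical, and it assumes the check is a plain string-prefix comparison.

```php
<?php
// Hypothetical helper, NOT Geeklog's actual SYND_getFeedUrl().
// Assumption: the feed URL is derived by stripping path_html from the
// feed file path with a plain string-prefix comparison.
function exampleFeedUrl($feedFile, $pathHtml, $siteUrl)
{
    // No normalisation is done here: "..", doubled slashes, or symlinks
    // make the prefix comparison fail even if both paths point to the
    // same directory on disk.
    if (strncmp($feedFile, $pathHtml, strlen($pathHtml)) !== 0) {
        return null; // this sketch gives up; it does not guess a URL
    }

    // Keep only the part of the path below the public HTML directory
    $relative = ltrim((string) substr($feedFile, strlen($pathHtml)), '/');

    return rtrim($siteUrl, '/') . '/' . $relative;
}

// Exact prefix match: a URL can be derived
$ok = exampleFeedUrl(
    '/var/www/html/backend/geeklog.rss',
    '/var/www/html/',
    'https://www.example.com'
);

// Same directory on disk, but written differently: the comparison fails
$bad = exampleFeedUrl(
    '/var/www/site/../html/backend/geeklog.rss',
    '/var/www/html/',
    'https://www.example.com'
);
```

In this sketch the second call returns null instead of a URL. The real function apparently went on to build an incorrect path rather than stopping, but the underlying cause is the same kind of inexact match.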

The second was incorrect HTML in an article that wasn't fixed before the feed was created, so the feed did not validate. Here is the error in the article; notice the extra " at the start of the href in the link tag.

<p>Remember if you would like to chat with any of the community one of the best place to reach us is on Gitter in the <a href=""https://gitter.im/Geeklog-Core/geeklog" target="_blank">Geeklog room</a>.</p>

I fixed this manually by editing the articles, but I think we should actually fix any HTML mistakes when we create the feed. (Is this possible for other feeds as well, like comments and plugin feeds?)

In lib-syndication we load the articles with functions like SYND_getFeedContentAll and SYND_getFeedContentPerTopic. We should update these functions so they can fix HTML errors, since we already loop through the articles there doing other fixes.

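Here is some sample code that I thought we could add to do this:

$articletext = "some html content that may be incorrect";
// Collect libxml parse warnings internally instead of raising PHP warnings
libxml_use_internal_errors(true);
$x = new DOMDocument();
$x->loadHTML($articletext);
// Serialize the parsed document back to HTML
$clean = $x->saveHTML();
libxml_clear_errors();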

We should also do this for comments, and maybe even plugin content (or should we let the plugins worry about their own content?).
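One practical detail: saveHTML() on the whole document also emits a doctype and html/body wrapper, which is not wanted inside a feed item. A minimal sketch of a fragment-only variant follows. The function name fixHtmlFragment is hypothetical, and the approach (wrapping the text in a temporary div and serializing only its children) is an assumption, not something Geeklog does. It can close unclosed tags, but it cannot guess what a malformed attribute value was meant to be.

```php
<?php
// Hypothetical helper: repair an HTML fragment with DOMDocument and
// return only the fragment, without doctype/html/body wrappers.
function fixHtmlFragment($html)
{
    // Collect libxml warnings internally and remember the old setting
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // The XML declaration is a common hint so the input is read as UTF-8;
    // the temporary <div> gives us one container to read back from.
    // Caveat: a stray </div> inside $html would close the container early.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $container = $doc->getElementsByTagName('div')->item(0);
    if ($container === null) {
        return $html; // parsing produced nothing usable; leave text as-is
    }

    // Serialize only the children of the temporary container
    $out = '';
    foreach ($container->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return $out;
}

// The closing </b> and </p> tags are missing in the input
$fixed = fixHtmlFragment('<p>Hello <b>world');
```

Whether this is good enough for the feed depends on the kind of mistake: missing closing tags are handled, while a broken href still needs a human or a stricter filter.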

mystralkk commented 3 years ago

I tried to fix the wrong HTML text in four ways.

The original HTML text (the closing </p> tag is missing at the end, in addition to the extra " in the href):

<p>Remember if you would like to chat with any of the community one of the best place to reach us is on Gitter in the <a href=""https://gitter.im/Geeklog-Core/geeklog" target="_blank">Geeklog room</a>.

COM_truncateHTML():

<p>Remember if you would like to chat with any of the community one of the best place to reach us is on Gitter in the <a href=""https://gitter.im/Geeklog-Core/geeklog" target="_blank">Geeklog room</a></p>

htmlawed:

<p>Remember if you would like to chat with any of the community one of the best place to reach us is on Gitter in the <a href="">Geeklog room</a>.</p>

DOMDocument (in the way you suggested):

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> <html><body><p>Remember if you would like to chat with any of the community one of the best place to reach us is on Gitter in the <a href="" https: target="_blank">Geeklog room</a>.</p></body></html>

tidy:

<!DOCTYPE html> <html> <head> <title></title> </head> <body> <p>Remember if you would like to chat with any of the community one of the best place to reach us is on Gitter in the <a href="" target="_blank">Geeklog room</a>.</p> </body> </html>

It seems we cannot fix the wrong HTML above automatically with these methods. All we can do is try to write the right HTML text in the first place or use WYSIWYG editors like CKEditor.

eSilverStrike commented 2 years ago

@mystralkk Doesn't your code show that both COM_truncateHTML and htmlawed worked?

I added a few calls to htmlawed in lib-syndication.php, and it seems to be working fine both on articles and on other RSS feed content (like comments) that I had edited to introduce missing tags.
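For reference, the general shape of such a change might look like the sketch below. It is not the code that was committed; the function name fixFeedItems, the array layout, and the 'text' field name are hypothetical. The fixer is passed in as a callable so that the loop does not depend on how htmlawed is loaded.

```php
<?php
// Hypothetical sketch: run an HTML fixer over every feed item before the
// feed is written. $fixer is any callable that takes and returns a string.
function fixFeedItems(array $items, callable $fixer, array $fields = ['text'])
{
    foreach ($items as $i => $item) {
        foreach ($fields as $field) {
            // Only touch string fields that are actually present
            if (isset($item[$field]) && is_string($item[$field])) {
                $items[$i][$field] = $fixer($item[$field]);
            }
        }
    }

    return $items;
}

// With the htmLawed library loaded, the fixer could be (assumption):
//   $fixer = function ($html) { return htmLawed($html, ['balance' => 1]); };
// A stand-in fixer is used here so the sketch runs on its own.
$standIn = function ($html) {
    return trim($html);
};

$items = [
    ['title' => 'First', 'text' => '  <p>Hello</p>  '],
    ['title' => 'No text field'],
];
$cleaned = fixFeedItems($items, $standIn);
```

The same loop could serve articles, comments, and plugin feeds alike, as long as each passes in the names of its own content fields.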

mystralkk commented 2 years ago

Well, then, it would be best to use htmlawed to fix the wrong HTML.

Geeklog-Core / geeklog

Some RSS Feeds may not validate correctly if actual content (like from articles) includes invalid HTML #1014