RSS-Bridge / rss-bridge

The RSS feed for websites missing it
https://rss-bridge.org/bridge01/
The Unlicense

Facebook Bridge generates invalid XML from a particular profile page #1530

Open arcctgx opened 4 years ago

arcctgx commented 4 years ago

Describe the bug

My RSS reader (Liferea 1.12.7) reports a parse error for the Atom feed that RSS-Bridge generates from the Facebook profile https://www.facebook.com/mglaofficial/. Mozilla Firefox also reports a parse error.

This is the only RSS feed generated by RSS-Bridge that I'm having problems with. This particular feed worked fine until the April 7th post was published on that profile.

To Reproduce

Steps to reproduce the behavior:

  1. Go to https://bridge.suumitsu.eu/?action=display&bridge=Facebook&context=User&u=mglaofficial&media_type=all&limit=-1&format=Atom
  2. Select "Open with Firefox".
  3. See XML parse error.

Expected behavior

The XML document tree is displayed and no parse errors are reported.

Additional context

I downloaded the XML generated by RSS-Bridge and ran it through xmllint, with the following result:

$ xmllint mglaofficial.xml 
mglaofficial.xml:22: parser error : PCDATA invalid Char value 3
t;p> 23.10. Oberhausen...<br /> 24.10. Sneek <br /> 26.10. London
                                                                               ^
mglaofficial.xml:22: parser error : PCDATA invalid Char value 3
 /> 27.10. Manchester <br /> 28.10. Glasgow <br /> 29.10. Belfast
                                                                               ^
mglaofficial.xml:22: parser error : PCDATA invalid Char value 3
t;br /> 30.10. Dublin <br /> 31.10. Birmingham <br /> 1.11. Lille
                                                                               ^
mglaofficial.xml:22: parser error : PCDATA invalid Char value 3
ille<br /> 3.11. Paris <br /> 4.11. Arlon <br /> 5.11. Zurich

The problem is that ASCII control characters 0x03 (^C, ETX) are embedded in the content of the April 7th post, right after the words "London", "Belfast", etc. They appear to be what causes the XML parse errors: after manually removing the 4 occurrences of this character, xmllint, Liferea, and Firefox no longer complain.
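
A quick way to confirm this kind of diagnosis on a downloaded feed is to count the bytes that XML 1.0 forbids. The following is only a minimal sketch and not part of RSS-Bridge: the function name `count_xml_invalid` is hypothetical, and it assumes an ASCII or UTF-8 feed, where bytes below 0x20 never occur inside a multi-byte character.

```shell
#!/bin/bash

# Hypothetical helper: count the C0 control bytes that XML 1.0 does not allow.
# Below 0x20, only TAB (\011), LF (\012) and CR (\015) are legal.
count_xml_invalid() {
    # -c complements the set and -d deletes, so only the forbidden bytes
    # survive; wc -c then counts them.
    LC_ALL=C tr -cd '\000-\010\013\014\016-\037' | wc -c
}
```

Running `count_xml_invalid < mglaofficial.xml` should print 4 for the feed described here, provided the four ETX characters are the only offending bytes.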

While I understand it's difficult to fully sanitize any arbitrary input, maybe something could be done to handle the lower-ASCII control sequences?

somini commented 4 years ago

Why the hell does Facebook allow control characters in posts? They don't strip this? This is crazy. It's probably OK to strip these kinds of characters for all feeds.

arcctgx commented 3 years ago

As a workaround, Liferea users can create a conversion filter that removes these control characters until this issue is fully addressed.

#!/bin/bash

# Remove lower-ASCII control characters: NUL-ACK and SO-US.
# Familiar control sequences \[abtnvfr] (BEL-CR) are not changed.
tr --delete '\0-\6\16-\37'

Save this to a file and make it executable. Then tell Liferea to use this filter for the affected feed: right-click the feed, select "Properties", go to the "Source" tab, check "Use conversion filter", and pick the script with "Select file". The filter might not take effect until you restart Liferea.

dvikan commented 2 years ago

This is a rare edge case but I think we can strip away these characters. There might be some rss xml rules regarding escaping these?
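
On the question of rules: XML 1.0 offers no way to escape these characters. Its `Char` production allows only tab (0x09), line feed (0x0A), and carriage return (0x0D) below 0x20, and a character reference such as `&#3;` is not well-formed either, so the characters have to be removed or replaced. Below is a minimal sketch of such a filter. The function name `strip_xml_invalid` is hypothetical, this is not RSS-Bridge code, and it assumes ASCII or UTF-8 input. Unlike the conversion filter above, it also drops BEL, BS, VT, and FF, which XML 1.0 forbids as well.

```shell
#!/bin/bash

# Hypothetical helper: delete every C0 control byte that XML 1.0 forbids.
# TAB (\011), LF (\012) and CR (\015) are kept because they are legal.
strip_xml_invalid() {
    # LC_ALL=C makes tr operate on raw bytes, leaving UTF-8 sequences intact.
    LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}
```

The result can be checked with something like `strip_xml_invalid < mglaofficial.xml | xmllint --noout -`, which should report no parse errors if the control characters were the only problem.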