FreshRSS / FreshRSS

A free, self-hostable news aggregator…
https://freshrss.org
GNU Affero General Public License v3.0

[Feature] XPath scraping: Retain HTML structure #4869

Closed. mrnoname1000 closed this issue 1 year ago

mrnoname1000 commented 1 year ago

Is your feature request related to a problem? Please describe.
When scraping websites with XPath queries, the scraped HTML seems to be stripped, so it appears as plain text in the browser. This is fine for very simple pages, but it becomes ugly when scraping more complex pages with links or other structure. For example, the SQLite release history page has an <ol> for each entry, and when scraped by FreshRSS it becomes a huge block of text with no structure.

Describe the solution you’d like
HTML content scraped by XPath should retain its markup.
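The difference between the two outcomes can be sketched with Python's standard library. This is only an illustration of text-only versus markup-preserving extraction, not FreshRSS code; the fragment and the variable names are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for one scraped entry.
fragment = ET.fromstring(
    "<ol><li>Fixed a <a href='https://example.org/bug'>bug</a></li>"
    "<li>Added a feature</li></ol>"
)

# Text-only extraction: the markup is dropped and list items run together.
text_only = "".join(fragment.itertext())

# Markup-preserving extraction: serialize the node itself.
with_markup = ET.tostring(fragment, encoding="unicode")

print(text_only)
print(with_markup)
```

The first value is the kind of unstructured text block described in the report; the second keeps the list and link structure that the request asks for.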

Describe alternatives you’ve considered A clear and concise description of any alternative solutions or features you’ve considered.

Additional context
HTML scraped with CSS selectors is not escaped, so the issue can be worked around in most cases. RSS-Bridge has the same behavior, but IMO not stripping HTML makes more sense. Also, IMO the behavior for entry content should match between CSS selectors and XPath queries.

Here's an OPML export for scraping the website in question: sqlite.opml.xml.txt

Tested on latest edge 0ad8e6b

screenshot 2022-11-18 05:00:49

Alkarex commented 1 year ago

I will try to fix this. A quick comment already: your article selector should be following-sibling::ol[1]. Without the [1], you copy the full document n times, where n is the number of entries, which becomes very large.
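The growth described here can be sketched in Python. The standard library's ElementTree does not implement the following-sibling axis, so the hypothetical helper following_siblings below emulates it by hand; the page layout (one <h3> per entry followed by one <ol>) is also an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: each <h3> heading is followed by one <ol> of changes.
page = ET.fromstring(
    "<body>"
    "<h3>v3</h3><ol><li>c</li></ol>"
    "<h3>v2</h3><ol><li>b</li></ol>"
    "<h3>v1</h3><ol><li>a</li></ol>"
    "</body>"
)


def following_siblings(parent, node, tag):
    """Emulate following-sibling::tag for a direct child of parent."""
    children = list(parent)
    start = children.index(node) + 1
    return [child for child in children[start:] if child.tag == tag]


headings = page.findall("h3")

# Like following-sibling::ol: every later <ol>, for every heading.
all_matches = [following_siblings(page, h, "ol") for h in headings]

# Like following-sibling::ol[1]: only the nearest <ol> per heading.
first_matches = [matches[:1] for matches in all_matches]

print([len(m) for m in all_matches])
print([len(m) for m in first_matches])
```

With three entries, the unrestricted selector matches 3 + 2 + 1 lists in total, while the [1] variant matches one list per entry, so the total grows roughly quadratically instead of linearly with the number of entries.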

mrnoname1000 commented 1 year ago

That's true. I noticed that only the first result is used, so I figured it didn't matter, but maybe there's a chance for optimization there.

Alkarex commented 1 year ago

@mrnoname1000 Please try https://github.com/FreshRSS/FreshRSS/pull/4878


mrnoname1000 commented 1 year ago

It works! However, I'm getting a weird issue with the original query following-sibling::ol, where only the last entry appears. I see you changed the code to save HTML from all returned nodes, so in theory the original query should put the entire rest of the page in an entry, but it doesn't. Maybe the page is just too big?
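The two behaviors discussed in this thread, using only the first returned node versus saving HTML from all returned nodes, can be sketched as follows. This is a Python illustration under stated assumptions, not the FreshRSS implementation: the function names first_node_text and all_nodes_html and the sample nodes are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical result of a query that returns two <ol> nodes for one entry.
nodes = [
    ET.fromstring("<ol><li>b</li></ol>"),
    ET.fromstring("<ol><li>a</li></ol>"),
]


def first_node_text(nodes):
    """Earlier behavior as described in the thread: first node only, as text."""
    return "".join(nodes[0].itertext()) if nodes else ""


def all_nodes_html(nodes):
    """Behavior described for the change: HTML of every returned node, joined."""
    return "".join(ET.tostring(node, encoding="unicode") for node in nodes)


print(first_node_text(nodes))
print(all_nodes_html(nodes))
```

Under this reading, a broad selector such as following-sibling::ol would be expected to put every later list into the entry, which is why content can change for queries that used to rely on only the first result being kept.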

Also this appears to be a breaking change, so I think users should be notified of this before updating (maybe in the release notes).

Alkarex commented 1 year ago

The selector following-sibling::ol is most likely too big indeed. We always have a changelog, but I cannot immediately think of use-cases that would break due to this change.

mrnoname1000 commented 1 year ago

Well, following-sibling::ol for one. It's incredibly inefficient, but the original behavior discarded all but the first result, and the new behavior, for whatever reason, returns only the last entry. Also, for smaller pages where the scraper doesn't break, articles will suddenly have different content than expected. It might be fine if the content is the only change, but if queries are expected to break (even if only in edge cases), I think it deserves special mention.

Alkarex commented 1 year ago

I have just realised there was a mistake in one of my variables. Please try again https://github.com/FreshRSS/FreshRSS/pull/4878

mrnoname1000 commented 1 year ago

Now it works as expected, even with the original selectors! I think most of my points above are now moot since only the content of entries should change.