Closed mrnoname1000 closed 1 year ago
I will try to fix.
Quick comment already: your article selector should be following-sibling::ol[1]. Without the [1], you copy the full document n times, where n is the number of entries, which becomes very large.
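To make the difference between the two selectors concrete, here is a minimal sketch in Python. The page fragment is hypothetical. Because the standard-library ElementTree does not support the following-sibling axis, the helper following_siblings is an illustrative emulation of it, not FreshRSS code.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: one <h3> per release, each followed by its own <ol>.
PAGE = """
<div>
  <h3>Release 3</h3><ol><li>c1</li><li>c2</li></ol>
  <h3>Release 2</h3><ol><li>b1</li></ol>
  <h3>Release 1</h3><ol><li>a1</li></ol>
</div>
"""

def following_siblings(parent, node, tag):
    """Emulate the XPath axis following-sibling::tag for a direct child of parent."""
    children = list(parent)
    start = children.index(node) + 1
    return [c for c in children[start:] if c.tag == tag]

root = ET.fromstring(PAGE)
headings = root.findall("h3")

# following-sibling::ol matches every later <ol>, for every heading
all_matches = [following_siblings(root, h, "ol") for h in headings]
# following-sibling::ol[1] keeps only the nearest <ol>
first_matches = [following_siblings(root, h, "ol")[:1] for h in headings]

counts_all = [len(m) for m in all_matches]      # [3, 2, 1]
counts_first = [len(m) for m in first_matches]  # [1, 1, 1]
```

With n entries, the unrestricted selector returns n + (n-1) + ... + 1 nodes in total, while the [1] variant returns exactly n.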
That's true. I noticed only the first result is used so I figured it didn't matter but maybe there's a chance for optimization there.
@mrnoname1000 Please try https://github.com/FreshRSS/FreshRSS/pull/4878
It works! However, I'm getting a weird issue with the original query following-sibling::ol, where only the last entry appears. I see you changed the code to save HTML from all returned nodes, so in theory the original query should put the entire rest of the page in an entry, but it doesn't. Maybe the page is just too big?
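As a rough illustration of what "save HTML from all returned nodes" could mean, compared with keeping only the first node, here is a small Python sketch. The fragment and variable names are hypothetical, and this is not the code from the pull request.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: two <ol> nodes that a content query might return.
FRAGMENT = "<div><h3>v2</h3><ol><li>b</li></ol><ol><li>a</li></ol></div>"
root = ET.fromstring(FRAGMENT)
nodes = root.findall("ol")  # stands in for the nodes matched by the query

# Old behaviour as described in the thread: only the first result is used.
first_only = ET.tostring(nodes[0], encoding="unicode")

# New behaviour as described in the thread: markup of every node is kept.
all_nodes = "".join(ET.tostring(n, encoding="unicode") for n in nodes)
```

Here first_only holds one list, while all_nodes holds both lists back to back, which is why an unrestricted selector can make an entry much larger.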
Also this appears to be a breaking change, so I think users should be notified of this before updating (maybe in the release notes).
The selector following-sibling::ol is most likely too big indeed.
We always have a changelog, but I cannot immediately think of use-cases that would break due to this change.
Well, following-sibling::ol for one. It's incredibly inefficient, but the original behavior discarded all but the first result, and the new behavior, for whatever reason, returns only the last entry. Also, for smaller pages where the scraper doesn't break, articles will suddenly have different content than expected. It might be fine if the content is the only change, but if queries are expected to break (even if only in edge cases), I think it deserves special mention.
I have just realised there was a mistake in one of my variables. Please try again https://github.com/FreshRSS/FreshRSS/pull/4878
Now it works as expected, even with the original selectors! I think most of my points above are now moot since only the content of entries should change.
Is your feature request related to a problem? Please describe. When scraping websites with XPath queries, it seems that the scraped HTML is stripped so it appears as plain text in the browser. This is fine for very simple pages, but it becomes ugly when scraping more complex pages with links or some other structure. For example, the SQLite release history page has an <ol> for each entry, and when scraped by FreshRSS it becomes a huge block of text with no structure.
Describe the solution you’d like HTML content scraped by XPath should retain its markup.
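The difference between stripped text and retained markup can be sketched in Python as follows. The entry below is a made-up example, not taken from the SQLite page, and the code only illustrates the two outcomes rather than how FreshRSS implements either.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry content: a list with a link, as a scraper might match it.
ENTRY = (
    '<ol><li>Added <a href="https://example.org/doc">a feature</a></li>'
    "<li>Fixed a bug</li></ol>"
)
node = ET.fromstring(ENTRY)

# Text-only extraction: list items and the link collapse into one run of text.
as_text = "".join(node.itertext())  # "Added a featureFixed a bug"

# Serialising the node instead keeps the <ol>, <li> and <a> structure.
as_markup = ET.tostring(node, encoding="unicode")
```

The text-only form loses both the item boundaries and the link target, which matches the "huge block of text" described above.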
Describe alternatives you’ve considered A clear and concise description of any alternative solutions or features you’ve considered.
Additional context HTML scraped with CSS selectors is not escaped, so the issue can be worked around in most cases. RSS-Bridge has the same behaviour, but IMO not stripping HTML makes more sense. Also IMO the behavior should match between CSS selectors and XPath queries for entry content.
Here's an OPML export for scraping the website in question: sqlite.opml.xml.txt
Tested on latest edge 0ad8e6b