RSS-Bridge / rss-bridge

The RSS feed for websites missing it
https://rss-bridge.org/bridge01/
The Unlicense

Auto/Custom site scraper #225

Closed GreenLunar closed 8 years ago

GreenLunar commented 8 years ago

There is an online service called Feed Creator which scrapes pages according to user-defined parameters.

I suggest doing the same, with Automatic (guessed) and Custom (user-defined) modes. This would be useful to those who do not feed themselves from the big, so-called "popular" websites that rss-bridge covers.

GreenLunar commented 8 years ago

Are any of you familiar with XPath? I am not proficient in PHP, yet this should be fairly easy to implement using the contains function of XPath.

a) Use the contains function to locate a @class or @id attribute, and/or a @href attribute.

b) Find common nodes (that is, node()) that share the given parameters, and count them. Each node() represents a feed entry.

c) Scrape the page!

The service I linked to above does not appear to use the contains function, and fails when an attribute value has white-space in it.

https://developer.mozilla.org/en-US/docs/Web/XPath/Functions/contains
http://www.w3.org/TR/xpath/#function-contains
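The steps above could be sketched roughly as follows. This is only an illustration in Python, not RSS-Bridge code: it uses the standard library's html.parser instead of a real XPath engine, and imitates contains() with a plain substring test. The names ContainsCounter, count_matches and SAMPLE are made up for the example.

```python
from html.parser import HTMLParser


class ContainsCounter(HTMLParser):
    """Count start tags whose given attribute contains a substring.

    This imitates the substring semantics of XPath's contains(); it is
    not an XPath implementation.
    """

    def __init__(self, attr, needle):
        super().__init__()
        self.attr = attr
        self.needle = needle
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == self.attr and value and self.needle in value:
                self.count += 1


def count_matches(html, attr, needle):
    """Return how many elements carry `attr` with a value containing `needle`."""
    parser = ContainsCounter(attr, needle)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical page: two feed entries, one with white-space in @class.
SAMPLE = """
<div class="entry featured"><a href="/post/1">One</a></div>
<div class="entry"><a href="/post/2">Two</a></div>
<div class="sidebar"><a href="/about">About</a></div>
"""

entries = count_matches(SAMPLE, "class", "entry")
links = count_matches(SAMPLE, "href", "/post/")
```

Because the test is a substring match, the element with class="entry featured" is counted as well, which is the white-space case mentioned above. The flip side is that a class such as "entry-footer" would also match, so a real implementation would probably want stricter token matching.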

logmanoriginal commented 8 years ago

Are any of you familiar with XPath?

Yes, have a look at this for example.

I am not proficient in PHP, yet this should be fairly easy to implement using the contains function of XPath.

a) Use the contains function to locate a @class or @id attribute, and/or a @href attribute.

b) Find common nodes (that is, node()) that share the given parameters, and count them. Each node() represents a feed entry.

c) Scrape the page!

Yes, it is most certainly possible. Just keep in mind that the more generic it gets, the harder it is to predict its behavior in specific situations. Designing bridges for specific sites makes it possible to predict what data is actually loaded from the servers and how it is processed. This is almost impossible to do if users can provide arbitrary URLs.

That's because whatever URL is provided to a bridge must be downloaded in order to make use of its contents. Now imagine me providing a link to a file instead of an HTML page. The bridge would actually download the file, even though this is clearly not intended by RSS-Bridge and certainly not by the server owner (as the download may contain viruses or illegal software).

By the way, I'm pretty sure there is a way around this (for example by analyzing the HTTP header). There is of course much more to take care of, like invalid HTML and JavaScript or content like videos or ads.
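As a rough illustration of the header idea, the sketch below checks the Content-Type of a URL with a HEAD request before anything is downloaded. It is written in Python with the standard library only and is not part of RSS-Bridge; the function names and the set of accepted media types are assumptions made for the example.

```python
import urllib.request

# Assumption: only these media types are treated as scrapeable pages.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def is_html_content_type(header_value):
    """Return True if a Content-Type header value names an HTML media type."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = (header_value or "").split(";", 1)[0].strip().lower()
    return media_type in HTML_TYPES


def url_serves_html(url, timeout=10):
    """Send a HEAD request and inspect Content-Type without fetching the body."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return is_html_content_type(response.headers.get("Content-Type"))
```

This only narrows the problem: a server can send a misleading Content-Type or refuse HEAD requests, so the check is a first filter rather than a guarantee.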

That being said, there are many bridges you can use for reference and the PHP community is very active and supportive. Also the PHP documentation is very well made. Of course there are a lot of forums out there, or in case of more specific questions the GitHub issues.

The service I linked to above does not appear to use the contains function, and fails when an attribute value has white-space in it.

If I understand you correctly, you use that service and ran into an issue with those attributes. Have you tried getting in contact with the site owner? Since they sell it as a product, they should be interested in keeping it as functional as possible.

GreenLunar commented 8 years ago

Thank you for the reference, and for the comprehensive comment.

Yes, it is most certainly possible. Just keep in mind that the more generic it gets, the harder it is to predict its behavior in specific situations.

Yes, I am aware of this issue. A generic solution is mostly good for cases in which the user cares less about the style of the content (which does matter) than about being uniformly notified of updates from specified pages.

If I understand you correctly, you use that service and ran into an issue with those attributes. Have you tried getting in contact with the site owner? Since they sell it as a product, they should be interested in keeping it as functional as possible.

Yes, I did, but now I am using my own page scraper. It is not as extensive as RSS with full text of Feed Creator (i.e. in terms of downloading content from the page an entry links to and displaying the relevant content as the entry description), yet it does what I need it to do, with several extra features.

I have made a couple of attempts to post about it on their support channel, but my posts have never appeared. I can only guess that they have this white-space issue fixed in their paid plans. By the way, the source code of their webapp is licensed under AGPL.

@fivefilters

ArthurHoaro commented 8 years ago

Yes, have a look at this for example.

Note that using XPath queries requires either having the libxml PHP extension installed, working on a strictly valid XML file, or using a 3rd party library which can load the DOM. While using regular expressions with HTML isn't really recommended, it's easier to set up.

In your example, the bridge uses libxml. It might be a good idea to mention it in the requirements.

GreenLunar commented 8 years ago

working on a strictly valid XML file, or using a 3rd party library which can load the DOM

XPath alone appears to do the job, and it appears that the XPath support of the Python module lxml works well even with broken HTML files, without needing BeautifulSoup to correct the broken areas. Maybe this is also the case with PHP.
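To illustrate what tolerating broken HTML can look like, here is a minimal Python sketch. lxml is not assumed to be installed, so it uses the standard library's html.parser as a stand-in; that module does not support XPath, it simply keeps reporting tags even when the markup is not well formed. LinkCollector, extract_links and BROKEN are hypothetical names for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, ignoring how the tags are nested."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in `html`, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Broken on purpose: unclosed <li> and <a> tags, plus a stray </div>.
BROKEN = '<ul><li><a href="/post/1">One<li><a href="/post/2">Two</ul></div>'

found = extract_links(BROKEN)
```

The parser never builds a tree here, so it has nothing to repair; it just reports each start tag as it sees it, which is enough for pulling out links.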

logmanoriginal commented 8 years ago

In your example, the bridge uses libxml. It might be a good idea to mention it in the requirements.

Good point, I never thought about it 😊 => Wiki updated

XPath alone appears to do the job, and it appears that the XPath support of the Python module lxml works well even with broken HTML files, without needing BeautifulSoup to correct the broken areas. Maybe this is also the case with PHP.

Yes, the PHP module will load heavily broken HTML files (and generate lots of errors while it is at it). This is why you need libxml_use_internal_errors(true); in order to prevent those errors from stopping your script.