RSS-Bridge / rss-bridge

The RSS feed for websites missing it
https://rss-bridge.org/bridge01/
The Unlicense

Try to scrape Facebook using their mbasic site #1570

Open somini opened 4 years ago

somini commented 4 years ago

Facebook has an mbasic site at https://mbasic.facebook.com/ which gives me cleaner data than the pseudo-mobile site scraped by both existing bridges.

This is just a suggestion, since I can't see a nicer way to do this.

Pinging existing Facebook bridge maintainers @teromene @logmanoriginal, maybe this was already tried and abandoned?
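For anyone who wants to compare what the different hosts serve for the same page, here is a minimal Python sketch that swaps only the hostname of a URL. The page name `examplepage` is hypothetical, and whether the same path is valid on every Facebook host is an assumption that would need checking.

```python
from urllib.parse import urlsplit, urlunsplit


def with_host(url, host):
    """Return `url` with its hostname replaced by `host`, keeping path and query."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=host))


# "examplepage" is a made-up page name used only for illustration.
main_url = "https://www.facebook.com/examplepage/?ref=x"
mbasic_url = with_host(main_url, "mbasic.facebook.com")
print(mbasic_url)
```

This only rewrites the URL; it does not fetch anything or say how the returned HTML should be parsed.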

triatic commented 4 years ago

It's always safer to scrape the primary site, www.facebook.com, since we never know when alternative sites might be removed.

m040601 commented 4 years ago

Interesting discussion.

> It's always safer to scrape the primary site, www.facebook.com, since we never know when alternative sites might be removed.

Yes, this sounds like good advice. These site variants appear and disappear like crazy. I never even knew touch.facebook.com or mbasic.facebook.com existed. Sometimes I visit reddit.com and then discover that you can also use mobile.twitter.com, or sometimes it's m.something.com, wap.something.com, or old.something.com. Or lite.duckduckgo.com or html.duckduckgo.com.

But I also seem to have found some other interesting details that I'd like to share.

I recently started testing RSSBridge and other competitors. My main interest is getting RSS out of Facebook.

I'm trying to find out exactly and thoroughly what the different RSSBridge Facebook scraper variants do, and what their options and parameter choices result in. I don't see answers to these questions in the docs. I'm not a developer and I don't understand PHP either, so feel free to add a comment if you have real experience and understanding.

Confirmed:

Choosing the "Fb2" scraper, called "Facebook Bridge | Touch Site", which seems to scrape touch.facebook.com, gives more results than the "Main" scraper, which scrapes www.facebook.com. In one example, the main scraper gave me 20 items posted in 2020, while FB2 gave me 100 items going back to 2018. I tried different profiles and kept reproducing this "20" versus "100" items pattern. Would scraping mbasic.facebook.com eventually get me even more and older items? Good question. Maybe there's even a "secret" Facebook variant that lets you scrape years of old posts :-) ?
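To make this kind of comparison repeatable, something like the following Python sketch could count the items in a saved feed and report the oldest and newest dates. It assumes the feed was saved as RSS 2.0 XML with `pubDate` elements, which is an assumption about the chosen output format; the sample feed below is made up for illustration.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def summarize_feed(xml_text):
    """Return (item_count, oldest, newest) for an RSS 2.0 document.

    Dates are None when no item carries a pubDate.
    """
    root = ET.fromstring(xml_text)
    items = root.findall("./channel/item")
    dates = []
    for item in items:
        pub = item.findtext("pubDate")
        if pub:
            dates.append(parsedate_to_datetime(pub))
    oldest = min(dates) if dates else None
    newest = max(dates) if dates else None
    return len(items), oldest, newest


# Made-up sample standing in for a feed saved from one bridge.
SAMPLE = """<rss version="2.0"><channel><title>example</title>
<item><title>a</title><pubDate>Mon, 05 Mar 2018 10:00:00 +0000</pubDate></item>
<item><title>b</title><pubDate>Wed, 01 Jul 2020 12:30:00 +0000</pubDate></item>
</channel></rss>"""

count, oldest, newest = summarize_feed(SAMPLE)
print(count, oldest, newest)
```

Running it once per bridge on the same profile would show whether the "20" versus "100" difference holds, and how far back each feed reaches.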

Still needs more testing:

Otherwise I'm pretty satisfied with the job RSSBridge does with Facebook. Or better said, amazed that this can still be done in 2020.

A big Thank You to all the guys who maintain these Facebook scrapers. Great job.

I also tried other Python options, like https://github.com/irfancharania/fb-feed-gen, but got fewer results.

cvtsi2sd commented 3 years ago

In the last few months I've been getting worse and worse results with the "regular" FB bridge - it looks like the HTML served is increasingly dirty (actual content difficult to separate from surrounding stuff, problems like #1774, ...). FB2 works intermittently - I generally get good results, but often it fails with "Unable to get the page id.", even when providing the page ID directly, so I think it may be related to rate limiting. Maybe it's time to investigate mbasic more thoroughly?
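If the intermittent failures really are caused by rate limiting (which is only a guess here), one client-side workaround is to retry with increasing delays. Below is a minimal Python sketch of exponential backoff; `fetch` is a hypothetical callable standing in for whatever requests the feed, and nothing here is an RSS-Bridge feature.

```python
import time


def fetch_with_backoff(fetch, attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure.

    `fetch` is any zero-argument callable that raises on failure.
    The last error is re-raised if every attempt fails.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:
            last_error = exc
            # No point sleeping after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error


# Simulated flaky source: fails twice, then succeeds.
calls = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("Unable to get the page id.")
    return "ok"


delays = []
result = fetch_with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the example testable without actually waiting; in real use the default `time.sleep` would apply.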

About its future availability, I somewhat hope that it's been tied to some "weak" embedded device, so it may be here to stay at least for a while. OTOH the main site is subject to continuous redesign, so it's kind of a moving target anyway?