Consider using `mbasic.facebook.com`

wvffle commented 2 years ago

Facebook is a heavy beast. You guys use puppeteer and wait a random timeout before scrolling down. This seems to be slow AF.

mbasic.facebook.com does not use any js files, it's simply a html page that could be parsed. No need for loading all of the images and videos while scrolling.

You can easily find posts with #m_group_stories_container > section > article selector.
Next page link can be found with #m_group_stories_container > section + div > a.

Also it would make implementing of #15 much easier.

Only downside I can see, is one additional request per post to get an url of full size image.

kaanyagci commented 2 years ago

Thank you @wvffle for your feedback! This is a very good point!

For the random timeout on scroll, the reason for that is to act like a human and to avoid the bot detection of Facebook. That being said it's slow AF as you've mentioned. I'm completely agree with that.
There was an other reason for the use of the desktop site, our older versions were using the mobile site. I think @iMrDJAi can give more information on it.

But if we can achieve all that we want to do, we can use this url. What do you think @iMrDJAi ?

iMrDJAi commented 2 years ago

TL;DR: We can use both!

@wvffle Thanks for the feedback.

First, we don't use a timeout at all! In fact the scraper at first waits for the page to be ready then scrolls down infinitely to load posts, it stops when no more of them are available and it continues once new ones are fetched.

Facebook has a strict rate limit for the desktop website, as a result this will affect the scraping speed, btw you can speed it up by authenticating since the userless mode has much worse rate limits, however the problem will always remains.

The reason why we used the desktop website at the first place was the content quality, it provides a higher images/videos resolution and more data compared by its other versions, you should take a look on this pull request to understand what I'm talking about https://github.com/Makepad-fr/fbjs/pull/55.

I completely understand your point, mbasic.facebook.com is a good idea, it technically has no rate limits that slow the scraper down, but the downside isn't only "one additional request per post to get an url of full size image", the More button on posts that have large text content will also redirect you to the post page instead of expending the text itself.

Finally, we don't have to use just one of them, we still can combine both, I think that's what I'll do next. Is that okay for you @kaanyagci ?

wvffle commented 2 years ago

the More button on posts that have large text content will also redirect you to the post page instead of expending the text itself.

That's true, I haven't thought on that. The videos also seem to be only in low quality.

For some use-cases mbasic would be a better option than the standard domain. For example I'd want to generate RSS feed of private groups. I need only 10 last posts per group, full content, images and videos are optional.

Search done with mbasic could return an object with async getAttachments() that would return an array of objects with image/video urls and their respective type. That way, you could load a full version of the post (from facebook.com) and get full size attachments if the user wants them. If not - they simply wouldn't run the function.

I've checked https://github.com/AllanWang/Frost-for-Facebook/ before to find if they're somehow downloading high quality videos, though, the issue seems to be unresolved. They're using m.facebook.com which uses some javascript though.

iMrDJAi commented 2 years ago

@wvffle You're right, the fact that we can grab the post ID will make it possible for us to return back to it at any time we want and from any version of the website we choose (unless it's deleted). We can use methods to obtain more data upon user needs, this is exactly how it would be.

At the beginning we were using m.facebook.com, but since it's so limited we switched to www.facebook.com, for me I'm trying to repost submissions from Facebook groups to subreddits to gain a connected experience, this use-case requires to fetch full post data and high quality media. That why we did that migration.

The domain mbasic.facebook.com seems enough for your use-case, but there is still an issue I couldn't find a solution for it yet, I can't find how to set the sorting method for this version of Facebook (https://github.com/Makepad-fr/fbjs/issues/57), it's stuck at New Activity and you'll probably need a chronological sorting for your RSS feed.

It's hard to mess with Facebook CDN links. You wouldn't be able to simply generate HD download URL for a low resolution video/image on m.facebook.com or mbasic.facebook.com, each link has it's own format and signature which prevents you from doing that. There is no known way to achieve this unfortunately.

For videos I'm thinking about using this implementation https://github.com/Makepad-fr/fbjs/issues/34 since the desktop version will only provide you blob URIs and hides the original download URLs.

kaanyagci commented 2 years ago

TL;DR: We can use both!

@wvffle Thanks for the feedback.

First, we don't use a timeout at all! In fact the scraper at first waits for the page to be ready then scrolls down infinitely to load posts, it stops when no more of them are available and it continues once new ones are fetched.

Facebook has a strict rate limit for the desktop website, as a result this will affect the scraping speed, btw you can speed it up by authenticating since the userless mode has much worse rate limits, however the problem will always remains.

The reason why we used the desktop website at the first place was the content quality, it provides a higher images/videos resolution and more data compared by its other versions, you should take a look on this pull request to understand what I'm talking about https://github.com/Makepad-fr/fbjs/pull/55.

I completely understand your point, mbasic.facebook.com is a good idea, it technically has no rate limits that slow the scraper down, but the downside isn't only "one additional request per post to get an url of full size image", the More button on posts that have large text content will also redirect you to the post page instead of expending the text itself.

Finally, we don't have to use just one of them, we still can combine both, I think that's what I'll do next. Is that okay for you @kaanyagci ?

That sound great to me!

iMrDJAi commented 2 years ago

That sound great to me!

@kaanyagci Ok then, in that case getGroupPosts() and parsePost() would have 2 modes: desktop and mbasic.
A new method will be added to the Post interface: .refresh(), which opens the desktop version of the post page and uses parsePost() to update its data.
Any more suggestions?

kaanyagci commented 2 years ago

Sorry I was a little busy this weekend. I read the discussion one more time, and I think it will be just an overkill to implement mbasic.facebook.com.

First of all, we will still have a random sleep for the scrolling to avoid bot detection, so it will not be any faster as soon as you have a good internet connection to load images. To improve the speed, there's also an option to block all assets from network communications. When this option is enabled, the browser will load only the text content to be faster.

For the @wvffle's RSS feed request, he came with the same request a while ago. To be honest, I don't understand the use-case of the RSS feed output from an npm module. An RSS feed should respond instantly and not in several minutes because the RSS request triggered a scraper. If you really want an RSS output, you can push the output of the scraper in a database like MySQL and implement another service to get the RSS feed from this database. You can run a cron job to update the database regularly, and your RSS feed requests respond instantly because it will be just a read from the database.

So, if we must alternate between 2 different websites for no relevant reason, I think this will be just a waste of time.

iMrDJAi commented 2 years ago

@kaanyagci

First of all, we will still have a random sleep for the scrolling to avoid bot detection, so it will not be any faster as soon as you have a good internet connection to load images.

That's not the case for mbasic.facebook.com, since it provides server-side pre-rendered HTML pages you won't scroll, you can just parse the HTML to extract the data, then grab the next page link from the "See More Posts" button, then start over again. In fact this doesn't require puppeteer at all, you can simply achieve this by performing HTTP requests. I think mbasic.facebook.com is reasonable if you want speed.

For the @wvffle's RSS feed request, he came with the same request a while ago. I don't understand the use-case of the RSS feed output in the npm module. You can push the output of the scraper in a database like MySQL and implement another service to get the RSS feed from this database. You can run a cron job to update the database regularly, and your RSS feed requests respond instantly because it will be just a read from the database.

You're totally right. @wvffle for an RSS feed you should use caching instead, but as I told you before, mbasic.facebook.com won't help you in this case, you need a chronological sorting for your RSS feed and this version of Facebook doesn't provide that.

kaanyagci commented 2 years ago

For the private groups we will still need the login process and the cookie injection.

Although, as we need to send a request to get the post details each time we want to get the whole text or comments on the post, I think Facebook will detect a suspicious behavior if we send a hundreds of requests in a very short time 🤔

iMrDJAi commented 2 years ago

Although, as we need to send a request to get the post details each time we want to get the whole text or comments on the post, I think Facebook will detect a suspicious behavior if we send a hundreds of requests in a very short time 🤔

@kaanyagci That's right, this is not a good idea.

kaanyagci commented 2 years ago

I'll close the issue then. Please feel free to re-open the issue or create a discussion topic for any kind of ideas on this or other subjects.

Makepad-fr / fbjs

Consider using `mbasic.facebook.com` #61