Closed iMrDJAi closed 3 years ago
Hi,
Thank you for your contribution. I changed the base branch from `master` to `development` as some things are still incomplete. I'll try to take a closer look, probably this weekend.
Using another npm module to convert post contents to markdown could be overkill and may also become a maintainability problem. But we can implement something that strips all HTML tags from the string content, without necessarily converting it to markdown.
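A minimal sketch of that tag-stripping idea, assuming a plain regex pass is acceptable for post contents. The helper name `stripHtmlTags` is hypothetical and not part of this PR, and a regex is not a full HTML parser, so unusual markup may need extra handling.

```typescript
// Hypothetical helper (not code from this PR): reduce an HTML fragment to
// plain text without pulling in a markdown or HTML-parsing dependency.
function stripHtmlTags(html: string): string {
  return html
    .replace(/<br\s*\/?>/gi, "\n") // keep explicit line breaks as newlines
    .replace(/<[^>]*>/g, "")       // drop every remaining tag
    .replace(/&nbsp;/g, " ")       // decode a few common entities
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, "&")        // decode "&amp;" last to avoid double decoding
    .trim();
}
```

Only a handful of entities are decoded here; anything beyond that is left untouched on purpose.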
@kaanyagci The point of the markdown format is to give users a minimal output that can be used to re-render a post's content exactly like the original. I only suggested it to avoid including the HTML, which may be large and not human-readable.
But I think you're right: we should reduce the number of third-party modules. It's also a good idea to let users handle the output themselves, and we can provide examples of how to do that.
In that case, we have to include the innerHTML along with the innerText in the output.
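As a rough sketch of what "both formats in the output" could look like: the `PostContent` shape and the `toPostContent` helper below are assumptions made for illustration, not fields of the existing GroupPost interface.

```typescript
// Hypothetical output shape: plain text for quick reading, raw HTML for
// users who want to re-render or convert the post themselves.
interface PostContent {
  text: string; // taken from innerText
  html: string; // taken from innerHTML
}

// Structural stand-in for a DOM element, so the sketch runs outside a browser.
interface ElementLike {
  innerText: string;
  innerHTML: string;
}

function toPostContent(el: ElementLike): PostContent {
  return { text: el.innerText.trim(), html: el.innerHTML.trim() };
}
```

A real DOM element has both properties, so the same mapping could be applied inside a page callback, with any conversion to markdown left to the user.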
After some testing, I've noticed that when you scrape without authentication, some posts don't provide the author's profile URL; in that case the selector group_post_author won't work.
Also, elements are loaded quite differently in the desktop layout: they aren't loaded until they show up in the viewport, so we should start scrolling before scraping.
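The scroll-before-scrape step could be sketched roughly like this. It is written against a small stand-in interface rather than Puppeteer's own types, and relies only on `evaluate` accepting a string expression; the function name, the round limit, and the pause length are all assumptions, not values from this PR.

```typescript
// Stand-in for the part of a Puppeteer page this sketch needs.
interface EvaluatingPage {
  evaluate(expression: string): Promise<unknown>;
}

// Scroll to the bottom repeatedly until the page height stops growing or
// maxRounds is reached. Returns the number of scrolls performed.
async function scrollUntilStable(
  page: EvaluatingPage,
  maxRounds = 20, // hypothetical cap; a group feed can grow almost without end
  pauseMs = 500   // hypothetical wait for lazily loaded elements
): Promise<number> {
  let previousHeight = -1;
  let rounds = 0;
  while (rounds < maxRounds) {
    const height = Number(await page.evaluate("document.body.scrollHeight"));
    if (height === previousHeight) break; // nothing new was loaded
    previousHeight = height;
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)");
    await new Promise<void>((resolve) => setTimeout(resolve, pauseMs));
    rounds++;
  }
  return rounds;
}
```

The cap matters more than the stability check for a feed that keeps loading older posts; scraping would start once this returns.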
@kaanyagci So yeah, I did it! The scraper now works perfectly with the new desktop layout of Facebook, and it has the same functionality as the one from the master branch. I think it's time to merge this into the development branch (after reviewing and testing it, of course). Other features and new fields for the GroupPost interface should be added in a separate pull request to keep things organized!
@All-Contributors please add @iMrDJAi for code
@kaanyagci I've put up a pull request to add @iMrDJAi! :tada:
@iMrDJAi This is excellent news! I was really busy with other stuff today. I'll test this first thing tomorrow! Great job! 💯
@kaanyagci Any updates? Have you tested it? Any issues?
Sorry for the delay. I was still a little busy :( I'll look ASAP.
Just checked. Sadly, I cannot get it to work.
```ts
import { FB } from './index';

async function main() {
  const f = await FB.init({
    debug: true,
    output: 'test.json',
    headless: false,
    groupIds: ['774278349295443'],
    useCookies: true,
    disableAssets: true,
  });
  f.login('
main().then(() => { console.log('Done'); });
```
Gives the following output:
```sh
/Users/kaanyagci/Documents/makepad/fbjs/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:115
    ? new Error(`${response.errorText} at ${url}`)
    ^
Error: net::ERR_ABORTED at https://facebook.com
    at navigate (/Users/kaanyagci/Documents/makepad/fbjs/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:115:23)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async FrameManager.navigateFrame (/Users/kaanyagci/Documents/makepad/fbjs/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:90:21)
    at async Frame.goto (/Users/kaanyagci/Documents/makepad/fbjs/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:416:16)
    at async Page.goto (/Users/kaanyagci/Documents/makepad/fbjs/node_modules/puppeteer/lib/cjs/puppeteer/common/Page.js:819:16)
    at async Facebook.login (/Users/kaanyagci/Documents/makepad/fbjs/dist/lib/models/fb.js:113:9)
```
Note: The output is the same for both headless and not headless modes.
I'll try to investigate these issues as soon as possible this week
@kaanyagci Interesting. In fact, I haven't tried logging in; I've always been testing in userless mode. I'll try that later and check what's going on.
For now you can try this:
```js
;(async () => {
  const { FB } = require("@makepad/fbjs")
  const fb = await FB.init({
    headless: true,
    useCookies: false,
    output: ''
  })
  //await fb.getGroupPosts("319144912641926", "./output.json")
  await fb.getGroupPosts("319144912641926")
})()
```
The second issue I've faced: without logging in, the web page used for group details is still the mobile page, m.facebook.com.
@kaanyagci That doesn't make sense, I'm 100% sure that I totally removed the mobile website. Fork my master branch again.
My bad, I was trying on another branch 🤦
This looks great, actually. For the first issue, I've added the userAgent, as Facebook rejects connections from headless browsers. I'll add this line once it's merged into the `development` branch! Anyway, great work @iMrDJAi!
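The exact line added is not shown in this thread. One way such a user-agent override could look is sketched below: headless Chrome reports `HeadlessChrome/` in its default user agent, so the sketch swaps that token before applying the string. The interfaces are structural stand-ins for the Puppeteer `userAgent()` and `setUserAgent()` calls, and the helper names are hypothetical.

```typescript
// Replace the headless marker in Chrome's default user agent string.
function normalizeUserAgent(userAgent: string): string {
  return userAgent.replace("HeadlessChrome/", "Chrome/");
}

// Stand-ins for the two Puppeteer methods this sketch relies on.
interface BrowserLike {
  userAgent(): Promise<string>;
}
interface PageLike {
  setUserAgent(userAgent: string): Promise<void>;
}

// Hypothetical helper: apply the normalized user agent before navigating.
async function useRegularUserAgent(browser: BrowserLike, page: PageLike): Promise<string> {
  const userAgent = normalizeUserAgent(await browser.userAgent());
  await page.setUserAgent(userAgent);
  return userAgent;
}
```

This would need to run before the first `goto` call in the login flow, since the user agent applies to requests made afterwards.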
The mobile layout of Facebook provides limited data and low-quality media; because of that, the project should switch entirely to the desktop layout.
Todo list:
These are all the selectors explained: