A solution to scrape Markdown from posts

iMrDJAi commented 3 years ago

Note: This solution applies to the desktop version of the Facebook website, just like the other solutions I'm proposing to improve this library. You should switch from the mobile version first; then I'll start making some pull requests.

Scraping text from posts on the desktop version is much more complicated than on the mobile version, since the text comes in the form of HTML elements rather than plain text. The key here is finding the right selector for the post body. As for the other elements we need to scrape, like images, videos, the submission permalink, and other stuff, that needs a separate issue and a deeper discussion.

Anyway, using the browser inspector we can see what it looks like under the hood:

You'll notice that it's located between two pseudo-elements (::before and ::after). We just need to copy the .innerHTML of the parent element, then convert it to Markdown. There is a very good library for that called turndown, and as you can see from the image below, we MADE IT!
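To make the conversion step concrete, here is a minimal, dependency-free sketch of what "HTML in, Markdown out" means for a post body. It is only a stand-in for illustration: `htmlToMarkdown` is a hypothetical helper, it handles just a handful of tags, and its output will not match turndown's exactly. In a real implementation you would pass the copied .innerHTML string to turndown (`new TurndownService().turndown(html)`) rather than use regular expressions.

```typescript
// Hypothetical, simplified stand-in for the HTML -> Markdown step.
// It covers only paragraphs, line breaks, links, bold and italic.
function htmlToMarkdown(html: string): string {
  return html
    // Closing block tags become paragraph breaks.
    .replace(/<\/(p|div)>/gi, "\n\n")
    // <br> becomes a single newline.
    .replace(/<br\s*\/?>/gi, "\n")
    // <a href="url">text</a> -> [text](url)
    .replace(/<a\b[^>]*?href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi, "[$2]($1)")
    // <b>/<strong> -> **text**
    .replace(/<(b|strong)\b[^>]*>([\s\S]*?)<\/\1>/gi, "**$2**")
    // <i>/<em> -> *text*
    .replace(/<(i|em)\b[^>]*>([\s\S]*?)<\/\1>/gi, "*$2*")
    // Drop every remaining tag (wrappers, spans, and so on).
    .replace(/<[^>]+>/g, "")
    // Decode the most common entities; &amp; goes last to avoid double decoding.
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&")
    // Collapse runs of blank lines and trim the ends.
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

// Example input resembling a copied post body (made up for this sketch).
const sampleHtml =
  '<div><p>Hello <strong>world</strong>, see ' +
  '<a href="https://example.com">this</a><br>next &amp; last</p></div>';
const sampleMarkdown = htmlToMarkdown(sampleHtml);
```

A regex approach like this breaks on nested or malformed markup, which is exactly why a proper converter such as turndown is the better choice for the library.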

Another issue is the See More button: you should click it first to allow more text to appear:
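The ordering matters here: expand first, then read the body. Below is a rough sketch of that flow. Everything in it is an assumption for illustration: `PostElement` is a hypothetical minimal interface (a thin wrapper around a Puppeteer element handle could provide it), `readExpandedBody` is a made-up name, and the selectors are placeholders that the caller has to supply, since the real ones have to be found with the inspector.

```typescript
// Hypothetical minimal view of a post element; not an actual Puppeteer type.
interface PostElement {
  find(selector: string): Promise<PostElement | null>;
  click(): Promise<void>;
  innerHTML(): Promise<string>;
}

// Click "See More" if it is present, then return the body's HTML.
// Returns null when the body selector matches nothing.
async function readExpandedBody(
  post: PostElement,
  bodySelector: string,
  seeMoreSelector: string,
): Promise<string | null> {
  const seeMore = await post.find(seeMoreSelector);
  if (seeMore !== null) {
    await seeMore.click();
  }
  // Look the body up only after the click so the expanded text is included.
  const body = await post.find(bodySelector);
  return body === null ? null : body.innerHTML();
}
```

Short posts have no See More button at all, so the click has to be optional rather than an error. On a live page you would probably also need to wait for the expanded text to render before reading it, which this sketch leaves out.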

And that's all. I hope this information helps <3

iMrDJAi commented 3 years ago

cc @kaanyagci

kaanyagci commented 3 years ago

Hi @iMrDJAi, thank you very much for your feedback, it's perfectly detailed.

It seems doable to me, but as you know, for the moment the first priority for the library is the TypeScript migration + NodeJS module support. That's why I've added this to P3.

Makepad-fr / fbjs

A solution to scrape Markdown from posts #35