"Session deleted because of page crash from tab crashed"

nickpreston24 commented 1 year ago

I'm not sure why this is happening, but this is what I get when I run sudo bash run.sh against your latest updates to main, @igisp.

igisp commented 1 year ago

Hi @MikePreston17, the run.sh script in its current form is not meant to work out of the box; it's more of a list of example commands. To make it work, first run docker build -t fb_posts . in the same directory as the Dockerfile, and then run an example command such as sudo docker run -v ~/test/volumes:/var/volumes --entrypoint python fb_posts scraper.py /var/volumes "https://www.facebook.com/iamthatprophet/videos/2530938387225313".

However, even though I was able to fix the Selenium and browser issue, the script itself (scraper.py) doesn't give a meaningful output right now, due to the changes on the FB side.

I have started looking into using the FB Graph API, but I don't have a working POC yet. The scraping approach is unreliable because the goalposts will keep moving, as just happened.

nickpreston24 commented 1 year ago

"doesn't give a meaningful output right now, due to the changes on the FB side."

Ok. Even if that's a problem, we have the comments and the general 'shape' of the HTML that we can still go off of. It's a bit desperate, but I can still parse it using advanced regex and get a Comment out, with some trouble.

For example, I captured a straight up Save As of the conversation with Josh who attacked Victor. That conversation had a bunch of encoded ids and classes, which are pretty useless to me.

However, I'm going to omit that junk and filter out HTML with the pattern </div></div></span></div></div></div></div><div class=..., which seems to include a comment anytime I see that specific pattern.

That might just be sufficient. That, and tracking the 'lastScraped' in our Airtable database.

From there, I can diff the classless HTML and Braden can style it.
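The filtering idea above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the closing-tag run is taken as a boundary between comments, the elided `<div class=...` tail is matched loosely as "a div follows", and the helper names (`strip_classes_and_ids`, `split_comments`) are hypothetical, not part of scraper.py.

```python
from __future__ import annotations

import re

# Assumption: this run of closing tags (quoted in the thread) marks a comment
# boundary. The part after it is elided in the thread, so we only require
# that another <div ...> follows.
BOUNDARY = re.compile(r"</div></div></span></div></div></div></div>(?=<div\b)")
# Encoded class/id attributes such as class="xabc123" carry no meaning for us.
ATTR = re.compile(r'\s+(?:class|id)="[^"]*"')
TAG = re.compile(r"<[^>]+>")


def strip_classes_and_ids(html: str) -> str:
    """Return the HTML with class and id attributes removed ("classless" HTML)."""
    return ATTR.sub("", html)


def split_comments(html: str) -> list[str]:
    """Split saved HTML on the boundary pattern and return the text of each chunk."""
    texts = []
    for chunk in BOUNDARY.split(html):
        # Drop the remaining tags and collapse whitespace.
        text = " ".join(TAG.sub(" ", chunk).split())
        if text:
            texts.append(text)
    return texts
```

Regex over HTML is brittle by nature; a chunk here is only a candidate comment and would still need checking against a real Save As capture.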

As for the API, good luck. I don't get the sense that Comments are supported or a big priority for them, but I could be, and hopefully am, wrong about that. :)

nickpreston24 commented 1 year ago

Hi @MikePreston17, the run.sh script in its current form is not meant to work out of the box; it's more of a list of example commands. To make it work, first run docker build -t fb_posts . in the same directory as the Dockerfile, and then run an example command such as sudo docker run -v ~/test/volumes:/var/volumes --entrypoint python fb_posts scraper.py /var/volumes "https://www.facebook.com/iamthatprophet/videos/2530938387225313".

However, even though I was able to fix the Selenium and browser issue, the script itself (scraper.py) doesn't give a meaningful output right now, due to the changes on the FB side.

I have started looking into using the FB Graph API, but I don't have a working POC yet. The scraping approach is unreliable because the goalposts will keep moving, as just happened.

If you would, please send up the changes you made. I'm still getting build issues when running the docker build ... command you specified:

igisp commented 1 year ago

There are several things happening here:

1. What changes are you talking about here? ("If you would, please send up the changes you made.")
2. I had to change the chromium package version in the Dockerfile to make it build: 111.0.5563.110-1~deb11u1
3. It looks like all the older posts I had saved links to have no comments anymore, as if they have been purged or expired.
4. I have finally found a post that has at least some comments on it, and tried to run it through the script. It crashed with a weird error, so I would say my script is pretty much useless with the new FB design.
5. I am looking into the FB API. Will let you know if it's of any use.

nickpreston24 commented 1 year ago

Talking about this thing you said: "However, even though I was able to fix the Selenium and browser issue"

Thought you may have had a commit there, sorry.

igisp commented 1 year ago

I see. I did fix the Dockerfile in the last commit, but it just broke again. I can commit the newest fix, but the scraping script will probably crash anyway.

igisp commented 1 year ago

I just added a commit with the latest Dockerfile fix.

nickpreston24 commented 1 year ago

4. I have finally found a post that has at least some comments on it, and tried to run it through the script. It crashed with a weird error, so I would say my script is pretty much useless with the new FB design.

It doesn't need to extract or trace classes and IDs at all, since they seem to be encoding those now, e.g. xabc123 xkiy456 .... The HTML structure won't change (the devs won't do that frequently anyway).

The only thing I need it to do at this point is to grab the post as soon as it's requested. Use Selenium, Puppeteer, Playwright, whatever. Anything.

As long as your Python can make a POST request to my Airtable DB (the equivalent of a curl or JavaScript POST), we're fine.
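A minimal sketch of such a POST from Python, using only the standard library. The base ID, table name, token, and the url and html field names are all placeholders assumed for illustration; only lastScraped is mentioned in this thread, and the real Airtable schema may differ.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone

AIRTABLE_API = "https://api.airtable.com/v0"


def build_airtable_request(base_id, table, token, post_url, html):
    """Build (but do not send) a POST that creates one record in an Airtable table."""
    payload = {
        "records": [
            {
                "fields": {
                    # "url" and "html" are hypothetical field names.
                    "url": post_url,
                    "html": html,
                    "lastScraped": datetime.now(timezone.utc).isoformat(),
                }
            }
        ]
    }
    return urllib.request.Request(
        f"{AIRTABLE_API}/{base_id}/{urllib.parse.quote(table)}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Keeping the request builder separate from send() means the payload can be inspected or logged without touching the network.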

I can parse it before or after the fact, and very handily. (hit the Go button to see).

igisp commented 1 year ago

It doesn't even get to the point where it grabs anything. Selenium crashes at the attempt to open a post page. Besides, the main reason for using Selenium was to expand all the comments, which cannot be done without selecting the buttons/links that expand them. I am assuming all the comments have to be expanded when sent down the pipeline to your tool?
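For the crash on opening a page: "tab crashed" errors from Chrome inside Docker are commonly linked to the container's small /dev/shm. Whether that is the cause here is not confirmed in this thread, so treat the following as a sketch of flags worth trying, not a known fix. The function names are hypothetical.

```python
def docker_chrome_args():
    """Chrome flags commonly used when running headless inside a container."""
    return [
        "--headless=new",           # no display is available in the container
        "--no-sandbox",             # often needed when Chrome runs as root in Docker
        "--disable-dev-shm-usage",  # write shared memory to /tmp instead of /dev/shm
        "--disable-gpu",
    ]


def make_driver():
    """Create a Chrome driver with the flags above (requires selenium to be installed)."""
    # Imported lazily so docker_chrome_args() can be used without selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in docker_chrome_args():
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

An alternative to --disable-dev-shm-usage is to give the container more shared memory with the --shm-size option of docker run.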

nickpreston24 commented 1 year ago

I am assuming all the comments have to be expanded when sent down the pipeline to your tool?

Yes, absolutely.

You might try Puppeteer or Playwright. I have a script somewhere that I used for another project. It's in JavaScript, though.

HarvestHaven / facebook-scraper

"Session deleted because of page crash from tab crashed" #1