Open nickpreston24 opened 1 year ago
Hi @MikePreston17, the run.sh script in its current form is not meant to work out of the box. It's more like a list of example commands. So in order to make it work, you would want to first run this command in the same dir as the Dockerfile:
docker build -t fb_posts .
and then run an example command:
sudo docker run -v ~/test/volumes:/var/volumes --entrypoint python fb_posts scraper.py /var/volumes "https://www.facebook.com/iamthatprophet/videos/2530938387225313"
However, even though I was able to fix the Selenium and browser issue, the script itself (scraper.py) doesn't give a meaningful output right now, due to the changes on the FB side.
I have started looking into using the FB Graph API, but I don't have a working POC yet. The scraping approach is unreliable because the goalposts will keep moving, as just happened.
"doesn't give a meaningful output right now, due to the changes on the FB side."
Ok. Even if that's a problem, we have the comments and the general 'shape' of the HTML that we can still go off of. It's a bit desperate, but I can still parse it using advanced regex and get a Comment, with some trouble.
For example, I captured a straight-up Save As of the conversation with Josh who attacked Victor. That conversation had a bunch of encoded ids and classes, which are pretty useless to me.
However, I'm going to omit that junk and filter out HTML with the pattern </div></div></span></div></div></div></div><div class=..., which seems to include a comment anytime I see that specific pattern.
That might just be sufficient. That, and tracking the 'lastScraped' in our Airtable database.
From there, I can diff the classless HTML and Braden can style it.
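A minimal sketch of that filtering idea in Python, not the code actually used in this thread. It assumes the delimiter is the run of closing tags quoted above followed by any opening div (the original pattern is elided after `<div class=`), and that the comment text sits in the chunk just before each delimiter. The function names are hypothetical.

```python
import html
import re

# Assumption: the literal run of closing tags quoted in the thread, followed
# by any opening <div ...>. The original pattern is elided after "<div class=".
DELIMITER_RE = re.compile(
    re.escape("</div></div></span></div></div></div></div>") + r"<div[^>]*>"
)

# Matches class="..." or id="..." attributes (single- or double-quoted).
ATTR_RE = re.compile(r"""\s+(?:class|id)\s*=\s*(?:"[^"]*"|'[^']*')""")
TAG_RE = re.compile(r"<[^>]+>")


def strip_classes_and_ids(markup: str) -> str:
    """Drop the encoded class/id attributes so two snapshots can be diffed."""
    return ATTR_RE.sub("", markup)


def extract_comment_texts(markup: str) -> list[str]:
    """Return the visible text of every chunk that precedes a delimiter."""
    chunks = DELIMITER_RE.split(markup)
    texts = []
    for chunk in chunks[:-1]:  # the last chunk is not followed by a delimiter
        text = html.unescape(TAG_RE.sub(" ", chunk))
        text = " ".join(text.split())  # collapse runs of whitespace
        if text:
            texts.append(text)
    return texts
```

Regex over HTML is fragile by nature, so this would only hold as long as the saved pages keep that closing-tag shape. The first chunk may also carry page chrome ahead of the first comment and would need trimming.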
As for the API, good luck. I don't get the sense that Comments are supported or a big priority for them, but I could be, and hopefully am, wrong about that. :)
If you would, please send up the changes you made. I'm still getting build issues when running the docker build ... command you specified:
There are several things happening here:
chromium package version in Dockerfile to make it build: 111.0.5563.110-1~deb11u1
Thought you may have had a commit there, sorry.
I see. I did fix Dockerfile in the last commit. But it just got broken again. I can commit the newest fix, but the scraping script will probably crash anyways.
I just added a commit with the latest Dockerfile fix.
4. I have finally found a post that has at least some comments on it, and tried to run it through the script. It crashed with a weird error, so I would say my script is pretty much useless with the new FB design.
It need not be useful at all for extracting classes and ids or for tracing them, either, since they seem to be encoding those now (e.g. xabc123 xkiy456 ...). The HTML structure won't change (the devs won't do that frequently anyways).
The only thing I need it to do at this point is to grab the post as soon as it's requested. Use Selenium, Puppeteer, Playwright, whatever. Anything.
As long as your Python can make a curl/Javascript POST request to my Airtable DB, we're fine.
I can parse it before or after the fact, and very handily. (hit the Go button to see).
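For the hand-off to Airtable, something like the following could work from Python using only the standard library. It is a sketch under assumptions: the base ID, table name, token, and the field names `url`, `html`, and `lastScraped` are placeholders, since the thread does not give the actual schema. The endpoint shape and Bearer header follow Airtable's public REST API.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone

AIRTABLE_API = "https://api.airtable.com/v0"


def build_airtable_request(base_id, table, token, post_url, raw_html):
    """Build (but do not send) a POST that creates one record."""
    payload = {
        "records": [
            {
                "fields": {
                    # Hypothetical field names; match them to the real table.
                    "url": post_url,
                    "html": raw_html,
                    "lastScraped": datetime.now(timezone.utc).isoformat(),
                }
            }
        ]
    }
    return urllib.request.Request(
        f"{AIRTABLE_API}/{base_id}/{urllib.parse.quote(table, safe='')}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def push_post(base_id, table, token, post_url, raw_html):
    """Send the request and return Airtable's decoded JSON response."""
    request = build_airtable_request(base_id, table, token, post_url, raw_html)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Building the request separately from sending it lets the payload be checked without touching the network.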
It doesn't even get to the point where it grabs anything. Selenium crashes at the attempt to open a post page. Besides, the main reason for using Selenium was to expand all the comments, which cannot be done without selecting the buttons/links that expand them. I am assuming all the comments have to be expanded when sent down the pipeline to your tool?
"I am assuming all the comments have to be expanded when sent down the pipeline to your tool?"
Yes, absolutely.
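Once the page does open, the expand-everything step might look roughly like this. It is only a sketch: the XPath is a hypothetical placeholder, because the real markup of the expand links is not shown in this thread. The `driver` is assumed to be a Selenium 4 WebDriver, or anything else exposing `find_elements(by, value)`.

```python
import time

# Hypothetical locator: the actual text and markup of the expand links
# are not given in this thread and will need to be adjusted.
EXPAND_XPATH = "//div[@role='button'][contains(., 'View more comments')]"


def expand_all_comments(driver, xpath=EXPAND_XPATH, max_rounds=50, pause=1.0):
    """Click matching expand controls until none remain; return click count."""
    clicks = 0
    for _ in range(max_rounds):
        # "xpath" is the string value of selenium's By.XPATH constant.
        buttons = driver.find_elements("xpath", xpath)
        if not buttons:
            break
        clicked_this_round = 0
        for button in buttons:
            try:
                button.click()
                clicked_this_round += 1
            except Exception:
                # Broad on purpose so the sketch needs no Selenium import;
                # stale or covered elements are skipped and retried next round.
                continue
        if clicked_this_round == 0:
            break  # nothing is clickable, so stop instead of spinning
        clicks += clicked_this_round
        time.sleep(pause)  # give newly loaded comments time to render
    return clicks
```

The `max_rounds` cap is there so a button that never disappears cannot loop forever.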
You might try Puppeteer/Playwright. I have one somewhere I used for another project. It's in Javascript, though.
I'm not sure why this is happening, but this is what I get when I run sudo bash run.sh from your last updates to main, @igisp.