bellingcat / auto-archiver

Automatically archive links to videos, images, and social media content from Google Sheets (and more).

Facebook archiving #105

Open · djhmateer opened 8 months ago

djhmateer commented 8 months ago

I've got a Facebook archiver working using wacz_enricher.py:

https://github.com/djhmateer/auto-archiver/blob/v6-test/src/auto_archiver/enrichers/wacz_enricher.py#L159

I'm using a stored browser profile so that I can get images which require you to be logged in.
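
For reference, this is roughly how a stored profile gets passed to browsertrix-crawler (which the wacz enricher wraps). A minimal sketch only: the Docker image name, flags, and paths are assumptions, and the real invocation is in wacz_enricher.py at the link above.

```python
import subprocess

# Minimal sketch, not the actual wacz_enricher.py code: run browsertrix-crawler
# in Docker with a previously created, logged-in browser profile so that
# login-gated Facebook images can be captured. Flags and paths are assumptions.
def capture_with_profile(url: str, profile_tar: str, crawls_dir: str) -> None:
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{crawls_dir}:/crawls",
        "-v", f"{profile_tar}:/profile.tar.gz",   # stored login profile
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", url,
        "--profile", "/profile.tar.gz",           # reuse the logged-in session
        "--generateWACZ",
        "--limit", "1",
    ]
    subprocess.run(cmd, check=True)
```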

I'm running this archiver from a residential IP, because if it's run from a cloud IP then FB will block the requests.

This archiver runs alongside the main archiver (which runs in the cloud).

It may be that this can be much simpler if I can run everything sequentially (and not on two servers). I need to wait for more bandwidth on the residential network, then I can potentially do a PR.

Also, I've found I need to keep testing the profile, as it needs to be logged in again after a few weeks.
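
A rough idea of what that testing could look like (purely illustrative; the marker strings are guesses and not part of the current code):

```python
# Purely illustrative: decide whether a captured Facebook page came back behind
# a login wall, which would mean the stored profile needs to be re-logged in.
# The marker strings are guesses and may need updating as Facebook changes.
def looks_logged_out(html: str) -> bool:
    markers = ("log in to facebook", "login_form", "/login/?next=")
    lowered = html.lower()
    return any(marker in lowered for marker in markers)
```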

msramalho commented 7 months ago

Looking forward to that PR. We could indeed add an option to run a specific archiver via a residential IP proxy.
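
The idea would be along these lines (a minimal sketch using the requests library; the proxy URL is a placeholder and this is not an existing auto-archiver option):

```python
import requests

# Sketch only: route a single archiver's HTTP requests through a residential
# proxy while everything else keeps using the server's own IP. The proxy URL
# is a placeholder, not a real endpoint.
RESIDENTIAL_PROXY = "http://user:pass@residential-proxy.example:8080"

def fetch_via_residential_ip(url: str, timeout: int = 30) -> requests.Response:
    proxies = {"http": RESIDENTIAL_PROXY, "https": RESIDENTIAL_PROXY}
    return requests.get(url, proxies=proxies, timeout=timeout)
```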

msramalho commented 4 months ago

Taking another look at this: can you clarify whether you're doing any extra downloads/requests, or simply parsing data from inside the wacz?

djhmateer commented 4 months ago

Hi Miguel

From:

https://github.com/djhmateer/auto-archiver/blob/v6-test/src/auto_archiver/enrichers/wacz_enricher.py#L159

It's probably best to follow along with the link above.

Apart from the /photo special case, I get the root page and parse it for resources to extract the fb_id and set_id. Then I jump down to

https://github.com/djhmateer/auto-archiver/blob/v6-test/src/auto_archiver/enrichers/wacz_enricher.py#L400

which does another request (and another wacz download), then returns the next fb_id back to the main function above.
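
In pseudocode, the loop looks roughly like this (a simplified sketch; the helper names and regex patterns are illustrative, not the actual code at the links above):

```python
import re

def extract_first(pattern: str, text: str) -> str | None:
    """Return the first regex capture group found in text, or None."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

# Simplified sketch of the flow described above; patterns are illustrative.
def archive_photo_set(root_url: str, fetch, max_photos: int = 50) -> list[str]:
    """fetch(url) is assumed to return the page HTML and, in the real
    enricher, also produce a wacz capture of that request."""
    html = fetch(root_url)                               # root page request
    fb_id = extract_first(r'fbid[=:"]+(\d+)', html)      # assumed pattern
    set_id = extract_first(r'set[=:"]+([\w.]+)', html)   # assumed pattern
    archived: list[str] = []
    while fb_id and fb_id not in archived and len(archived) < max_photos:
        archived.append(fb_id)
        photo_url = f"https://www.facebook.com/photo/?fbid={fb_id}&set={set_id}"
        html = fetch(photo_url)                          # another request/capture
        # the next photo's id is parsed out of this page and fed back in
        fb_id = extract_first(r'"nextMediaAfterNodeId":"(\d+)"', html)  # assumed
    return archived
```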

Regards, Dave