bellingcat / auto-archiver

Automatically archive links to videos, images, and social media content from Google Sheets (and more).

Facebook archiving #105

Open · djhmateer opened 8 months ago

djhmateer commented 8 months ago

I've got a Facebook archiver working using wacz_enricher.py:

https://github.com/djhmateer/auto-archiver/blob/v6-test/src/auto_archiver/enrichers/wacz_enricher.py#L159

I'm using a stored browser profile so that I can get images which require you to be logged in.
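
For reference, this is roughly how a stored profile gets passed to browsertrix-crawler (which the wacz enricher wraps). A minimal sketch only: the Docker image name, flags, and paths are assumptions, and the real invocation is in wacz_enricher.py at the link above.

```python
import subprocess

# Minimal sketch, not the actual wacz_enricher.py code: run browsertrix-crawler
# in Docker with a previously created, logged-in browser profile so that
# login-gated Facebook images can be captured. Flags and paths are assumptions.
def capture_with_profile(url: str, profile_tar: str, crawls_dir: str) -> None:
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{crawls_dir}:/crawls",
        "-v", f"{profile_tar}:/profile.tar.gz",   # stored login profile
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", url,
        "--profile", "/profile.tar.gz",           # reuse the logged-in session
        "--generateWACZ",
        "--limit", "1",
    ]
    subprocess.run(cmd, check=True)
```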

I'm running this archiver from a residential IP, because if it's run from a cloud IP then FB will block the requests.

This archiver runs alongside the main archiver (which runs in the cloud).

It may be that this can be much simpler if I can run everything sequentially (and not on two servers). I need to wait for more bandwidth on the residential network, then I can potentially do a PR.

Also, I've found I need to keep testing the profile, as it needs to be logged in again after a few weeks.
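
A rough idea of what that testing could look like (purely illustrative; the marker strings are guesses and not part of the current code):

```python
# Purely illustrative: decide whether a captured Facebook page came back behind
# a login wall, which would mean the stored profile needs to be re-logged in.
# The marker strings are guesses and may need updating as Facebook changes.
def looks_logged_out(html: str) -> bool:
    markers = ("log in to facebook", "login_form", "/login/?next=")
    lowered = html.lower()
    return any(marker in lowered for marker in markers)
```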

msramalho commented 7 months ago

Looking forward to that PR. We could indeed add an option to run a specific archiver via a residential IP proxy.
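
The idea would be along these lines (a minimal sketch using the requests library; the proxy URL is a placeholder and this is not an existing auto-archiver option):

```python
import requests

# Sketch only: route a single archiver's HTTP requests through a residential
# proxy while everything else keeps using the server's own IP. The proxy URL
# is a placeholder, not a real endpoint.
RESIDENTIAL_PROXY = "http://user:pass@residential-proxy.example:8080"

def fetch_via_residential_ip(url: str, timeout: int = 30) -> requests.Response:
    proxies = {"http": RESIDENTIAL_PROXY, "https": RESIDENTIAL_PROXY}
    return requests.get(url, proxies=proxies, timeout=timeout)
```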

msramalho commented 4 months ago

Taking another look at this: can you clarify whether you're doing any extra downloads/requests, or simply parsing data from inside the wacz?

djhmateer commented 4 months ago

Hi Miguel

From:

https://github.com/djhmateer/auto-archiver/blob/v6-test/src/auto_archiver/enrichers/wacz_enricher.py#L159

It's probably best to follow along with the link above.

Apart from the /photo special case, I get the root page and parse it for resources to extract the fb_id and set_id. Then I jump down to

https://github.com/djhmateer/auto-archiver/blob/v6-test/src/auto_archiver/enrichers/wacz_enricher.py#L400

which does another request (and another wacz download), then returns the next fb_id back to the main function above.
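
In pseudocode, the loop looks roughly like this (a simplified sketch; the helper names and regex patterns are illustrative, not the actual code at the links above):

```python
import re

def extract_first(pattern: str, text: str) -> str | None:
    """Return the first regex capture group found in text, or None."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

# Simplified sketch of the flow described above; patterns are illustrative.
def archive_photo_set(root_url: str, fetch, max_photos: int = 50) -> list[str]:
    """fetch(url) is assumed to return the page HTML and, in the real
    enricher, also produce a wacz capture of that request."""
    html = fetch(root_url)                               # root page request
    fb_id = extract_first(r'fbid[=:"]+(\d+)', html)      # assumed pattern
    set_id = extract_first(r'set[=:"]+([\w.]+)', html)   # assumed pattern
    archived: list[str] = []
    while fb_id and fb_id not in archived and len(archived) < max_photos:
        archived.append(fb_id)
        photo_url = f"https://www.facebook.com/photo/?fbid={fb_id}&set={set_id}"
        html = fetch(photo_url)                          # another request/capture
        # the next photo's id is parsed out of this page and fed back in
        fb_id = extract_first(r'"nextMediaAfterNodeId":"(\d+)"', html)  # assumed
    return archived
```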

Regards, Dave