Closed: loganwilliams closed this issue 1 year ago
The tough part here is the automated provisioning of infrastructure such as S3-compatible storage. Not that this isn't automatable, but you would probably end up writing a tool that holds full admin credentials to provision cloud infrastructure at a provider on the user's behalf. How is the program meant to be used? As a centrally hosted service, or should every user be able to host the auto-archiver on their own?
A first step could be to containerize the application with Docker, put the cron job inside the container, and see where it goes from there? I could help with that. I could also help with describing the infrastructure to be provisioned as code. However, that is still a fair way from a user-friendly setup, even for non-programmers...
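For illustration only, a minimal sketch of what such a provisioning step might look like with boto3 against an S3-compatible endpoint (the endpoint, region, keys, and bucket name below are placeholders, not anything the project uses today):

```python
# Hypothetical sketch: provisioning an S3-compatible bucket on the user's behalf.
# Assumes boto3 is installed. The keys here would have to be account-level admin
# credentials, which is exactly the credential-handling problem described above.
import boto3

session = boto3.session.Session()
client = session.client(
    "s3",
    region_name="ams3",                                  # placeholder region
    endpoint_url="https://ams3.digitaloceanspaces.com",  # placeholder S3-compatible endpoint
    aws_access_key_id="ADMIN_KEY",                       # full admin credentials...
    aws_secret_access_key="ADMIN_SECRET",                # ...held by the provisioning tool
)

client.create_bucket(Bucket="auto-archiver-example")     # placeholder bucket name
```

Even this tiny example needs account-level keys, so the question of who holds them and on whose behalf remains.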
@borismw You're exactly right about the challenges here. I definitely wasn't imagining a centrally hosted service; that's difficult for both cost and security reasons. I was thinking that setting it up to use Google Drive storage (authorized by the user as an OAuth app) might be a good alternative storage destination that would require less administration.
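As a rough sketch of what that could look like, assuming google-api-python-client and google-auth-oauthlib (the credentials file, file name, and scope below are placeholders rather than anything already in the codebase):

```python
# Hypothetical sketch: uploading an archived file to the user's own Google Drive
# after a one-time OAuth consent. "credentials.json" is a placeholder OAuth client
# file downloaded from the Google Cloud console; the file name is made up.
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

SCOPES = ["https://www.googleapis.com/auth/drive.file"]

flow = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES)
creds = flow.run_local_server(port=0)   # user authorizes in the browser once

drive = build("drive", "v3", credentials=creds)
metadata = {"name": "archived_page.html"}        # placeholder file name
media = MediaFileUpload("archived_page.html")    # placeholder local file
uploaded = drive.files().create(body=metadata, media_body=media, fields="id").execute()
print("Uploaded file id:", uploaded["id"])
```

The appeal is that the archive lands in storage the user already owns, so no bucket has to be provisioned or paid for on their behalf.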
Hi @loganwilliams and @borismw - agreed that this should be easier to set up and run.
I've mostly automated my server build and cron job setup: https://github.com/djhmateer/auto-archiver/tree/main/infra ... will PR this when it's ready, if you like.
I am also writing a Google Drive (albeit service account) implementation right now. Should have a POC running next week, and can submit a PR.
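Roughly, the main difference from a user-OAuth setup is how the credentials are loaded; a minimal sketch, assuming google-auth and a placeholder key file (not the actual PR code):

```python
# Hypothetical sketch: same Drive v3 client, but authenticated with a service
# account instead of user OAuth. "service_account.json" is a placeholder key file;
# the target Drive folder must be shared with the service account's email address.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/drive"]
creds = service_account.Credentials.from_service_account_file(
    "service_account.json", scopes=SCOPES
)
drive = build("drive", "v3", credentials=creds)
```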
This looks really great. I'm playing around with an install in Docker, which looks similar. Perhaps containerizing this service could enable running it as a scheduled containerized workload. Then running this scraping service would be pay-per-use, and since neither CPU- nor memory-intensive tasks are performed, costs could shrink significantly. This of course wouldn't make the auto_archiver easier to install, but it might then be feasible to offer the service to a larger group of people. Well, a lot of ifs. I'll see where this leads me.
Good thoughts @borismw. I've got a PR in just now which reduces the memory overhead of Firefox by reloading the driver on each row, and I'm noticing only about 2 GB used of the 4 GB I've provisioned. Before that I had 8 GB of usage over a long run.
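Not the actual PR, but the general "reload the driver on each row" pattern looks roughly like this (archive_row, rows, and the headless flag are illustrative placeholders, not the project's real API):

```python
# Hypothetical sketch: quitting Firefox after every row keeps its memory from
# accumulating over a long run, at the cost of a slower startup per row.
from selenium import webdriver

def new_driver():
    options = webdriver.FirefoxOptions()
    options.add_argument("--headless")
    return webdriver.Firefox(options=options)

def archive_row(driver, url):
    # Stand-in for the real per-row archiving logic: just load the page.
    driver.get(url)

rows = ["https://example.com"]   # stand-in for the spreadsheet rows being processed

for url in rows:
    driver = new_driver()        # fresh Firefox for each row...
    try:
        archive_row(driver, url)
    finally:
        driver.quit()            # ...and quit it so memory is released before the next row
```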
As much as I love containers, I've been bitten by edge cases, so I have moved more towards script automation for building servers. The end results are similar, i.e. one-click deploys.
I've got a https://www.proxmox.com/en/ server running with auto-archiver instances on it, which works well. Happy to chat about a service with you or anyone, as I'm thinking along similar lines! davemateer@gmail.com is my personal email.
Stale issue
Currently, running this project in an automated way requires creating a DigitalOcean Spaces bucket and manually managing cron jobs on a Linux server. Ideally, this would be simpler to deploy so that a new archiving spreadsheet could be set up in a user-friendly way, even for non-programmers.
One promising possibility for moving in this direction is a Google Sheets add-on, but other ideas can also be explored and evaluated.