ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Possible to run in the cloud? #159

Closed BradCoffield closed 4 years ago

BradCoffield commented 4 years ago

Does anyone have any insight into the feasibility of setting up an instance of grab-site on a service like Heroku? I'd like to do so in order to automate scrapes that need to happen monthly and weekly. I'd also like to use cloud functions to pick up the output and save the files to AWS.

I'm researching but it seems like maybe installing grab-site on Heroku isn't possible and I was hoping to get input before potentially wasting a bunch more time. Thanks!
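
A rough sketch of the "save the output to AWS" step described above, in case it helps frame the question. It assumes the crawl leaves `*.warc.gz` files under a crawl directory and that `boto3` is installed with credentials configured. The function names, the `crawls` key prefix, and the bucket name are all hypothetical, not part of grab-site.

```python
from pathlib import Path


def find_warcs(crawl_dir):
    """Return the WARC files under crawl_dir, sorted by path."""
    return sorted(Path(crawl_dir).rglob("*.warc.gz"))


def s3_key_for(warc_path, crawl_dir, prefix="crawls"):
    """Build an object key that keeps the path relative to crawl_dir."""
    relative = Path(warc_path).relative_to(crawl_dir)
    return f"{prefix}/{relative.as_posix()}"


def upload_warcs(crawl_dir, bucket, prefix="crawls"):
    """Upload every WARC under crawl_dir to an S3 bucket (hypothetical names)."""
    # Third-party dependency; imported lazily so the helpers above
    # can be used and tested without it.
    import boto3

    s3 = boto3.client("s3")
    uploaded = []
    for warc in find_warcs(crawl_dir):
        key = s3_key_for(warc, crawl_dir, prefix)
        s3.upload_file(str(warc), bucket, key)
        uploaded.append(key)
    return uploaded
```

A weekly or monthly scheduler (cron, or whatever the hosting service provides) could call something like `upload_warcs("/data", "my-archive-bucket")` once a crawl has finished.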

BradCoffield commented 4 years ago

I've looked into it a bit more and I'm pretty sure that if it's possible it will require Docker.

raspher commented 3 years ago

Digging...

Currently working on a Dockerfile for grab-site. It builds successfully but needs a few enhancements and documentation (and, of course, testing).

I know some PRs exist, but they are old and seem to have been abandoned by their authors.

The Dockerfile should be reviewed at least once every few years (it builds from the latest commit, but pins both the Python version and the Alpine version). It's WIP for now :slightly_smiling_face:

brandongalbraith commented 3 years ago

I've used this docker container before successfully: https://hub.docker.com/r/slang800/grab-site/

raspher commented 3 years ago

@brandongalbraith yes, mine is loosely based on it

brandongalbraith commented 3 years ago

@raspher Looking forward to your updated Dockerfile 😄

raspher commented 3 years ago

Python 3.8 cannot be used: passing a URL fails with the error "TypeError: required field "posonlyargs" missing from arguments" (Python 3.8 only).
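
The error message is consistent with Python 3.8 adding a `posonlyargs` field to `ast.arguments`. A minimal sketch of a version guard that a wrapper script could run before starting a crawl follows. It assumes 3.8 is the first affected release, which is only what the report above suggests, and `FIRST_BROKEN` and `python_is_supported` are hypothetical names, not part of grab-site.

```python
import sys

# Assumption: 3.8 is the first release that fails, per the report above.
FIRST_BROKEN = (3, 8)


def python_is_supported(version_info=None):
    """Return True if the interpreter is older than the first failing release."""
    version = tuple((version_info or sys.version_info)[:2])
    return version < FIRST_BROKEN


if __name__ == "__main__":
    if not python_is_supported():
        sys.exit("This Python version may hit the posonlyargs error; try 3.7.")
```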

raspher commented 3 years ago

Beta version of the Dockerfile:

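# Python 3.7 is pinned because 3.8 fails with the "posonlyargs" TypeError.
FROM python:3.7-alpine3.12
WORKDIR /app
# Install build dependencies, build grab-site from the latest master commit,
# then remove the build-only packages and the pip cache to keep the image small.
RUN apk add --no-cache --update build-base libffi-dev libxml2-dev libxslt-dev re2-dev pkgconfig git libressl-dev musl-dev && \
    git clone --depth=1 --branch=master https://github.com/ArchiveTeam/grab-site.git && \
    cd grab-site && \
    pip3 install --upgrade pip setuptools && \
    pip3 install --no-binary lxml --upgrade ./ && \
    apk del --purge build-base libffi-dev pkgconfig git musl-dev && \
    rm -R /root/.cache
# Crawl output lives in /data; mount a host directory or volume here.
VOLUME ["/data"]
WORKDIR /data
# Port used by the gs-server dashboard.
EXPOSE 29000/tcp
CMD ["python", "/app/grab-site/gs-server"]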
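
For anyone who wants to try it, here is a small sketch that assembles the `docker build` and `docker run` command lines for this Dockerfile. The image tag `grab-site:beta` and the host directory `/srv/grab-site` are placeholders chosen for illustration. The container port 29000 and the `/data` mount point come from the Dockerfile above.

```python
import shlex


def docker_build_cmd(tag="grab-site:beta", context="."):
    """Command to build the image from the Dockerfile in `context`."""
    return ["docker", "build", "-t", tag, context]


def docker_run_cmd(tag="grab-site:beta", data_dir="/srv/grab-site", port=29000):
    """Command to run the server, publishing its port and mounting /data."""
    return [
        "docker", "run", "--rm",
        "-p", f"{port}:29000",     # host port -> port exposed by the image
        "-v", f"{data_dir}:/data",  # host directory -> VOLUME in the image
        tag,
    ]


if __name__ == "__main__":
    # Print the commands rather than executing them, so they can be reviewed.
    for cmd in (docker_build_cmd(), docker_run_cmd()):
        print(shlex.join(cmd))
```

Printing instead of executing keeps the sketch safe to run anywhere. Passing either list to `subprocess.run(cmd, check=True)` would execute it on a machine with Docker installed.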

BradCoffield commented 3 years ago

This is beautiful. Can't wait to try it all out!