edgi-govdata-archiving / web-monitoring-processing

Tools for access, "diff"-ing, and analyzing archived web pages
https://edgi-govdata-archiving.github.io/web-monitoring-processing
GNU General Public License v3.0

Import script should upload bodies directly to S3 #663

Open Mr0grog opened 4 years ago

Mr0grog commented 4 years ago

The Internet Archive import script(s) (wm import ia and wm import ia-known-pages) should have an option that causes them to upload Mementos to S3:

$ wm import ia 'http://www.epa.gov/' --from <time> --s3 <bucket_name_or_s3://_uri>

S3 credentials should be read from the standard AWS environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY).

Setting this should cause the importer to save memento bodies in S3 before sending import metadata to web-monitoring-db. The metadata's uri property should be rewritten to point to the uploaded location in S3. The objects in S3 should be named by their SHA-256 base-16-encoded hash and their Content-Type header should be set appropriately, as web-monitoring-db currently does: https://github.com/edgi-govdata-archiving/web-monitoring-db/blob/46561ae6eb52b0d923f7832100d161fc98667d0c/lib/archiver/archiver.rb#L36-L66
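A minimal sketch of that upload-and-rewrite step, not a finished implementation. The call to put_object uses the keyword arguments of boto3's S3 client, which by default picks up AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment. The function names, the shape of the record dict beyond its uri property, and the https://<bucket>.s3.amazonaws.com/<key> URL layout are assumptions made for illustration.

```python
import hashlib


def upload_body(s3_client, bucket, body, content_type):
    """Store a memento body in S3 under its SHA-256 hex digest; return its URL.

    `s3_client` is expected to behave like `boto3.client('s3')`.
    """
    key = hashlib.sha256(body).hexdigest()
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType=content_type,
    )
    # Assumed URL layout; the real one depends on how the bucket is set up.
    return f'https://{bucket}.s3.amazonaws.com/{key}'


def rewrite_uri(record, s3_client, bucket, body, content_type):
    """Return a copy of an import record whose `uri` points at the S3 copy."""
    new_uri = upload_body(s3_client, bucket, body, content_type)
    return {**record, 'uri': new_uri}
```

Because the object key is the content hash, uploading the same body twice overwrites the object with identical bytes, so re-running an overlapping import should be harmless.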

The value of the --s3 option should be one of:
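Going only by the <bucket_name_or_s3://_uri> placeholder in the example command, the option could be parsed roughly as below. This is a sketch under assumptions: the function name is hypothetical, and treating the path of an s3:// URI as a key prefix is a guess rather than something specified here.

```python
from urllib.parse import urlparse


def parse_s3_option(value):
    """Split an --s3 value into (bucket, key_prefix).

    Accepts either a bare bucket name or an s3:// URI. Using the URI
    path as a key prefix is an assumption for this sketch.
    """
    if value.startswith('s3://'):
        parsed = urlparse(value)
        bucket, prefix = parsed.netloc, parsed.path.lstrip('/')
    else:
        bucket, prefix = value, ''
    if not bucket:
        raise ValueError(f'No bucket name found in --s3 value: {value!r}')
    return bucket, prefix
```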

Why?

When we use the scripts here to import data from the Wayback Machine, we process the mementos from Wayback and then send the metadata to the DB’s import API. The API doesn’t actually accept the raw data of the response bodies (it’s complicated to do in a safe and effective way, and, although there is a placeholder for it in the code, we never made it happen).

The flow is something like this:

                    ┌─────────────────────┐
                    │                     │
                ┌──▶│   Wayback Machine   │◀─────────────────┐
                │   │                     │                  │
                │   └─────────────────────┘                  │
                │           ▲                                │
                └─────┐     │                                │
 ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─│─ ┐  │       ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐  │
   Import Script      │     │         Web-Monitoring-DB      │
 │                    │  │  │       │                     │  │
     ┌──────────────┐ │     │          ┌───────────────┐     │
 │   │              │ │  │  │       │  │    Process    │  │  │
     │  Search CDX  │─┘     │   ╔═════▶│   Metadata    │     │
 │   │              │    │  │   ║   │  │               │  │  │
     └──────────────┘       │   ║      └───────────────┘     │
 │           ║           │  │   ║   │          ║          │  │
             ▼              │   ║              ▼             │
 │   ┌───────────────┐   │  │   ║   │  ┌───────────────┐  │  │
     │               │      │   ║      │ Load Memento  │     │
 │   │ Load Mementos │───┼──┘   ║   │  │     Data      │──┼──┘          ┌───────────────┐
     │               │          ║      │               │       Metadata │               │
 │   └───────────────┘   │      ║   │  └───────────────┘  │    ╔═══════▶│  PostgreSQL   │
             ║                  ║              ║               ║        │               │
 │           ▼           │      ║   │          ▼          │    ║        └───────────────┘
     ┌───────────────┐          ║      ┌───────────────┐       ║
 │   │   Format as   │   │      ║   │  │               │  │    ║        ┌───────────────┐
     │  Importable   │══════════╝      │     Save      │═══════╣        │               │
 │   │   Metadata    │   │          │  │               │  │    ╚═══════▶│      S3       │
     └───────────────┘                 └───────────────┘       Memento  │               │
 └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘          └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘      Data   └───────────────┘

You can see here that both the import script and web-monitoring-db have to get data from Wayback. The problem is that when the Wayback Machine loads slowly (either because it’s under heavy load or because we’re requesting a memento that is rarely accessed), the request might fail on the web-monitoring-db side, and the record then fails to save. This typically happens in about 1-2 of every 2000 mementos. Besides the occasional failure, this double-loading is also a waste of bandwidth and resources for both us and the archive!

(Note: The failures aren’t a serious problem because we typically grab overlapping sets of data from the Wayback Machine each time we run the script. The likelihood of a memento failing this way across multiple imports is pretty low. We do the overlap to work around the fact that Wayback has frequent indexing issues that sometimes cause mementos to be unfindable until several days after they were archived. The overlap period is longer than such outages typically last.)
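As a back-of-the-envelope check on that note, here is the arithmetic for the worse end of the observed rate (2 in 2000). It assumes failures in separate import runs are independent, which is only an approximation since a slow memento may well be slow again.

```python
from fractions import Fraction

# Worse end of the observed failure rate: 2 of every 2000 mementos.
per_import_failure = Fraction(2, 2000)


def chance_of_repeated_failure(overlapping_imports):
    """Chance that one memento fails in every one of N overlapping imports,
    assuming (as a simplification) that the runs fail independently."""
    return per_import_failure ** overlapping_imports


# With two overlapping imports this is about one memento in a million.
two_runs = chance_of_repeated_failure(2)
```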

We can work around this by doing what the old Versionista import script used to do: upload to S3 ourselves, before sending the metadata to web-monitoring-db. Web-monitoring-db will automatically skip loading mementos if the location it’s given is in an S3 bucket it already knows is OK.
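To illustrate the kind of check that makes this work, here is a rough Python sketch of the idea. It is not web-monitoring-db's actual code (that lives in the Ruby archiver linked above); the bucket name, the function name, and the assumption that stored bodies are addressed as https://<bucket>.s3.amazonaws.com/<key> are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical set of buckets the DB already trusts.
KNOWN_BUCKETS = {'example-archive-bucket'}
S3_HOST_SUFFIX = '.s3.amazonaws.com'


def needs_fetch(uri):
    """Return False when `uri` already points into a known S3 bucket,
    meaning the body does not have to be loaded from Wayback again."""
    host = urlparse(uri).hostname or ''
    if host.endswith(S3_HOST_SUFFIX):
        bucket = host[:-len(S3_HOST_SUFFIX)]
        return bucket not in KNOWN_BUCKETS
    return True
```

Anything that is not in a trusted bucket, including a Wayback URL or an unfamiliar bucket, still goes through the normal load-and-save path.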

Basically, we want the workflow to be more like:

                    ┌─────────────────────┐
                    │                     │
                ┌──▶│   Wayback Machine   │
                │   │                     │
                │   └─────────────────────┘
                │           ▲
                └─────┐     │
 ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─│─ ┐  │       ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
  Import Script       │     │        Web-Monitoring-DB
 │                    │  │  │       │                     │
     ┌──────────────┐ │     │          ┌───────────────┐
 │   │              │ │  │  │       │  │    Process    │  │
     │  Search CDX  │─┘     │   ╔═════▶│   Metadata    │
 │   │              │    │  │   ║   │  │               │  │
     └──────────────┘       │   ║      └───────────────┘
 │           ║           │  │   ║   │          ║          │
             ▼              │   ║              ║
 │   ┌───────────────┐   │  │   ║   │          ║          │
     │               │      │   ║              ║
 │   │ Load Mementos │───┼──┘   ║   │          ║          │
     │               │══════╗   ║              ║
 │   └───────────────┘   │  ║   ║   │          ║          │
             ║              ║   ║              ║
 │           ▼           │  ║   ║   │          ▼          │
     ┌───────────────┐      ║   ║      ┌───────────────┐
 │   │   Format as   │   │  ║   ║   │  │               │  │
     │  Importable   │══════╬═══╝      │     Save      │════╗
 │   │   Metadata    │   │  ║       │  │               │  │ ║
     └───────────────┘      ║          └───────────────┘    ║
 └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘  ║       └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘ ║
                            ║                               ║
                            ▼                               ║
                    ┌───────────────┐  ┌───────────────┐    ║
                    │               │  │               │    ║
                    │      S3       │  │  PostgreSQL   │◀═══╝
                    │               │  │               │
                    └───────────────┘  └───────────────┘

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.