ArchiveTeam / ArchiveBot

ArchiveBot, an IRC bot for archiving websites
http://www.archiveteam.org/index.php?title=ArchiveBot
MIT License

!archiveonly many URLs #14

Closed: hannahwhy closed this issue 10 years ago

hannahwhy commented 11 years ago
[22:02:11] <ivan`> feature request: submit a few hundred !archiveonly URLs via a URL to a text file listing said URLs

Maybe something like this?

!ao < http://www.example.com/urls.txt
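
To make the proposal concrete, here is a minimal sketch of how such a command could be parsed. The command set, the "<" manifest marker handling, and parse_command itself are illustrative assumptions, not ArchiveBot's actual parser.

import re

# Hypothetical parser for the proposed "!a < URL" / "!ao < URL" syntax;
# ArchiveBot's real command handling may differ.
COMMAND_RE = re.compile(
    r'^!(?P<cmd>archiveonly|archive|ao|a)\s+(?P<redirect><\s*)?(?P<url>\S+)\s*$')

def parse_command(line):
    match = COMMAND_RE.match(line.strip())
    if match is None:
        return None
    return {
        'recursive': match.group('cmd') in ('a', 'archive'),
        # "< URL" means the URL points at a text file listing URLs to fetch.
        'url_is_manifest': match.group('redirect') is not None,
        'url': match.group('url'),
    }

# parse_command('!ao < http://www.example.com/urls.txt')
# => {'recursive': False, 'url_is_manifest': True,
#     'url': 'http://www.example.com/urls.txt'}
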
ivan commented 11 years ago

Cool, I like that syntax.

hannahwhy commented 11 years ago

How I think this should work:

The metajob

!ao < URL (or !a < URL) creates a metajob:

<me> !ao < http://www.example.com/urls.txt
<ArchiveBot> me: Archiving URLs in http://www.example.com/urls.txt without recursion.
<ArchiveBot> me: Use !status m33w44jv1vtrzxuk2ttyq3uwmd for updates, !abort m33w44jv1vtrzxuk2ttyq3uwmd to abort.

A metajob is queued up like any other job:

<me> !status m33w44jv1vtrzxuk2ttyq3uwmd
<ArchiveBot> me: In progress.  Fetching URL manifest.
<me> !status m33w44jv1vtrzxuk2ttyq3uwmd
<ArchiveBot> me: In progress.  0/1500 URLs downloaded.

Each URL in a metajob turns into an ArchiveBot job. (This implies one WARC per URL.) A job may be queued, skipped (e.g. if it's been archived within the past 48 hours), completed, in progress, or (when this is implemented for jobs in general) timed out.
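
A rough sketch of the data model this implies; every name here (JobState, Job, Metajob, status_line) is illustrative rather than taken from ArchiveBot's code.

from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class JobState(Enum):
    QUEUED = 'queued'
    IN_PROGRESS = 'in_progress'
    COMPLETED = 'completed'
    SKIPPED = 'skipped'        # e.g. archived within the past 48 hours
    TIMED_OUT = 'timed_out'    # once timeouts exist for jobs in general

@dataclass
class Job:
    url: str
    state: JobState = JobState.QUEUED
    warc_path: Optional[str] = None    # one WARC per URL

@dataclass
class Metajob:
    ident: str                          # e.g. m33w44jv1vtrzxuk2ttyq3uwmd
    manifest_url: str                   # the urls.txt given to "!ao <"
    jobs: List[Job] = field(default_factory=list)

    def status_line(self) -> str:
        done = sum(1 for j in self.jobs if j.state is JobState.COMPLETED)
        return '{}/{} URLs downloaded'.format(done, len(self.jobs))
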

Metajob queuing

Metajobs are handled by a separate worker pool. Jobs in a metajob are queued until all workers are occupied. At that point, we only queue job j+1 after j finishes.

This provides fair queuing behavior. An example:

<me> !ao < http://www.example.com/urls.txt
<someone_else> !a http://www.example.org/

Let's say that http://www.example.com/urls.txt contained 1500 URLs. If we immediately queued them, someone_else's job would be at the end of the 1500-URL queue, which could very easily take days to complete. The above queuing behavior makes it possible for the jobs of the metajob and someone_else's job to run in parallel.

However, if there are multiple unoccupied workers, this lets the metajob expand to fill spare capacity.
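
A sketch of that queuing rule, assuming a fixed worker pool and a single shared job queue; the function and parameter names are hypothetical. The controller would call something like this whenever a worker frees up or when the manifest is first loaded.

def dispatch_metajob(pending_urls, shared_queue, idle_workers, metajob_queued):
    """Decide how many metajob URLs to hand to the shared queue right now.

    pending_urls:   deque of manifest URLs that have not been queued yet
    shared_queue:   the queue every job (metajob or not) is pulled from
    idle_workers:   number of workers currently doing nothing
    metajob_queued: how many of this metajob's jobs already sit in shared_queue
    """
    if idle_workers > 0:
        # Spare capacity: let the metajob expand to fill it.
        budget = idle_workers
    else:
        # Every worker is busy: keep at most one metajob job waiting, so
        # someone_else's "!a http://www.example.org/" lands behind a single
        # job rather than behind all 1500 manifest URLs.
        budget = max(1 - metajob_queued, 0)

    for _ in range(min(budget, len(pending_urls))):
        shared_queue.append(pending_urls.popleft())
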

Metajob failure modes

The worker handling the metajob could crash (OOM, segfault, all the other crap real computers are subject to). The worker should be able to pick the metajob up where it left off rather than starting it over from the beginning. Tracking progress per URL like this also means that a particularly troublesome URL could be purged from the metajob.

The worker handling the metajob could be intentionally aborted. In this case, we abort the current job and do not execute any more of the metajob's remaining jobs.
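
One way to get both behaviors is to checkpoint per-URL status outside the worker process, so a restarted worker resumes from the record and an abort simply stops handing out further URLs. A minimal sketch, with a plain JSON file standing in for whatever persistent store ArchiveBot actually uses; archive_one and aborted are hypothetical hooks.

import json
import os

# Hypothetical on-disk checkpoint location.
STATE_FILE = 'metajob-m33w44jv1vtrzxuk2ttyq3uwmd.json'

def load_state(urls):
    # First run: every URL is pending.  After a crash: reload what was
    # already recorded and continue from there instead of starting over.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {url: 'pending' for url in urls}

def save_state(state):
    with open(STATE_FILE, 'w') as f:
        json.dump(state, f)

def run_metajob(urls, archive_one, aborted=lambda: False):
    state = load_state(urls)
    for url, status in list(state.items()):
        if aborted():
            break                        # !abort: run no more of the metajob's jobs
        if status in ('done', 'purged'):
            continue                     # handled before the crash/restart
        try:
            archive_one(url)             # submit one ArchiveBot job for this URL
            state[url] = 'done'
        except Exception:
            state[url] = 'purged'        # troublesome URL dropped from the metajob
        save_state(state)                # checkpoint after every URL
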

Metajob monitoring

A metajob should have its own status page in the dashboard, e.g. http://archivebot.example.org/#/metajobs/m33w44jv1vtrzxuk2ttyq3uwmd. This dashboard will list the URLs in the metajob, their status, and links to their WARCs (if present). This makes it easier to retrieve all WARCs in a metajob.

(Such a feature could also be used as the basis for automatic megawarc generation.)
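
Such a page could be driven by a small per-metajob status document along these lines (purely illustrative; not an actual ArchiveBot endpoint or schema). Keeping the per-URL WARC link in one place is what makes bulk retrieval, and potentially megawarc generation, straightforward.

# Hypothetical payload backing
# http://archivebot.example.org/#/metajobs/m33w44jv1vtrzxuk2ttyq3uwmd
metajob_status = {
    'ident': 'm33w44jv1vtrzxuk2ttyq3uwmd',
    'manifest_url': 'http://www.example.com/urls.txt',
    'urls': [
        {'url': 'http://www.example.com/a',
         'state': 'completed',
         'warc': 'http://archivebot.example.org/warcs/example-a.warc.gz'},
        {'url': 'http://www.example.com/b',
         'state': 'queued',
         'warc': None},
    ],
}
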

@ivan: Your thoughts?

ivan commented 11 years ago

I think the original feature can be implemented by just passing all of the URLs into one wget instance. One real advantage is that the HTTP connection is reused rather than re-established for every request. Also, with one wget, the queuing behavior is already what one expects, and there are fewer WARCs to deal with (particularly in the case of passing ~100,000 URLs).

If there are other advantages of metajobs, or you plan to move away from wget, I guess metajobs are worth implementing. It's up to you; I'm not familiar with your code or plans.
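
For comparison, the single-wget approach could be driven roughly as follows. This is only a sketch: --input-file and --warc-file are standard wget options, but ArchiveBot's real invocation passes many more flags, and archive_url_list is a hypothetical wrapper.

import subprocess

def archive_url_list(list_path, warc_name):
    # One wget process fetches every URL in the manifest, reusing HTTP
    # connections to the same host where possible, and writes the whole
    # run into a single WARC (warc_name.warc.gz).  wget exits nonzero if
    # any URL fails, so the return code alone doesn't mean total failure.
    return subprocess.run([
        'wget',
        '--input-file', list_path,    # text file with one URL per line
        '--warc-file', warc_name,     # standard wget WARC output option
        '--no-verbose',
        '--tries', '3',
        '--timeout', '30',
    ])
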

hannahwhy commented 11 years ago

> I think the original feature can be implemented by just passing all of the URLs into one wget instance. One real advantage is that the HTTP connection is reused rather than re-established for every request. Also, with one wget, the queuing behavior is already what one expects, and there are fewer WARCs to deal with (particularly in the case of passing ~100,000 URLs).

I'd like that too. There's one sticking point about it, though, which is that it clashes with ArchiveBot's "an ident is one URL" model.

The clash is not too great, but it has one odd effect: if you create a job from a URL list, the URLs in the list should be subject to the same cooldown time as any other URL, but they won't be. Such jobs will also require slightly different history record structures.

This isn't a huge deal, but it made me wonder whether or not the existing model could be extended without having to modify those subsystems.

All that said...

Connection reuse (and being as good a netizen as we can be) is a very good advantage, and is one that I think will sway me towards the "one wget instance" approach, even if it requires some redesign.

chfoo commented 10 years ago

Maybe it would be better to restrict what !archiveonly accepts so the infrastructure doesn't have to change a lot. The burden of metajobs could just fall on the user making proper lists.

hannahwhy commented 10 years ago

The metajob thing was driven by a desire to ram !ao < FILE into the existing architecture, which I now realize isn't necessary.

There's been some work towards getting this to work in the ao-many branch, but it's gone a bit stale. I'll work on bringing it up to date.

hannahwhy commented 10 years ago

This landed in master and has been successfully used once, so I'm closing this issue.

I'm sure there'll be errors. Open more issues :smiling_imp: