WASAPI-Community / data-transfer-apis

WASAPI data transfer APIs

Suggestions for /webdata endpoint, support for 'open' warcs #3

Open ikreymer opened 7 years ago

ikreymer commented 7 years ago

I wanted to offer some thoughts on the /webdata endpoint in general and some possible areas of improvement for supporting other services, such as Webrecorder.

One issue I see is how long the file list returned from /webdata should be considered valid, and what happens if files change between a request to /webdata and a request to retrieve a WARC file.

It might be useful to include at least a timestamp in the response to indicate when the /webdata listing was retrieved.

One possible workaround is to also include a validUntil timestamp that guarantees the listed WARCs are available through that time, and that after this time, a user should not trust the /webdata listing. For example, if a user did not retrieve all the WARCs by that time, they should query /webdata again to get an updated listing.

However, this may not be possible to guarantee in a general crawler-based system. For example, a new WARC could be added to the collection a few seconds after /webdata was called, making the file listing out of date anyway.
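A minimal client-side sketch of the validUntil idea, assuming the timestamp is serialized as an ISO 8601 string with an explicit UTC offset (the format is not specified anywhere in this thread). The function name `listing_is_stale` is hypothetical.

```python
from datetime import datetime, timezone


def listing_is_stale(listing, now=None):
    """Return True if the /webdata listing should be fetched again.

    Assumption: "validUntil" is an ISO 8601 string with a UTC offset,
    e.g. "2016-08-30T13:00:00+00:00".
    """
    now = now or datetime.now(timezone.utc)
    valid_until = listing.get("validUntil")
    if valid_until is None:
        # No guarantee was given, so treat the listing as already stale.
        return True
    return now >= datetime.fromisoformat(valid_until)
```

A client would call this before each WARC download and query /webdata again whenever it returns True.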

Another idea is to have an 'open' WARC type, something like:

    "files": [
        {
            "content-type": "application/warc",
            "filename": "2016-08-30-blah.warc.gz",
            "type": "open",
            "lastModified": "...",
            "size": 2000,
            "locations": [
                "http://webrecorder.io/api/wasapi/v0/...blah.warc.gz"
            ]
        }
    ]

By adding "type": "open", the system indicates that this WARC is still being written to and may change between the time of the /webdata call and the time it is retrieved. Since the WARC may be changing, the checksum is not included here, but the size is, and the downloaded file should be at least the specified size. A lastModified field is included to indicate when this WARC file was last updated (this could be useful to add to all WARC files).
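The size rule for open WARCs could be checked on the client roughly as follows. This is only a sketch: the function name `size_is_plausible` is hypothetical, and it assumes that any entry without "type": "open" is closed and therefore has a fixed size.

```python
def size_is_plausible(entry, downloaded_size):
    """Compare a downloaded file's size with its /webdata entry."""
    listed = entry["size"]
    if entry.get("type") == "open":
        # Still being written: the file may have grown since the listing,
        # but it should never be smaller than the listed size.
        return downloaded_size >= listed
    # Assumption: entries not marked open are closed, so sizes must match.
    return downloaded_size == listed
```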

This would address the Webrecorder use case where users may be actively recording when the /webdata call is made, and therefore the exact size and checksum may change between this query and the actual download. This would be useful to any system that allows live updating of the archive.

A more difficult issue is how to deal with systems, such as Webrecorder, which are not simply additive but allow users to delete or modify collections. For example, in Webrecorder, a user could delete a recording (specified by one or more WARCs) within a collection. In such a case, I suppose the WARC download should return 404 immediately.
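From the client's point of view, treating a 404 as "this WARC was deleted after the listing was produced" might look like the sketch below. The function name `fetch_warc` and the injectable `opener` parameter are illustrative assumptions, not part of any proposed API.

```python
import urllib.error
import urllib.request


def fetch_warc(url, opener=urllib.request.urlopen):
    """Return the WARC bytes, or None if the server answers 404.

    A 404 is interpreted as "deleted since /webdata was queried".
    Any other HTTP error is re-raised for the caller to handle.
    """
    try:
        with opener(url) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise
```

A caller that gets None back can drop the entry or query /webdata again for a fresh listing.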

Alternatively, if the validUntil timestamp is used, the API could "freeze" the listed WARCs until the expiration time, allowing API users to download the WARCs exactly as they were when the listing was produced (this may be a bit more complex).

Ideally, the simplest approach would be taken, which is probably to allow some form of open WARCs and handle deletion as a 404.

ikreymer commented 7 years ago

To put it more succinctly, I think there are two main options that could be implemented:

OR

It would be useful to have a lastModified field for all WARCs regardless.

nlevitt commented 7 years ago

I expressed my thoughts to Ilya on iipc slack:

my inclination is to keep it simple
i would advocate /webdata return truth at the time of the query
no guarantee that the files won't be deleted before you try to download them
and i'm not sure i see the need to support .open files in this api
if you have a use case for that, maybe describe it on the issue?

I'm not opposed to lastModified.

ikreymer commented 7 years ago

The use case for having open files is that Webrecorder generally keeps files open until they have been idle for some period of time (an internal setting), and a user may add to any collection or recording at any time. Since the replay is available immediately, the download should also reflect what the user can access.

The issue is mostly with specifying the checksum in the /webdata response, because it can change between the time the WARC was listed and when the user starts downloading it.

I guess the simplest solution is just making the checksum optional, and maybe indicating that the WARC is 'open' (in the process of being written to). I think this would solve this issue without adding extra complexity (like taking snapshots).
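With an optional checksum, client-side verification might look like this sketch. It assumes a hypothetical `checksums` field that maps a hash algorithm name to a hex digest; the thread does not specify how checksums are represented, so treat that shape as a placeholder.

```python
import hashlib


def checksum_matches(entry, data):
    """Verify downloaded bytes against an entry's checksums, if any.

    Returns True or False when a checksum can be checked, and None when
    the entry is marked open or lists no checksum at all.
    Assumption: "checksums" maps algorithm name -> hex digest.
    """
    checksums = entry.get("checksums")
    if entry.get("type") == "open" or not checksums:
        # Nothing reliable to compare against, so verification is skipped.
        return None
    for algorithm, expected in checksums.items():
        if hashlib.new(algorithm, data).hexdigest() != expected.lower():
            return False
    return True
```

Returning None rather than True for open WARCs keeps "not verified" distinct from "verified", so a client can decide whether to retry once the file is closed.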