Open ikreymer opened 7 years ago
To put it more succinctly, I think there are two main options that could be implemented:
/webdata and the download call.
OR
validUntil timestamp, where any open WARC could be cached at the time of the query and kept around through the validUntil timestamp. This is a bit more heavy duty to implement, and could result in a user not getting the latest version of a WARC, but could still support a checksum for every WARC.
It would be useful to have a lastModified field for all WARCs regardless.
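
To picture the validUntil option from the client side, here is a minimal sketch. It assumes the /webdata response carries validUntil as an ISO 8601 string with an explicit UTC offset; the timestamp format, the sample values, and the helper name listing_expired are illustrative assumptions, not part of any agreed API.

```python
from datetime import datetime, timezone


def listing_expired(listing: dict, now: datetime) -> bool:
    """Return True once the listing's validUntil time has passed.

    After that point the client should query /webdata again instead of
    trusting the file list it already holds.
    """
    valid_until = datetime.fromisoformat(listing["validUntil"])
    return now >= valid_until


# Hypothetical response fragment; the values are placeholders.
listing = {
    "timestamp": "2017-06-01T12:00:00+00:00",
    "validUntil": "2017-06-01T13:00:00+00:00",
}

during_window = datetime(2017, 6, 1, 12, 30, tzinfo=timezone.utc)
after_window = datetime(2017, 6, 1, 13, 30, tzinfo=timezone.utc)
```

A client would call listing_expired before each download and re-query once it returns True.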
I expressed my thoughts to Ilya on iipc slack:
my inclination is to keep it simple
i would advocate /webdata return truth at the time of the query
no guarantee that the files won't be deleted before you try to download them
and i'm not sure i see the need to support .open files in this api
if you have a use case for that, maybe describe it on the issue?
I'm not opposed to lastModified.
The use case for having open files is that Webrecorder generally keeps files open until they have been idle for some period of time (an internal setting), and a user may add to any collection or recording at any time. Since the replay is available immediately, the download should also reflect what the user can access.
The issue is mostly with specifying the checksum in the /webdata response, because it can change between the time it was listed and when the user starts downloading the WARC.
I guess the simplest solution is just making the checksum optional, and maybe indicating that the WARC is 'open' (in the process of being written to). I think this would solve this issue without adding extra complexity (like taking snapshots).
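
To illustrate what an optional checksum could mean for a client, here is a rough sketch. It assumes a listing entry is a dict whose checksum, when present, is a hex-encoded SHA-1 digest; the digest algorithm, the encoding, and the helper name verify_download are assumptions made for the example only.

```python
import hashlib
from typing import Optional


def verify_download(entry: dict, data: bytes) -> Optional[bool]:
    """Check downloaded bytes against a listing entry.

    Returns True or False when the entry lists a checksum, and None when
    it does not (for example an 'open' WARC), meaning the download cannot
    be verified this way.
    """
    expected = entry.get("checksum")
    if expected is None:
        return None
    # Assumption: the checksum is a hex-encoded SHA-1 digest.
    return hashlib.sha1(data).hexdigest() == expected


payload = b"example WARC bytes"  # placeholder content
closed_entry = {"checksum": hashlib.sha1(payload).hexdigest()}
open_entry = {"type": "open"}  # no checksum listed
```

With this shape, a client skips verification for open WARCs rather than reporting a false mismatch.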
I wanted to offer some thoughts on the /webdata endpoint in general and some possible areas of improvement for supporting other services, such as Webrecorder.
One issue that I see is how long the file list returned from /webdata should be considered valid, and what happens if files change between a request to /webdata and a request to retrieve the WARC file.
It might be useful to include at least a timestamp in the response to indicate at what time the /webdata listing was retrieved.
One possible workaround is to also include a validUntil timestamp that should guarantee that the WARCs listed are available through that time, and that after this time, a user should not trust the /webdata listing. For example, if a user did not retrieve all the WARCs by that time, they should query /webdata again to get a more updated listing.
However, this may not be possible to guarantee in a general crawler-based system: for example, a new WARC could be added to the collection a few seconds after /webdata was called, making the file listing out of date anyway.
Another idea is to have an 'open' WARC type, something like:
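
A sketch of how such an entry might look, written as a Python dict and serialized to JSON. Only "type": "open", size, lastModified, and the absence of a checksum come from this proposal; the filename, the values, and the timestamp format are placeholders chosen for illustration.

```python
import json

# Hypothetical listing entry for a WARC that is still being written to.
open_warc_entry = {
    "filename": "example-recording.warc.gz",  # placeholder name
    "type": "open",                           # still being written
    "size": 1048576,                          # size at query time (placeholder)
    "lastModified": "2017-06-01T12:00:00Z",   # placeholder; format is assumed
    # No "checksum" key: it could change before the download starts.
}

print(json.dumps(open_warc_entry, indent=2))
```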
By adding "type": "open", the system indicates that this WARC is still being written to, and may change between the time of the /webdata call and the time it is retrieved. Since the WARC may be changing, the checksum is not included here, but the size is, and the size should be at least the specified size when downloaded. A lastModified field is included to indicate when this WARC file was last updated (this could be useful to add to all WARC files).
This will address the Webrecorder use case where users may be actively recording when the /webdata call is made, and therefore the exact size and checksum may change between this query and the actual download. This would be useful to any system that allows live updating of the archive.
A more difficult issue is how to deal with systems, such as Webrecorder, which are not simply additive but allow users to delete or modify collections. For example, in Webrecorder, a user could delete a recording (specified by one or more WARCs) within a collection. In such a case, I suppose the WARC download should return 404 immediately.
Alternatively, if the validUntil timestamp is used, the API could "freeze" the particular WARCs until the expiration time, allowing the API users to download the WARCs exactly as they were at that time (this may be a bit more complex).
Ideally, the simplest approach would be taken, which is probably to allow some form of open WARCs and handle deletion as a 404.
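
The simplest approach can be sketched from the client's point of view. The function below is hypothetical: it assumes entries shaped as described in this issue, treats a 404 as "deleted since the listing", and applies the "at least the listed size" rule to open WARCs. Requiring an exact size match for closed WARCs, and the result labels themselves, are assumptions added for the example.

```python
def check_download(entry: dict, status: int, downloaded_size: int) -> str:
    """Classify the outcome of downloading one WARC from a /webdata listing."""
    if status == 404:
        # The WARC was deleted after the listing was produced.
        return "deleted"
    if status != 200:
        return "error"
    if entry.get("type") == "open":
        # An open WARC may have grown, but should not be smaller than listed.
        return "ok" if downloaded_size >= entry["size"] else "truncated"
    # Assumption: a closed WARC should match the listed size exactly.
    return "ok" if downloaded_size == entry["size"] else "size-mismatch"


open_entry = {"type": "open", "size": 1000}  # placeholder sizes
closed_entry = {"size": 1000}
```

For example, an open WARC listed at 1000 bytes that downloads as 1500 bytes counts as "ok", while any 404 is reported as "deleted" so the client can drop the file or query /webdata again.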