N0taN3rd / wail

:whale2: One-Click User Instigated Preservation
http://matkelly.com/wail
GNU General Public License v3.0

Feature Request: Add "upload WARC file to archive.org" feature #88

Open kanihal opened 6 years ago

kanihal commented 6 years ago

Add an "upload WARC file to archive.org" feature to the Electron version of WAIL (it may need the secret token from an archive.org user account, which is required for bulk upload) so that the Wayback Machine (web.archive.org) can index the snapshot of the site.

machawk1 commented 6 years ago

@kanihal As far as I know, even if a WARC file is uploaded to archive.org, it won't be ingested by the globally accessible Wayback Machine at archive.org unless it comes from a "privileged" account like Archive Team's. The feature you mentioned could still be accomplished, i.e., the WARC generated by WAIL could be uploaded to archive.org, but the contents held within the WARC will not be replayable through the expected means.
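
For illustration only, here is a minimal sketch of what the upload half of the request could look like, assuming archive.org's S3-like (IAS3) interface and an access/secret key pair from the user's account. The function name `build_warc_upload`, the item identifier, and the file name are hypothetical and not part of WAIL. Consistent with the caveat above, this would only store the file in an item; it would not make the captures replayable in the Wayback Machine.

```python
import urllib.request

# Assumed endpoint of archive.org's S3-like upload interface (IAS3).
IAS3_ENDPOINT = "https://s3.us.archive.org"


def build_warc_upload(identifier, warc_name, warc_bytes, access_key, secret_key):
    """Build (but do not send) a PUT request that uploads one WARC file.

    identifier  -- hypothetical archive.org item name to upload into
    warc_name   -- file name the WARC should have inside the item
    warc_bytes  -- raw contents of the WARC file
    access_key, secret_key -- the uploading account's IAS3 keys
    """
    url = f"{IAS3_ENDPOINT}/{identifier}/{warc_name}"
    request = urllib.request.Request(url, data=warc_bytes, method="PUT")
    # IAS3 expects the key pair in a "LOW access:secret" authorization header.
    request.add_header("Authorization", f"LOW {access_key}:{secret_key}")
    # Ask the service to create the item if it does not exist yet.
    request.add_header("x-amz-auto-make-bucket", "1")
    return request
```

Sending the request would be a separate step, e.g. `urllib.request.urlopen(build_warc_upload(...))`, once real keys and an item name are supplied.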

kanihal commented 6 years ago

From Archive Team FAQ - http://archiveteam.org/index.php?title=Frequently_Asked_Questions

To ensure content integrity, items with WARC files must have the mediatype set to “web” 
and be under the Archive Team collection in order for it to be ingested by the Wayback Machine.

@machawk1

  1. How do you get a privileged account from archive.org? Is it even possible now?
  2. Do we need to send a request to Archive Team to get my WARC into their ArchiveTeam collection?
  3. I find that there is an option to make the Wayback Machine save your content if you append the URL that you want to save to http://web.archive.org/save/

    wget http://web.archive.org/save/<url>

    Do bulk 'save' requests for all the URLs that we get from crawling work? Don't they have some limit per IP or something?
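
A minimal sketch of what such bulk submission might look like, assuming the save endpoint simply takes the target URL appended to it as in the `wget` line above. The helper names and the 10-second pause are hypothetical; the actual per-IP limit, if any, is not known here.

```python
import time
import urllib.request

SAVE_ENDPOINT = "http://web.archive.org/save/"


def save_url(target):
    """Return the Save Page Now address for one target URL."""
    return SAVE_ENDPOINT + target


def submit_all(targets, delay_seconds=10.0, opener=urllib.request.urlopen):
    """Request a capture for each URL in turn, pausing between requests.

    delay_seconds is an assumed politeness delay, not a documented limit.
    Returns a dict mapping each target to an HTTP status or an error string.
    """
    results = {}
    for index, target in enumerate(targets):
        if index > 0:
            time.sleep(delay_seconds)  # space requests out to avoid hammering the service
        try:
            with opener(save_url(target)) as response:
                results[target] = response.status
        except OSError as error:  # URLError and HTTPError are OSError subclasses
            results[target] = str(error)
    return results
```

Calling `submit_all(urls_from_crawl)` would then walk the crawl's URL list one request at a time; whether archive.org tolerates this at crawl scale is exactly the open question.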

machawk1 commented 6 years ago

@kanihal I think this privileged access is just that, i.e., limited to those with the credentials or to Archive Team tools like Warrior. If anyone were able to upload a WARC for ingestion by Wayback, the content might have been manipulated beforehand or might lack some other form of integrity, so I have my doubts as to whether they take external WARC contributions without some form of vetting.

Sending a URI is different from sending a WARC, particularly with the capabilities of the preservation tool contained within the Electron version of WAIL. Further, submitting a URI on archive.org does not give a user access to the generated WARC file. I believe archive.org's "Save Page Now" is meant more for one-offs and not bulk preservation.

archive.org's features are outside of the scope of WAIL.