Rhizome-Conifer / conifer

Collect and revisit web pages.
https://conifer.rhizome.org
Apache License 2.0

(Semi)-Automatic capture with web recorder #304

Open fbuchinger opened 7 years ago

fbuchinger commented 7 years ago

We have a list of approx. 1200 web.archive.org URLs that we'd like to capture to a WARC using Webrecorder. Is there also a semi-automatic way to do this (i.e. some kind of API that lets you submit a list of URLs for capture, and Webrecorder processes them one after another)?

ikreymer commented 7 years ago

This is not yet possible, but it is a use case we are considering. Actually, it seems like there are two parts to your request:

1) Extracting already archived content from an existing archive (web.archive.org)
2) Automated recording of a list of URLs.

Am I understanding this correctly?

fbuchinger commented 7 years ago

yes, that's correct. web.archive.org has huge latencies, so I'd prefer to save the 1200 snapshot URLs into a local WARC. It would be nice if Webrecorder were a bit smarter about archiving this content (e.g. the web.archive.org URL prefix could be stripped, since we are re-archiving content, the snapshot URLs could be generated automatically, ...).
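The prefix-stripping idea mentioned here can be sketched in a few lines; this is a hypothetical helper (not part of Webrecorder), and the timestamp-modifier handling (`id_` etc.) is an assumption based on common Wayback Machine URL forms:

```javascript
// Strip the web.archive.org replay prefix from a snapshot URL, leaving
// the original capture target. The 4-14 digit timestamp may carry a
// modifier suffix such as "id_" (raw content); both forms are handled.
function stripWaybackPrefix(url) {
  const m = url.match(/^https?:\/\/web\.archive\.org\/web\/\d{4,14}[a-z_]*\/(.*)$/);
  return m ? m[1] : url; // pass through URLs that are not Wayback snapshots
}
```

A batch tool could run each of the 1200 snapshot URLs through this before recording, so the WARC contains the original URLs rather than the web.archive.org replay URLs.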

But I suppose there are other use cases for batch-recording a list of urls like web-monitoring a bunch of blogs, doing a time-based change analysis of websites etc...

Anyway, I'm curious what you come up with...

fbuchinger commented 7 years ago

Since I need this batch-capture feature quite urgently, I might implement it as a Tampermonkey script that gets injected into https://webrecorder.io/(project)/(collection)/$new and calculates the needed URLs upfront. Then my userscript would submit the first URL for capture, wait for the load event of the #replay_iframe, and continue with the next URL.

Are there any JavaScript APIs that can help me do this? Or do I need to fake keyboard input in the URL field?
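The approach described above could look roughly like this. This is a sketch only: the account/collection/session names are placeholder assumptions, the `#replay_iframe` selector is taken from the comment above and may change between Webrecorder releases, and a real script would need to persist its queue across page navigations (e.g. with `GM_setValue`/`GM_getValue`):

```javascript
// ==UserScript==
// @name   Webrecorder batch capture (sketch)
// @match  https://webrecorder.io/*
// ==/UserScript==

// Placeholder identifiers -- substitute your own account details.
const USER = 'myuser', COLL = 'mycoll', SESSION = 'batch';

function recordUrlFor(target) {
  return `https://webrecorder.io/${USER}/${COLL}/${SESSION}/record/${target}`;
}

function captureNext() {
  // The queue must survive navigation; localStorage is used here for
  // simplicity, GM_setValue/GM_getValue would also work.
  const queue = JSON.parse(localStorage.getItem('batchQueue') || '[]');
  const next = queue.shift();
  if (!next) return; // queue exhausted
  localStorage.setItem('batchQueue', JSON.stringify(queue));
  window.location.href = recordUrlFor(next);
}

if (typeof document !== 'undefined') { // skip when run outside a browser
  const frame = document.querySelector('#replay_iframe');
  if (frame) {
    // Wait for the recorded page to load, plus a grace period for late
    // subresources, before moving on to the next URL.
    frame.addEventListener('load', () => setTimeout(captureNext, 5000));
  } else {
    captureNext();
  }
}
```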

ikreymer commented 7 years ago

We currently have an experimental system for patching existing archives, and I am looking at an extraction mode feature as well.

For automation, that is something that is still being planned, but there are a few things you could use in the meantime that come pretty close if this is urgent.

You can start recording simply by loading: https://webrecorder.io/<user>/<collection>/<recording session>/record/<url> and then waiting for the page to load.

If you'd like to use the remote browser mode, then you can use https://webrecorder.io/<user>/<coll>/<recording session>/record/$br:chrome:53/<url> which will load the url in a remote browser and will record the entire page.

However, the extraction mode is not yet in place, so if you're using it with existing archives, it won't get the 'raw' content. We are working on supporting that feature as well and can let you know when it's ready for testing (soon).
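The record-by-URL approach described above can be driven from a small script. A minimal sketch, assuming placeholder account/collection/session names, Node 18+ for the global `fetch()`, and a logged-in session cookie copied from your browser's dev tools:

```javascript
// Hypothetical batch driver: request each record URL in sequence so the
// server captures the page. The base URL is a placeholder assumption.
const BASE = 'https://webrecorder.io/myuser/mycoll/batch/record/';

function recordUrl(target) {
  return BASE + target;
}

async function captureAll(targets, cookie) {
  for (const target of targets) {
    // Loading the record URL starts the capture; the response body is
    // the replay page itself, which we don't need here.
    const res = await fetch(recordUrl(target), { headers: { cookie } });
    console.log(res.status, target);
  }
}
```

Note that a plain HTTP fetch will not execute the target page's JavaScript, so dynamically loaded subresources may be missed; the remote-browser mode mentioned above avoids that limitation.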

fbuchinger commented 7 years ago

Thanks! I could easily create a WARC archive of my snapshots using your URL API.

ikreymer commented 7 years ago

We also have some new "extraction" features that are now in beta testing. If you're interested in trying them out, send us your username at support@webrecorder.io and we can enable it for your account.

jotjot commented 6 years ago

I have tried the solution shown here, and it seems that <collection> and <recording session> must already exist. Is this correct? Apart from that question, it worked great for me.

ikreymer commented 6 years ago

Yes, the extraction mode has been available since last year. If you enter a URL from another supported web archive, Webrecorder will automatically detect this and enter extraction mode.

Here's our announcement of this feature: http://rhizome.org/editorial/2017/jul/12/webrecorder-announcement/

The second part of the question (automated capture of a list) is also in development!

jotjot commented 6 years ago

Sorry, it seems I was not clear. I do: "You can start recording simply by loading: https://webrecorder.io/<user>/<collection>/<recording session>/record/<url> and then waiting for the page to load." But it seems to me that the <collection> and the <recording session> must already exist; they are not created automatically by accessing this address. It would be great if I could provide a "random" <recording session> that would be created upon opening the <url>.

peterk commented 6 years ago

+1. A simple API would be great for small scale recording and automation. Some use cases:

  1. Pushing a single web page to be recorded by Webrecorder from a different social media harvester (e.g. SFM).
  2. Recording an Instagram account page and also being able to automatically click through each image to capture comments.
  3. Be able to create a new collection.

JanTappe commented 5 years ago

Hi all, amazing tool! Great work! Is it in accordance with Instagram's terms of use, particularly with "... You can't attempt to create accounts or access or collect information in unauthorized ways. This includes creating accounts or collecting information in an automated way without our express permission." (https://help.instagram.com/581066165581870)?