coala / coala-bears

Bears for coala
https://coala.io/
GNU Affero General Public License v3.0

Linked webpage archiver #1322

Open jayvdb opened 7 years ago

jayvdb commented 7 years ago

Currently we have a bear that checks for invalid links: InvalidLinkBear.

It would be good to also submit links to an archiving service like Internet Archive, so there is a backup when the site goes offline.

https://github.com/fossasia/gci16.fossasia.org/issues/922

As this would call external web services from CI, requests should be batched at least per file if possible, and potentially even across CI runs using the coala cache.
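A rough sketch of the per-file batching idea, assuming a hypothetical helper `links_to_submit` and a plain set standing in for whatever the coala cache would provide. The URL pattern is deliberately simple and is not what InvalidLinkBear uses.

```python
import re

# Deliberately simple pattern for illustration only; a real bear would
# probably reuse the link extraction InvalidLinkBear already performs.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def links_to_submit(lines, already_archived):
    """Return the unique links in `lines` that are not in `already_archived`.

    `already_archived` is a stand-in (assumption) for links remembered
    from earlier CI runs, e.g. loaded from a cache.
    """
    seen = set()
    batch = []
    for line in lines:
        for url in URL_PATTERN.findall(line):
            url = url.rstrip(".,;")  # drop trailing sentence punctuation
            if url in seen or url in already_archived:
                continue
            seen.add(url)
            batch.append(url)
    return batch
```

With something like this, each file yields one deduplicated batch, and links remembered from a previous run are never submitted again.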

It would be best to use a Memento client where possible to check whether a backup already exists.
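For illustration, a Memento lookup (RFC 7089) can be sketched with the standard library alone: ask a TimeGate for the URL with an `Accept-Datetime` header and look for `Memento-Datetime` in the response. The helper names are hypothetical, and using `https://web.archive.org/web/` as the default TimeGate is an assumption. A real bear might prefer an existing Memento client library.

```python
import datetime
import email.utils
import urllib.error
import urllib.request

# Assumption: the Internet Archive's Wayback endpoint acts as a TimeGate.
DEFAULT_TIMEGATE = "https://web.archive.org/web/"


def timegate_request(url, timegate=DEFAULT_TIMEGATE, when=None):
    """Build (but do not send) a Memento TimeGate request for `url`."""
    when = when or datetime.datetime.now(datetime.timezone.utc)
    return urllib.request.Request(
        timegate + url,
        headers={
            "Accept-Datetime": email.utils.format_datetime(when, usegmt=True)
        },
        method="HEAD",
    )


def has_memento(url, timegate=DEFAULT_TIMEGATE, timeout=10):
    """Return True if the TimeGate resolves `url` to an archived copy."""
    request = timegate_request(url, timegate)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.headers.get("Memento-Datetime") is not None
    except urllib.error.HTTPError:
        # A 404 from the TimeGate is taken to mean "no copy known".
        return False
```

Only links for which `has_memento` returns `False` would then need to be submitted for archiving.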

m1guelpf commented 7 years ago

👍

jayvdb commented 7 years ago

There are many free webpage archiving services, but there does not appear to be a common API for submitting archiving requests.

See https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives .

Many use the Wayback Machine software, so it should be possible to switch between different Wayback-powered archiving services with this bear. (Maybe that should be a separate bug?)
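One way to keep the service switchable is to treat the archive as configuration. The sketch below uses a hypothetical `WaybackService` class. The `/save/` path matches the Internet Archive's "Save Page Now" URL, but whether other Wayback deployments expose the same path is an assumption, which is why it is a parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WaybackService:
    """Hypothetical description of one Wayback-powered archive."""

    base_url: str
    # Assumption: other deployments may use a different capture path.
    save_path: str = "/save/"

    def save_url(self, target):
        """Return the URL that would request a capture of `target`."""
        return self.base_url.rstrip("/") + self.save_path + target


INTERNET_ARCHIVE = WaybackService("https://web.archive.org")
```

A bear setting could then select the service, for example `WaybackService("https://archive.example.org", "/capture/")` for a made-up deployment with a different path.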

The other services will all have their own unique submission process, and I doubt we want to support them in this first version of the bear. Someone else can add other services if they want, after this bear exists ;-)

One difficult part of this task is how to test it. Maybe we can use a service like httpbin.org to generate a unique webpage for each test run, and archive it ;-)
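A minimal sketch of the httpbin idea, assuming httpbin's `/anything/<path>` endpoint, which echoes the request back, so a random path segment gives a URL that no archive has seen before. The function name is hypothetical.

```python
import uuid

HTTPBIN_BASE = "https://httpbin.org/anything/"


def unique_test_url():
    """Return a fresh httpbin URL that has almost certainly never been archived."""
    return HTTPBIN_BASE + uuid.uuid4().hex
```

A test could first check that no archived copy of this URL exists, run the bear on a file containing it, and then check that a copy has appeared. Such a test would still depend on network access and on the archive responding in time.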