arquivo / pwa-technologies

Arquivo.pt's main goal is to preserve and provide access to web content that is no longer available online. While developing the PWA IR (information retrieval) system, we faced limitations in search speed, quality of results, scalability and usability. To cope with these, we modified the archive-access project (http://archive-access.sourceforge.net/) to support our web archive IR requirements. The Nutchwax, Nutch and Wayback code was adapted to meet the requirements. Several optimizations were added, such as simplifications in the way document versions are searched, and several bottlenecks were resolved. The PWA search engine is a public service at http://archive.pt and a research platform for web archiving. Like its predecessor Nutch, it runs over Hadoop clusters for distributed computing following the map-reduce paradigm. Its major features include fast full-text search, URL search, phrase search, faceted search (date, format, site), and sorting by relevance and date. The PWA search engine is highly scalable and its architecture is flexible enough to enable the deployment of different configurations to respond to different needs. Currently, it serves an archive collection searchable by full-text with 180 million documents ranging between 1996 and 2010.
http://www.arquivo.pt
GNU General Public License v3.0

Memento native support #100

arquivo closed 8 years ago

arquivo commented 9 years ago

Originally reported on Google Code with ID 101

Change the Wayback software, or add a Memento proxy over the Open Search API, to natively
support the Memento protocol.

If we cannot fully support the Memento protocol, try to support only the TimeMap component
of the protocol.

Check documentation:
- Archive.is (now archive.today) supports Memento. Michael Nelson wrote up a nice
walkthrough of the functionality: http://ws-dl.blogspot.com/2013/07/2013-07-09-archiveis-supports-memento.html

- Memento 101 slides:
http://ws-dl.blogspot.com/2014/08/2014-08-26-memento-101-overview-of.html

- The formal spec is RFC 7089 and the Pattern that you would typically implement for
TimeGates is http://www.mementoweb.org/guide/rfc/#Pattern2.1 (same as archive.today).
You would also implement TimeMaps.

- We have a TimeGate/TimeMap server that you could perhaps just run next to your archive.
It could translate between the version API of the archive and the Memento protocol.
We have developed such translations for GitHub, arXiv.org and MediaWikis. See https://github.com/mementoweb/timegate
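
As a rough illustration of what a TimeGate/TimeMap translation layer has to do, here is a minimal Python sketch. It assumes the archive's version API has already returned a list of capture datetimes and replay URIs for a given original URI; the function names and all `example.org` URIs are hypothetical and do not come from the mementoweb/timegate project. A TimeGate following Pattern 2.1 would answer with a 302 redirect to the capture chosen by `closest_capture`, and the TimeMap would be served as `application/link-format`.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def closest_capture(captures, accept_datetime):
    """Pick the capture datetime nearest to an Accept-Datetime header value.

    `captures` is a list of timezone-aware datetimes (assumed to come from
    the archive's own version API). `accept_datetime` is an HTTP date string
    such as "Thu, 01 Apr 2010 12:00:00 GMT".
    """
    target = parsedate_to_datetime(accept_datetime)
    return min(captures, key=lambda c: abs((c - target).total_seconds()))


def timemap_link_format(original_uri, timegate_uri, timemap_uri, mementos):
    """Serialize (datetime, memento_uri) pairs as a link-format TimeMap.

    Datetimes must be timezone-aware UTC values; the URIs are whatever the
    archive exposes (hypothetical in the usage example).
    """
    lines = [
        f'<{original_uri}>; rel="original"',
        f'<{timegate_uri}>; rel="timegate"',
        f'<{timemap_uri}>; rel="self"; type="application/link-format"',
    ]
    for when, uri in sorted(mementos):
        http_date = format_datetime(when, usegmt=True)
        lines.append(f'<{uri}>; rel="memento"; datetime="{http_date}"')
    return ",\n".join(lines)


if __name__ == "__main__":
    original = "http://www.caleida.pt/saramago/"
    # Hypothetical replay URIs; a real stub would build them from API results.
    mementos = [
        (datetime(2010, 3, 1, tzinfo=timezone.utc),
         "http://archive.example.org/wayback/20100301000000/" + original),
        (datetime(2008, 1, 1, tzinfo=timezone.utc),
         "http://archive.example.org/wayback/20080101000000/" + original),
    ]
    print(timemap_link_format(
        original,
        "http://archive.example.org/timegate/" + original,
        "http://archive.example.org/timemap/link/" + original,
        mementos,
    ))
    print(closest_capture([m[0] for m in mementos],
                          "Thu, 01 Apr 2010 12:00:00 GMT"))
```

The only archive-specific part is producing the list of captures; everything else is generic, which is presumably why a stand-alone server with a small per-archive stub is feasible.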

http://mementoweb.org/depot/proxy/webarchives/
(they even give AWP as an example)

http://mementoweb.org/depot/native/ia/

http://www.mementoweb.org/guide/quick-intro/

http://www.mementoweb.org/

http://www.iswc2013.semanticweb.org/sites/default/files/iswc_poster_10.pdf

Reported by danielcoelhogomes on 2014-12-12 11:03:33

arquivo commented 9 years ago
Additional info provided by Herbert Van de Sompel to support Memento:

From my perspective, I see two ways that would
allow you to add Memento support:

=> Install OpenWayback, which natively comes with Memento support

=> Overlay Memento functionality on your current archive-access
software. In that regard, the task can be split into two components:
(1) Operate a TimeGate/TimeMap that interacts with your current API:
We are developing a stand-alone TimeGate/TimeMap server intended to be
able to interoperate with a lot of different versioning systems. The
current version is at https://github.com/mementoweb/timegate . In
order to create a Memento-compliant interface to your system, the only
thing that needs to happen is to develop a stub that uses your API to
collect XML and map it to Memento. This is something we can even do on
your behalf. We have done it, so far, for e.g. GitHub, arXiv.org,
Wikipedia, ...

(2) Add Memento HTTP response headers when your archive serves
Mementos. That is something that we can obviously not do. Only you can
do it but it should be really straightforward to accomplish. It is
about adding a Memento-Datetime header that contains the archival
datetime of the Memento, and about adding a Link header that provides
a link with the "original" relation type pointing at the original
resource for the Memento and, recommended, also a link to the TimeGate
and TimeMap. This is really basic to achieve.
With those in place, your current system could be compliant in no time.
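
The header part described in (2) could look roughly like the Python sketch below. It only builds the header values; how they are attached to the replay response depends on the archive software. The function name `memento_headers` and the `example.org` TimeGate/TimeMap URIs are assumptions made for illustration. The header names and the `original`, `timegate` and `timemap` relation types are the ones defined by RFC 7089.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_headers(original_uri, capture_time, timegate_uri=None, timemap_uri=None):
    """Build the Memento response headers for one archived capture.

    `capture_time` must be a timezone-aware datetime; it is converted to UTC
    and formatted as an HTTP date. The TimeGate and TimeMap links are
    recommended rather than required, so they are optional here.
    """
    links = [f'<{original_uri}>; rel="original"']
    if timegate_uri:
        links.append(f'<{timegate_uri}>; rel="timegate"')
    if timemap_uri:
        links.append(f'<{timemap_uri}>; rel="timemap"; type="application/link-format"')
    return {
        "Memento-Datetime": format_datetime(
            capture_time.astimezone(timezone.utc), usegmt=True
        ),
        "Link": ", ".join(links),
    }


if __name__ == "__main__":
    original = "http://www.caleida.pt/saramago/"
    headers = memento_headers(
        original,
        datetime(2010, 4, 1, 12, 0, 0, tzinfo=timezone.utc),
        # Hypothetical endpoints; substitute the archive's real ones.
        timegate_uri="http://archive.example.org/timegate/" + original,
        timemap_uri="http://archive.example.org/timemap/link/" + original,
    )
    for name, value in headers.items():
        print(f"{name}: {value}")
```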

Reported by danielcoelhogomes on 2014-12-15 15:23:21

arquivo commented 9 years ago

Reported by danielcoelhogomes on 2014-12-15 15:23:39

Fernando-Melo commented 8 years ago

According to the Memento team, the new version with PyWb passes all the Memento validator tests:

http://mementoweb.org/tools/validator/info.html?uri=http%3A%2F%2Fp27.arquivo.pt%2Fpywb%2Freplay%2Fhttp%3A%2F%2Fwww.caleida.pt%2Fsaramago%2F&type=timegate&datetime=Sun%2C+01+Apr+2010+12%3A00%3A00+GMT&aggregator=http%3A%2F%2Ftimetravel.mementoweb.org%2Ftimegate%2F&follow=on&full_timemap=on&Submit=Submit