arquivo / pwa-technologies

Arquivo.pt's main goal is to preserve and provide access to web content that is no longer available online. During the development of the PWA IR (information retrieval) system we faced limitations in search speed, quality of results, scalability and usability. To cope with this, we modified the archive-access project (http://archive-access.sourceforge.net/) to support our web archive IR requirements. The Nutchwax, Nutch and Wayback code was adapted to meet these requirements. Several optimizations were added, such as simplifications in the way document versions are searched, and several bottlenecks were resolved. The PWA search engine is a public service at http://archive.pt and a research platform for web archiving. Like its predecessor Nutch, it runs over Hadoop clusters for distributed computing following the map-reduce paradigm. Its major features include fast full-text search, URL search, phrase search, faceted search (date, format, site), and sorting by relevance and date. The PWA search engine is highly scalable and its architecture is flexible enough to allow different configurations to be deployed for different needs. Currently, it serves an archive collection of 180 million documents, searchable by full text and dating from 1996 to 2010.
http://www.arquivo.pt
GNU General Public License v3.0

Change arquivo404 arquivo.pt archiveApiUrl #1305

Closed (franciscoesteveira closed 1 year ago)

franciscoesteveira commented 1 year ago

Make arquivo404 request arquivo.pt/arquivo404 and redirect these requests to the Memento API URL.

This allows us to easily gather statistics about the usage of this service.

Also change the Internet Archive commented example to a working URL.

dcgomes commented 1 year ago
  1. Create a redirect from https://arquivo.pt/arquivo404server to https://arquivo.pt/wayback
  2. In https://arquivo.pt/arquivo404.js (https://github.com/arquivo/arquivo404/blob/master/arquivo404.js), change { timeout: 2000, archiveName: "Arquivo.pt", archiveApiUrl: "https://arquivo.pt/wayback/timemap/link/" } to { timeout: 2000, archiveName: "Arquivo.pt", archiveApiUrl: "https://arquivo.pt/arquivo404server/timemap/link/" }
  3. Deploy to production
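
To check that the three steps above fit together, here is a minimal sketch. It assumes the client builds its request by appending the missing page's URL to archiveApiUrl, and that the redirect only swaps the path prefix. The helpers buildTimemapUrl and applyRedirect are hypothetical names used for illustration; the real redirect would live in the web server configuration, not in JavaScript.

```javascript
// Settings copied from step 2 above.
const oldSettings = {
  timeout: 2000,
  archiveName: "Arquivo.pt",
  archiveApiUrl: "https://arquivo.pt/wayback/timemap/link/",
};
const newSettings = {
  ...oldSettings,
  archiveApiUrl: "https://arquivo.pt/arquivo404server/timemap/link/",
};

// Hypothetical helper: assumes the client appends the missing page's URL
// to archiveApiUrl to form the timemap request.
function buildTimemapUrl(settings, missingUrl) {
  return settings.archiveApiUrl + missingUrl;
}

// Hypothetical helper: models the step 1 redirect as a prefix swap.
function applyRedirect(requestUrl) {
  const from = "https://arquivo.pt/arquivo404server";
  const to = "https://arquivo.pt/wayback";
  return requestUrl.startsWith(from)
    ? to + requestUrl.slice(from.length)
    : requestUrl;
}

const missingPage = "http://example.com/"; // placeholder URL
const requested = buildTimemapUrl(newSettings, missingPage);
const redirected = applyRedirect(requested);

// After the redirect, the new request should land on the old endpoint.
console.log(redirected === buildTimemapUrl(oldSettings, missingPage));
```

Under these assumptions the client still reaches the same timemap endpoint, but every request now passes through a URL containing "arquivo404server", which is what makes the usage countable.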
dcgomes commented 1 year ago

This change was deployed to production on 6 October 2022. @PedroG1515, check the logs and awstats to see if we can get the required usage stats for arquivo404 (search for the string "arquivo404server").

dcgomes commented 1 year ago

Most likely we will have to get the usage stats from the logs, because awstats processes redirects as errors: http://logs.arquivo.pt/awstats/awstats.pl?month=10&year=2022&output=urldetail&config=arquivo.pt&framename=index. Google Analytics would not help either, because it requires running embedded JavaScript.
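
Getting the stats from the logs could look roughly like the sketch below, which counts the lines that contain the string "arquivo404server" and groups them by day. It assumes the access logs use the Common/Combined Log Format; the actual format on the Arquivo.pt servers is not stated here. The function name countArquivo404Hits and the sample lines are made up for illustration and are not real log data.

```javascript
// Count log lines that mention the marker string, grouped by day.
// Assumes timestamps look like [06/Oct/2022:10:15:02 +0000].
function countArquivo404Hits(logLines, marker = "arquivo404server") {
  const hitsPerDay = new Map();
  for (const line of logLines) {
    if (!line.includes(marker)) continue;
    const match = line.match(/\[(\d{2}\/\w{3}\/\d{4}):/);
    const day = match ? match[1] : "unknown";
    hitsPerDay.set(day, (hitsPerDay.get(day) || 0) + 1);
  }
  return hitsPerDay;
}

// Made-up sample lines, for illustration only.
const sampleLines = [
  '203.0.113.5 - - [06/Oct/2022:10:15:02 +0000] "GET /arquivo404server/timemap/link/http://example.com/ HTTP/1.1" 302 0',
  '203.0.113.9 - - [06/Oct/2022:11:20:45 +0000] "GET /wayback/timemap/link/http://example.com/ HTTP/1.1" 200 512',
  '203.0.113.7 - - [07/Oct/2022:09:01:13 +0000] "GET /arquivo404server/timemap/link/http://example.org/ HTTP/1.1" 302 0',
];

const hits = countArquivo404Hits(sampleLines);
for (const [day, count] of hits) {
  console.log(day, count);
}
```

Matching on the plain string keeps the count independent of the status code, so it works even though the redirect responses are what awstats reports as errors.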

PedroG1515 commented 1 year ago

Tested.