arquivo / pwa-technologies

Arquivo.pt's main goal is to preserve and provide access to web content that is no longer available online. During the development of the PWA IR (information retrieval) system we faced limitations in search speed, quality of results, scalability and usability. To cope with this, we modified the archive-access project (http://archive-access.sourceforge.net/) to support our web archive IR requirements. The Nutchwax, Nutch and Wayback code was adapted to meet these requirements. Several optimizations were added, such as simplifying the way document versions are searched, and several bottlenecks were resolved. The PWA search engine is a public service at http://archive.pt and a research platform for web archiving. Like its predecessor Nutch, it runs over Hadoop clusters for distributed computing following the map-reduce paradigm. Its major features include fast full-text search, URL search, phrase search, faceted search (date, format, site), and sorting by relevance and date. The PWA search engine is highly scalable, and its architecture is flexible enough to allow different configurations to be deployed for different needs. Currently, it serves a full-text searchable archive collection of 180 million documents dating from 1996 to 2010.
http://www.arquivo.pt
GNU General Public License v3.0
41 stars 7 forks

"mp_" after date breaks link to web-archived version #1386

Closed dcgomes closed 2 months ago

dcgomes commented 4 months ago

If we open: https://www.fct.pt/noticias/index.phtml.pt?id=595&/2020/10/Concurso_AI_4_COVID-19_selecionou_12_projetos

We get: [image]

However, the link "Veja uma versão arquivada desta página de 2023-3-10 em arquivo.pt" ("See an archived version of this page from 2023-3-10 on arquivo.pt") points to a wayback URL containing "mp_" after the timestamp: https://arquivo.pt/wayback/20230310111136mp_/https://former.fct.pt/noticias/index.phtml.pt?id=788&ano=2022&mes=3/Concurso_de_Projetos_de_I&D_recebeu_4101_candidaturas The replayed page is: [image]

Instead of the one replayed by the correct link without "mp_": https://arquivo.pt/wayback/20230310111136/https://former.fct.pt/noticias/index.phtml.pt?id=788&ano=2022&mes=3/Concurso_de_Projetos_de_I&D_recebeu_4101_candidaturas [image]

Is this a rewrite problem, or maybe some creative use of Arquivo404 on the fct.pt website (they use two domains for the website: www.fct.pt and former.fct.pt)?

VascoRatoFCCN commented 2 months ago

Found that it has to do with pywb's template engine, jinja2. For some reason, when a URL contains an & symbol, loading the "mp_" version of the page converts it into &amp;, which breaks the link. A clearer example:

https://arquivo.pt/wayback/20210201082911mp_///sapo.pt/?utm_source=bsu&utm_medium=web&utm_campaign=bsu_logo&utm_content=www.sapo.pt

After clicking the above link, the page isn't found because each & in the URL has been changed into &amp;: [image]

Manually replacing &amp; with & in the URL bar causes the page to load correctly: [image]

This ONLY happens when following mp_ links; removing the mp_ from the original link makes the & symbols render correctly: https://arquivo.pt/wayback/20210201082911///sapo.pt/?utm_source=bsu&utm_medium=web&utm_campaign=bsu_logo&utm_content=www.sapo.pt
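
For anyone wanting to reproduce the symptom outside pywb, here is a minimal sketch using only Python's standard library. It assumes that `html.escape` is a fair stand-in for what the template engine does to an `&` (both turn it into the `&amp;` entity); the shortened URL is just the sapo.pt example trimmed to two parameters.

```python
import html

# Shortened version of the sapo.pt example; two query parameters are enough.
url = "//sapo.pt/?utm_source=bsu&utm_medium=web"

# HTML escaping rewrites every "&" as the entity "&amp;".
escaped = html.escape(url)

# Unescaping mirrors the manual fix of editing the URL bar.
restored = html.unescape(escaped)

print(escaped)
print(restored == url)
```

If the escaped string is then used as a URL rather than as HTML text, the entity is never decoded, which would match the "page not found" behaviour described above.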

Related to https://github.com/arquivo/pwa-technologies/issues/1285 and https://github.com/arquivo/pywb-arquivo/commit/40bbcb8c933697e679bb3994dbcb59e390228c70

VascoRatoFCCN commented 2 months ago

Found the culprit: the top_url variable rendered by jinja2 in the head_insert_mobile.html template needed the safe modifier to render properly. I left a comment on https://github.com/webrecorder/pywb/issues/696 explaining this edge case.
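
To illustrate the effect of the safe modifier, here is a rough sketch with jinja2 on its own. The one-line templates are hypothetical and are not the actual contents of head_insert_mobile.html; only the variable name top_url comes from the comment above, and autoescaping is assumed to be enabled since that matches the symptom.

```python
from jinja2 import Environment

# Assumption: autoescaping is on, as the observed "&amp;" suggests.
env = Environment(autoescape=True)

url = "//sapo.pt/?utm_source=bsu&utm_medium=web"

# Hypothetical templates: the same variable rendered without and with "safe".
escaped = env.from_string("{{ top_url }}").render(top_url=url)
unescaped = env.from_string("{{ top_url | safe }}").render(top_url=url)

print(escaped)    # "&" rendered as "&amp;"
print(unescaped)  # URL rendered unchanged
```

Note that safe switches escaping off for that value, so it is only appropriate where the value is trusted or has already been sanitised.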