Closed by dcgomes 2 months ago
Found that it has to do with pywb's template engine, jinja2. For some reason, when a URL contains an `&` symbol, loading the `mp_` version of the page converts it into `&amp;`, breaking the link.
A clearer example of this:
After clicking the above link, the page isn't found because the `&` in the URL has been changed into `&amp;`:
Manually replacing `&amp;` with `&` in the URL bar causes the page to load correctly:
This ONLY happens when following `mp_` links; removing the `mp_` from the original link makes the `&` symbols render correctly:
https://arquivo.pt/wayback/20210201082911///sapo.pt/?utm_source=bsu&utm_medium=web&utm_campaign=bsu_logo&utm_content=www.sapo.pt
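To illustrate why the escaped form breaks the link, here is a minimal sketch using only the Python standard library. It is not pywb code: `html.escape` merely stands in for whatever escaping the template applies, and the `query_keys` helper is hypothetical.

```python
from html import escape, unescape
from urllib.parse import parse_qs, urlsplit

# URL taken from this issue.
url = (
    "https://arquivo.pt/wayback/20210201082911///sapo.pt/"
    "?utm_source=bsu&utm_medium=web&utm_campaign=bsu_logo&utm_content=www.sapo.pt"
)

# Stand-in for the HTML escaping step: every "&" becomes "&amp;".
escaped = escape(url, quote=False)


def query_keys(u: str) -> list[str]:
    """Return the sorted query parameter names of a URL (hypothetical helper)."""
    return sorted(parse_qs(urlsplit(u).query, separator="&"))


print(query_keys(url))      # parameter names as intended
print(query_keys(escaped))  # names after the first now start with "amp;"

# Undoing the escaping (what was done by hand in the URL bar) restores the URL.
restored = unescape(escaped)
```

Once the escaped URL is used as a real address, the server sees parameters named `amp;utm_medium` and so on, which is consistent with the "page not found" behaviour described above.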
Related to https://github.com/arquivo/pwa-technologies/issues/1285 and https://github.com/arquivo/pywb-arquivo/commit/40bbcb8c933697e679bb3994dbcb59e390228c70
Found the culprit: it's the `top_url` variable that's rendered by jinja2 in the `head_insert_mobile.html` template. It needed the `safe` filter to render properly. Left a comment on https://github.com/webrecorder/pywb/issues/696 explaining this edge case.
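The effect of the `safe` filter can be reproduced in isolation. The sketch below assumes a Jinja2 environment with autoescaping enabled. The one-line templates and the example URL are placeholders, not the actual contents of `head_insert_mobile.html`.

```python
from jinja2 import Environment

# Assumption: autoescaping is on, as it would need to be for "&" to be escaped.
env = Environment(autoescape=True)

# Placeholder value; in pywb this would be the URL passed to the template.
top_url = "https://example.com/?a=1&b=2"

# Without the filter, autoescaping rewrites "&" as "&amp;".
escaped = env.from_string("{{ top_url }}").render(top_url=top_url)

# With the filter, the value is emitted unchanged.
raw = env.from_string("{{ top_url | safe }}").render(top_url=top_url)

print(escaped)
print(raw)
```

Note that `safe` turns off escaping for that value entirely, so it is only appropriate where the value is trusted or is escaped by some other means.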
If we open: https://www.fct.pt/noticias/index.phtml.pt?id=595&/2020/10/Concurso_AI_4_COVID-19_selecionou_12_projetos
We get:
However, the link "Veja uma versão arquivada desta página de 2023-3-10 em arquivo.pt" ("See an archived version of this page from 2023-3-10 at arquivo.pt") points to a wayback URL containing "mp_" after the timestamp: https://arquivo.pt/wayback/20230310111136mp_/https://former.fct.pt/noticias/index.phtml.pt?id=788&ano=2022&mes=3/Concurso_de_Projetos_de_I&D_recebeu_4101_candidaturas
And the replayed page is:
Instead of the one replayed by the correct link without "mp_": https://arquivo.pt/wayback/20230310111136/https://former.fct.pt/noticias/index.phtml.pt?id=788&ano=2022&mes=3/Concurso_de_Projetos_de_I&D_recebeu_4101_candidaturas
Is this a rewrite problem, or maybe some creative use of Arquivo404 on the fct.pt website (they use two domains for the website: www.fct.pt and former.fct.pt)?