Closed: PedroG1515 closed this issue 1 year ago
Experimented with adding a limit of 4 items per 6-hour time window. Available for testing on the development environment:
https://dev.arquivo.pt/url/search?q=https://www.portugal.gov.pt/pt/gc22/
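For reference, a per-window limit like the one tried here could be sketched roughly as below. This is only an illustration, not the code that was deployed: the function name `limit_per_window`, the 14-digit `YYYYMMDDhhmmss` timestamp strings, and the choice of fixed windows aligned to midnight (00-06, 06-12, ...) are all assumptions.

```python
from collections import defaultdict
from datetime import datetime

# Assumed defaults, taken from the "4 items per 6h" experiment described above.
WINDOW_HOURS = 6
MAX_PER_WINDOW = 4


def limit_per_window(timestamps, window_hours=WINDOW_HOURS, max_items=MAX_PER_WINDOW):
    """Keep at most `max_items` captures in each fixed `window_hours` window.

    `timestamps` are 14-digit strings (YYYYMMDDhhmmss). Windows are aligned
    to midnight, which is an assumption of this sketch.
    """
    counts = defaultdict(int)
    kept = []
    for ts in sorted(timestamps):
        dt = datetime.strptime(ts, "%Y%m%d%H%M%S")
        # A window is identified by its calendar day and its slot within the day.
        bucket = (dt.date(), dt.hour // window_hours)
        if counts[bucket] < max_items:
            counts[bucket] += 1
            kept.append(ts)
    return kept
```

Note that a filter of this kind keeps the earliest captures in each window and silently drops the rest, which is the exclusion risk raised in the next comment.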
Removed the limit of 4 items per 6 hours. We decided to handle this as a crawling issue rather than a front-end issue, because we feared the filtering might end up excluding content that users would be interested in.
We'll treat this as an isolated case and revisit this if it happens again.
Recheck after deploying the new CDXJ files with the warc/revisit records cleaned out.
Deploying cleaned up CDXJ files did not fix the issue. Postponing this issue to the next milestone.
The CDXJ entries for this URL are not warc/revisit records, and they reference different contents.
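A check like this can be sketched as follows. It assumes the common CDXJ layout of `urlkey timestamp {json}` per line, with `mime` and `digest` fields in the JSON block; the helper names and the sample lines are made up for illustration and are not taken from the real index.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (urlkey, timestamp, fields)."""
    urlkey, timestamp, payload = line.strip().split(" ", 2)
    return urlkey, timestamp, json.loads(payload)


def summarize(lines):
    """Return (number of revisit entries, number of distinct non-revisit digests)."""
    revisits = 0
    digests = set()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _, _, fields = parse_cdxj_line(line)
        if fields.get("mime") == "warc/revisit":
            revisits += 1
        else:
            digests.add(fields.get("digest"))
    return revisits, len(digests)


# Made-up sample entries, for illustration only.
SAMPLE = [
    'pt,gov,portugal)/pt/gc22 20220101000000 {"mime": "text/html", "digest": "AAA"}',
    'pt,gov,portugal)/pt/gc22 20220101060000 {"mime": "text/html", "digest": "BBB"}',
    'pt,gov,portugal)/pt/gc22 20220101120000 {"mime": "warc/revisit", "digest": "AAA"}',
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

If the revisit count is zero and the number of distinct digests is close to the number of entries, the captures really do differ in content, which matches what was observed for this URL.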
To tackle this issue we adopted the following procedure:
After this was implemented, it now shows a much more reasonable number of results:
This also fixed #1308
What is the URL that originated the issue? https://arquivo.pt/url/search?q=https://www.portugal.gov.pt/pt/gc22/
What happened? For 2022 we already have 14105 versions of the same URL. We already checked, and they are all different sizes (maybe due to the website's dynamic content).
What should have happened? We should have some kind of limit or filter.
Screenshots: ![image](https://user-images.githubusercontent.com/25795364/151992965-c195221f-dd00-4f74-a944-0f8551ab1f1a.png)