Status: Closed (staylor-ds closed this issue 3 years ago)
Thanks for the notice, @staylor-ds! The news sitemap which lists these duplicates is now excluded.
Solved by blocking the news sitemap. Blocking is done by setting the nextFetchDate in the status index to a date far in the future.
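For readers who want to see what such a block might look like, here is a minimal sketch that only builds the update payload. It assumes the status index is an Elasticsearch-style document store, that document IDs are the SHA-256 hex digest of the URL, and that dates are stored as ISO-8601 UTC strings. None of these are confirmed in this thread. The sitemap URL and the function name `block_url_update` are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

# Assumption: any date far enough ahead that the scheduler never picks the URL up again.
FAR_FUTURE = datetime(2099, 12, 31, tzinfo=timezone.utc)


def block_url_update(url, next_fetch=FAR_FUTURE):
    """Return (doc_id, body) for a partial update that postpones the next fetch.

    Assumption: the document ID is the SHA-256 hex digest of the URL.
    Only the payload is built here; nothing is sent to any server.
    """
    doc_id = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # Normalise to UTC before formatting so the trailing "Z" is accurate.
    stamp = next_fetch.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    body = {"doc": {"nextFetchDate": stamp}}
    return doc_id, body


# Hypothetical sitemap URL, used for illustration only.
doc_id, body = block_url_update("https://example.com/news-sitemap.xml")
print(doc_id)
print(json.dumps(body))
```

The payload would then be sent as a partial document update for that ID, so that the other status fields stay untouched.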
There appears to be the same problem with https://www.diariodeburgos.es/noticia/Z26A4505D-FD6A-2B2C-77AEFE1D801F5BA1/202101/10-burgaleses-al-dia-en-urgencias-por-resbalar-por-el-hielo
and 1506 other urls pointing to an article with the title "Messi llega caliente a la Champions".
I found this url in https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2021/02/CC-NEWS-20210213235332-00221.warc.gz
Thanks again, @staylor-ds! Solved the same way. Verified that there are no other news sites which follow the same sitemap name pattern and are also affected.
There regularly appear to be thousands of duplicate articles from this domain, always with identical initial paths but ending with different slugs.
For example, I have noticed 2162 entries starting with
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863
that all appear to have identical content (the article title is "El Senado de EEUU aprueba la legalidad del 'impeachment'" for all of these URLs). Here are some example URLs:
These were found in https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2021/02/CC-NEWS-20210210002910-00147.warc.gz
The problem appears to have started on 1 Feb 2021, with the volume of pages from this site rising from ~50 per day to ~12,000 per day.
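The pattern reported here (many URLs sharing the same initial path but ending in different slugs) can be spotted by grouping URLs on their leading path segments and counting each group. The sketch below is one way to do that. It assumes the first two path segments (`noticia` plus the article ID) identify an article, which is inferred from the URLs quoted in this thread. The slugs in the sample list and the function names are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def article_prefix(url, depth=2):
    """Return scheme, host and the first `depth` path segments of a URL.

    Assumption: depth=2 ("noticia" + article ID) is enough to identify an article.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return f"{parts.scheme}://{parts.netloc}/" + "/".join(segments[:depth])


def suspicious_prefixes(urls, min_count=2):
    """Map each prefix shared by at least `min_count` URLs to its URL count."""
    counts = Counter(article_prefix(u) for u in urls)
    return {prefix: n for prefix, n in counts.items() if n >= min_count}


base = "https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863"
sample = [
    base + "/202102/slug-a",  # hypothetical slug
    base + "/202102/slug-b",  # hypothetical slug
    "https://www.diariodeavila.es/noticia/OTHER-ID/202102/slug-c",  # hypothetical
]
print(suspicious_prefixes(sample))
```

In practice `min_count` would be set much higher than 2, since the counts reported in this thread are in the thousands per article.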