commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

Odd duplicate content behaviour on www.diariodeavila.es domain #44

Closed: staylor-ds closed this issue 3 years ago

staylor-ds commented 3 years ago

There regularly appear to be thousands of duplicate articles from this domain, always with an identical initial path but ending in a different slug.

For example, I have noticed 2162 entries starting with https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863 that all appear to have identical content (the article title is "El Senado de EEUU aprueba la legalidad del 'impeachment'" for all of these URLs). Here are some example URLs:

 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/111m-para-crear-una-red-de-areas-de-descanso-para-camiones
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/120-contagiados-y-un-fallecido-balance-covid-de-la-jornada
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/124-castellanos-y-leoneses-han-recibido-ya-la-segunda-dosis
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/13-enmiendas-de-xav-por-14-millones-a-las-cuentas-regionales
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/14-positivos-en-el-cribado-de-la-zona-de-madrigal
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/160-contagios-mas-en-avila-en-un-dia-sin-fallecidos-covid
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/16-positivos-en-el-primer-dia-de-cribado-en-cebreros
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/170-nuevos-casos-covid-y-un-fallecido-en-el-hospital
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/172-positivos-covid-mas-y-casi-medio-centenar-de-ingresados
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/180-casos-y-un-fallecido-por-covid-balance-del-dia
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/184-detenidos-en-la-tercera-noche-de-disturbios-en-paises-bajos
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/200-incidencias-en-la-red-de-abastecimiento-por-el-temporal
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/2020-deja-la-menor-cifra-de-empleados-publicos-del-decenio
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/2020-dejo-la-menor-cifra-de-muertos-en-carreteras-abulenses
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/2020-dejo-una-caida-minima-en-la-afiliacion-de-extranjeros
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/2020-un-buen-ano-para-el-cerro-gallinero
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/2145-vacunas-contra-la-covid-salen-de-avila
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/2186-nuevos-casos-la-cifra-mas-alta-desde-noviembre

These were found in https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2021/02/CC-NEWS-20210210002910-00147.warc.gz

The problem appears to have started on 1 Feb 2021, with the volume of pages from this site rising from ~50 per day to ~12000 per day.
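The pattern described above (many URLs sharing the same host and article ID but ending in different slugs) could be spotted with a sketch like the following. It is only an illustration: the helper names `article_key` and `count_by_article` are hypothetical, and it assumes the article ID is always the second path segment, as in the example URLs.

```python
from collections import Counter
from urllib.parse import urlsplit


def article_key(url: str) -> str:
    """Return host plus the first two path segments, e.g. host/noticia/<ID>.

    Assumption: the article ID is the second path segment, as in the
    example URLs from this issue.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return "/".join([parts.netloc] + segments[:2])


def count_by_article(urls):
    """Count how many URLs map to each host/noticia/<ID> key."""
    return Counter(article_key(u) for u in urls)


urls = [
    "https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/111m-para-crear-una-red-de-areas-de-descanso-para-camiones",
    "https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/14-positivos-en-el-cribado-de-la-zona-de-madrigal",
    "https://www.diariodeburgos.es/noticia/Z26A4505D-FD6A-2B2C-77AEFE1D801F5BA1/202101/10-burgaleses-al-dia-en-urgencias-por-resbalar-por-el-hielo",
]

counts = count_by_article(urls)
# Keys seen with more than one slug are candidates for duplicate content.
suspicious = {key: n for key, n in counts.items() if n > 1}
```

A key that appears with many different slugs is only a hint; confirming that the content is really identical still requires comparing the page bodies or titles.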

sebastian-nagel commented 3 years ago

Thanks for the notice, @staylor-ds! The news sitemap which lists these duplicates is now excluded.

sebastian-nagel commented 3 years ago

Solved by blocking the news sitemap. Blocking is done by setting the nextFetchDate in the status index to a date far in the future.
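As a rough illustration of that kind of block, the sketch below builds a request body in the shape of Elasticsearch's update-by-query API. It is not the command actually used here: it assumes the status index is an Elasticsearch index whose documents have a `url` field that can be matched with a `term` query, and the sitemap URL, the function name `build_block_request`, and the chosen date are all hypothetical. Only the `nextFetchDate` field name comes from the comment above.

```python
import json


def build_block_request(sitemap_url: str,
                        far_future: str = "2099-12-31T00:00:00.000Z") -> dict:
    """Build an update-by-query body that pushes nextFetchDate far ahead.

    Assumptions: the status index stores the URL in a `url` field that
    supports exact `term` matching, and nextFetchDate accepts an ISO 8601
    timestamp. The far-future date is arbitrary.
    """
    return {
        "query": {"term": {"url": sitemap_url}},
        "script": {
            "lang": "painless",
            "source": "ctx._source.nextFetchDate = params.date",
            "params": {"date": far_future},
        },
    }


# Hypothetical sitemap URL; the real one is not given in this thread.
body = build_block_request("https://news.example.com/sitemap-news.xml")
print(json.dumps(body, indent=2))
```

The body would be sent to the status index's `_update_by_query` endpoint. Once the sitemap's next fetch date lies far in the future it is no longer scheduled for fetching, so the duplicate URLs it lists stop being discovered.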

staylor-ds commented 3 years ago

There appears to be the same problem with https://www.diariodeburgos.es/noticia/Z26A4505D-FD6A-2B2C-77AEFE1D801F5BA1/202101/10-burgaleses-al-dia-en-urgencias-por-resbalar-por-el-hielo and 1506 other URLs pointing to an article with the title "Messi llega caliente a la Champions".

I found this URL in https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2021/02/CC-NEWS-20210213235332-00221.warc.gz

sebastian-nagel commented 3 years ago

Thanks again, @staylor-ds! Solved the same way. Verified that there are no other news sites which follow the same sitemap name pattern and are also affected.