MassiveSum: A large summarization dataset for 92 languages, including 13 Indian languages with ~1.9 million article-summary pairs
The sources were curated manually, and the articles were downloaded from archive.org.
The summaries are mined from metadata in the HTML, such as the 'og:description' meta tag, and are therefore supposedly diverse. However, on some of the Indian-language websites I checked, the headline and the metadata title are the same.
So this dataset could be closer to a headline generation set than a summarization set.
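The check described above can be sketched roughly as follows. This is a minimal illustration using only Python's standard library, not the extraction code from the MassiveSum repo. It reads the 'og:description' value as the summary and compares it with the page headline, taken here from 'og:title' or, failing that, the <title> element; that choice of headline source is an assumption. The names MetaExtractor and summary_is_headline are hypothetical.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects <meta> property/name -> content pairs and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Open Graph tags use "property"; plain meta tags use "name".
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"].strip()
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summary_is_headline(html):
    """Return True if the og:description summary just repeats the headline.

    Assumption: the headline is og:title if present, else the <title> text.
    """
    parser = MetaExtractor()
    parser.feed(html)
    summary = parser.meta.get("og:description", "")
    headline = parser.meta.get("og:title") or parser.title.strip()
    # Case-insensitive exact match; an empty summary never counts as a match.
    return bool(summary) and summary.casefold() == headline.casefold()
```

Running summary_is_headline over a sample of pages per source and counting the True results would give a rough estimate of how many pairs are really headline pairs. An exact match is a strict test; a fuzzy overlap measure may be more appropriate for near-duplicates.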
Paper: https://aclanthology.org/2021.emnlp-main.797/ Repo: https://github.com/danielvarab/massive-summ