eklem / stopword-sami

Sami stopword lists for natural language processing. Example uses include search engines, machine learning, and chatbots.
MIT License

I need text sources #17

Open eklem opened 2 years ago

eklem commented 2 years ago

If I get hold of more text, or a site I can crawl, for any Sami language other than North, Lule, and South Sami, I'll create a stopword list for it too.

It doesn't matter if the language is not spoken in Norway.
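
For anyone wondering what kind of text is useful here, a rough sketch of how stopword candidates could be pulled from a collection of texts. It assumes that candidates are simply the most frequent words, which may not be how the lists in this repo are actually built. The function name `stopword_candidates` and the `top_n` parameter are made up for illustration.

```python
import re
from collections import Counter


def stopword_candidates(texts, top_n=100):
    """Return the top_n most frequent lowercase words across all texts.

    Assumption: high frequency is a reasonable first signal for stopwords.
    The result still needs manual review by someone who knows the language.
    """
    counts = Counter()
    for text in texts:
        # [^\W\d_]+ matches runs of letters only (Unicode-aware in Python 3),
        # so Sami letters are kept while digits and underscores are dropped.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return [word for word, _ in counts.most_common(top_n)]
```

The more running text per language, the more stable the ranking becomes, which is why longer sources are the most helpful.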

gunnarvelle commented 2 years ago

Sitemap listings for Southern Sami on ndla.no:

https://ndla.no/sitemap-urn-subject-1-11c4696f-e844-4c98-8df7-49d43f59ec33.txt
https://ndla.no/sitemap-urn-subject-1-a532138d-e16a-4046-a46e-bd5bc9487b8b.txt
https://ndla.no/sitemap-urn-subject-1-a5d7da3a-8a19-4a83-9b3f-3c855621df70.txt
https://ndla.no/sitemap-urn-subject-1-20e0fdca-5237-4095-a9e5-cea7d63866c0.txt
https://ndla.no/sitemap-urn-subject-1-b8a448f0-e251-41ea-af1c-b2fd62a89828.txt
https://ndla.no/sitemap-urn-subject-1-d4511941-a1fc-4336-bc80-0a05c534a182.txt
https://ndla.no/sitemap-urn-subject-1-962dd49d-72e8-4576-9efb-69d93a95402e.txt
https://ndla.no/sitemap-urn-subject-1-f7c5f36a-198d-4c38-a330-2957cf1a8325.txt
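
A minimal sketch of how listings like these could be turned into a list of page URLs to crawl. It assumes each `.txt` file is a plain-text sitemap with whitespace-separated URLs, which I have not verified for these files. The helper names `parse_sitemap_listing` and `fetch_listing` are hypothetical.

```python
from urllib.request import urlopen


def parse_sitemap_listing(text):
    """Return the unique http(s) URLs found in a sitemap listing, in order."""
    seen = set()
    urls = []
    for token in text.split():
        # Skip anything that is not a URL, and skip repeats.
        if token.startswith(("http://", "https://")) and token not in seen:
            seen.add(token)
            urls.append(token)
    return urls


def fetch_listing(url, timeout=10):
    """Download one sitemap listing and parse it (assumes UTF-8 text)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_sitemap_listing(response.read().decode("utf-8"))
```

Calling `fetch_listing` on each of the URLs above would give the candidate pages; a polite crawler would also add a delay between requests.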

eklem commented 2 years ago

Thank you, @gunnarvelle! I'll include the pages that have longer text and where I can be fairly sure the content is in Southern Sami only.
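
As a rough illustration of the "longer text" part, a sketch of a length filter. The function name `keep_longer_texts` and the 50-word default are arbitrary assumptions. It does not check the language at all, so that part still needs manual review.

```python
def keep_longer_texts(texts, min_words=50):
    """Keep only texts with at least min_words whitespace-separated words.

    Assumption: word count is a good enough proxy for "longer text".
    """
    return [text for text in texts if len(text.split()) >= min_words]
```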

eklem commented 2 years ago

Check Sami newspapers!

eklem commented 2 years ago

Depend on corpus-sma and corpus-smj

eklem commented 2 years ago

External library: corpus-smj-sma-json. It will be a new dependency.