eklem / stopword-sami

Sami stopword lists for natural language processing. Example uses include search engines, machine learning and chatbots.
MIT License

stopword-sami

What

WIP! Project to generate stopword lists for all the Sami languages:

Grant from the Sami Parliament

The Sami Parliament is financially supporting the project. Hooray! This will make it possible to finish the project.

Sámediggi / Sámedigge / Saemiedigkie (The Sami Parliament)

Other Sami languages

These are not planned as of now, but could be if we find text sources and someone to help us verify the lists.

When the quality of the stopword lists is good enough, they will be added to the stopword module. Northern Sami will most likely be the first to reach that quality, followed by Lule Sami and South Sami.

Why stopword lists for Sami languages?

For example, to be able to build good search engines or do machine learning on content written in the different Sami languages.
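
As a rough illustration of how a stopword list is used in a search or machine learning pipeline, the sketch below filters very common words out of a tokenized sentence. The names `exampleStopwords`, `tokenize` and `removeStopwords` are hypothetical, and the four words in the set are hand-picked Northern Sami function words for illustration only; they are not taken from the lists this project generates.

```javascript
// Hypothetical, hand-picked Northern Sami function words, for illustration only.
// They are NOT taken from this project's generated lists.
const exampleStopwords = new Set(['ja', 'lea', 'dat', 'go']);

// Lowercase the text and split on anything that is not a Unicode letter,
// so characters such as á, š and đ stay inside words.
function tokenize(text) {
  return text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);
}

// Keep only the tokens that are not in the stopword set.
function removeStopwords(tokens, stopwords) {
  return tokens.filter((token) => !stopwords.has(token));
}

const tokens = tokenize('Dat lea stuorra viessu ja dat lea ruoksat.');
console.log(removeStopwords(tokens, exampleStopwords));
// prints [ 'stuorra', 'viessu', 'ruoksat' ]
```

What remains are the content-bearing words, which is typically what a search index or a classifier should focus on.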

Install

If you can avoid crawling and just use the content from this repo, please do. That means less unnecessary traffic on nrk.no. The content is included in this repo, is updated every month (or more often if you need it), and is published to npm.

npm install stopword-sami

To crawl and calculate

To get more content, you first have to get more IDs. So run the crawlIds command first, then the crawlContent command, and finally the calcStopwords command.

npm run crawlIds && npm run crawlContent && npm run calcStopwords
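
How calcStopwords ranks words is not described here. As a minimal sketch of one common approach, and not necessarily what this project does, the example below treats words that occur in a large share of the documents as stopword candidates. The function names, the `minShare` threshold and the placeholder documents are all assumptions made for illustration.

```javascript
// Assumption: each crawled document is available as a plain string.
// Split on anything that is not a Unicode letter and lowercase the result.
function splitWords(text) {
  return text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);
}

// Count in how many documents each word occurs (document frequency).
function documentFrequencies(documents) {
  const counts = new Map();
  for (const doc of documents) {
    // A Set makes sure a word is counted once per document.
    for (const word of new Set(splitWords(doc))) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  return counts;
}

// Words found in at least `minShare` of all documents become candidates,
// sorted by document frequency (highest first), then alphabetically.
function stopwordCandidates(documents, minShare) {
  const counts = documentFrequencies(documents);
  return [...counts.entries()]
    .filter(([, count]) => count / documents.length >= minShare)
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .map(([word]) => word);
}

// Placeholder documents, not real crawled content.
const docs = ['a b c', 'a b d', 'a e f'];
console.log(stopwordCandidates(docs, 0.6));
// prints [ 'a', 'b' ]
```

A frequency-based list like this is only a starting point; the candidates still need to be verified by someone who knows the language.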

Work ahead

Help needed

We need help verifying the generated lists, and help understanding the particular traits of the different Sami languages when that time comes.

Also, to generate/train stopword lists, we need text sources. For Northern Sami we will get what we need, but for Lule Sami and South Sami the material is a little thin. Maybe we just have to wait for NRK to create more content. For the rest of the languages, we have no source so far. If you know of a dataset, or a source that could be used to generate one, please give us a hint!

Applications: Markdown to Word/PDF conversion

So far, Pandoc has worked well:

pandoc application-draft-02.md -f markdown -s -o application-draft.docx