Generate datasets - Githubissues

eklem / stopword-sami

Sami stopword lists for natural language processing. Examples on use could be search engines, machine learning and chatbots.

MIT License

1 star, 0 forks

Generate datasets #7

Closed eklem closed 2 years ago

eklem commented 2 years ago

https://se.wikipedia.org/wiki/Erenoam%C3%A1%C5%A1:Buot_siiddut?from=ADA_universitehta&to=&namespace=0 seems to contain too many stubs, so it's probably not a good source.

This means we need some other text sources for our datasets.

https://www.nrk.no/sapmi could be a good one. I just need to work out which three Sami languages are represented.

eklem commented 2 years ago

https://www.nrk.no/sapmi/samegillii/

First column - "Saernie - Åarjelsaemien" is "Sørsamisk" (South Sami). URL to crawl: https://www.nrk.no/sapmi/saernie---aarjelsaemien-1.13572943
Second column - "Ådåsa - Julevsábmáj" is "Lulesamisk" (Lule Sami). URL to crawl: https://www.nrk.no/sapmi/adasa---julevsabmaj-1.13572946
Third column - "Ođđasat - Davvisámegillii" is "Nordsamisk" (North Sami). URL to crawl: https://www.nrk.no/sapmi/o__asat---davvisamegillii-1.13572949
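The three sources above could be kept in one small lookup table so a crawler can loop over them. This is only a sketch: the name `NRK_SAPMI_SOURCES`, the helper `source_url` and the choice of ISO 639-3 codes as keys are assumptions for illustration, not something taken from the repo.

```python
# Keys are ISO 639-3 codes: sma = South Sami, smj = Lule Sami, sme = North Sami.
# The dict name and layout are hypothetical; labels and URLs are from the comment above.
NRK_SAPMI_SOURCES = {
    "sma": {
        "label": "Saernie - Åarjelsaemien",
        "url": "https://www.nrk.no/sapmi/saernie---aarjelsaemien-1.13572943",
    },
    "smj": {
        "label": "Ådåsa - Julevsábmáj",
        "url": "https://www.nrk.no/sapmi/adasa---julevsabmaj-1.13572946",
    },
    "sme": {
        "label": "Ođđasat - Davvisámegillii",
        "url": "https://www.nrk.no/sapmi/o__asat---davvisamegillii-1.13572949",
    },
}


def source_url(lang_code: str) -> str:
    """Return the crawl URL for a language code; raises KeyError if unknown."""
    return NRK_SAPMI_SOURCES[lang_code]["url"]


if __name__ == "__main__":
    for code, source in NRK_SAPMI_SOURCES.items():
        print(code, source["label"], source["url"])
```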

eklem commented 2 years ago

So, we need to create a crawler to get the content from these three pages. Try clicking "vis flere" ("show more") about 2000 times and then get the content of the page. Each click loads 5 article stubs, so that would be roughly 10,000 stubs per page.
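A rough sketch of that click loop, assuming Playwright's sync Python API ends up being the tool. The selector `text=Vis flere` is a guess at how the button could be located, and `expand_and_dump`, `expected_stubs` and `crawl` are hypothetical names. The loop only needs an object with `click()` and `content()`, so it can be tried against a fake page before touching the real site.

```python
from typing import Protocol


class PageLike(Protocol):
    """The small subset of Playwright's sync Page API that the loop uses."""

    def click(self, selector: str) -> None: ...
    def content(self) -> str: ...


STUBS_PER_CLICK = 5  # figure taken from the comment above
SHOW_MORE_SELECTOR = "text=Vis flere"  # assumption: button found by its visible text


def expand_and_dump(page: PageLike, clicks: int = 2000) -> str:
    """Click the "show more" button up to `clicks` times, then return the page HTML."""
    for _ in range(clicks):
        try:
            page.click(SHOW_MORE_SELECTOR)
        except Exception:
            # Button missing or click timed out: assume there is nothing more to load.
            break
    return page.content()


def expected_stubs(clicks: int) -> int:
    """Stubs loaded by `clicks` clicks, assuming 5 per click."""
    return clicks * STUBS_PER_CLICK


def crawl(url: str, clicks: int = 2000) -> str:
    """Open `url` in headless Chromium and return the expanded HTML."""
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        html = expand_and_dump(page, clicks)
        browser.close()
        return html
```

Stopping at the first failed click is a design choice for the sketch. A real crawler would probably also want a short pause between clicks and a check that new stubs actually appeared.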

eklem commented 2 years ago

Check if Playwright is the right tool.

eklem commented 2 years ago

From version 0.0.3 of nrk-sapmi-crawler, I can fetch JSON files with article IDs. Set it up for South Sami, Lule Sami and North Sami, and do a re-crawl every now and then. Let the data gathering begin 😄
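For the periodic re-crawl, the newly fetched IDs have to be merged with the ones already stored so nothing is duplicated. A minimal sketch, assuming each JSON file is a flat list of article ID strings. That file layout is an assumption, not the documented output format of nrk-sapmi-crawler, and `merge_article_ids` and `update_id_file` are hypothetical helpers.

```python
import json
from pathlib import Path


def merge_article_ids(existing: list[str], fetched: list[str]) -> list[str]:
    """Append IDs from `fetched` that are not already in `existing`, keeping order."""
    seen = set(existing)
    merged = list(existing)
    for article_id in fetched:
        if article_id not in seen:
            seen.add(article_id)
            merged.append(article_id)
    return merged


def update_id_file(path: Path, fetched: list[str]) -> int:
    """Merge `fetched` into the JSON list at `path`; return how many IDs were new."""
    # Assumed layout: the file holds a plain JSON array of ID strings.
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    merged = merge_article_ids(existing, fetched)
    path.write_text(json.dumps(merged, indent=2), encoding="utf-8")
    return len(merged) - len(existing)
```

Running `update_id_file` once per language after each crawl would keep one growing ID list per language.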

eklem commented 2 years ago

https://github.com/eklem/stopword-sami/commit/cdd59940957d4bee225b49e7acf73aabd2b4defa