dpriskorn / odsc

Project that aims to sentenize all the open data of Riksdagen and other sources to create an easily linkable dataset of sentences that can be referred to from Wikidata lexemes and other resources
GNU General Public License v3.0
civic-tech entity-linking folketinget named-entity-recognition nlp part-of-speech-tagging riksdagen riksdagensoppnadata wikidata wikidata-lexemes

Open Data Sentence Corpora

This civic science project aims to analyze and sentenize all the open data of Riksdagen and other sources using spaCy to create an easily linkable dataset of sentences that can be referred to from Wikidata lexemes and other resources.

Such a dataset is highly valuable from a language perspective. The sentences carry information about what is going on in society, and they contain many words, phrases and idioms that are useful to anyone interested in the language. The roughly 600k documents to be analyzed contain a large amount of political dialogue and written documents from institutions of the Swedish state.

Keywords: NLP, data science, open data, Swedish, open government data, Riksdagen, Sweden, API

Author

Dennis Priskorn.

Idea

Use spaCy to create the first version. All sentences are run through language detection and given a UUID that is unique for each release.

As better sentenizing becomes available or Riksdagen improves its data over time, the hashes and UUIDs will change, but all released versions will be locked in time and can always be referred to consistently and reliably.

The resulting dataset is planned to be released on Zenodo and is expected to be around 1 TB.
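The idea of per-release identifiers can be sketched as follows. This is a minimal illustration, not the project's actual data model: the `SentenceRecord` class, the `hash_text` and `build_release` helpers, the choice of SHA-256 over whitespace-normalized text, and the release labels are all assumptions made for the example. The sentences are assumed to have already been produced by spaCy.

```python
import hashlib
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class SentenceRecord:
    """One sentence in one release (hypothetical structure for illustration)."""
    release: str
    uuid: str
    text_hash: str
    text: str


def hash_text(text: str) -> str:
    """Return the SHA-256 hex digest of the whitespace-normalized text.

    Assumption: the hash depends only on the sentence text, so it changes
    when the sentence boundaries or the source data change.
    """
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_release(release: str, sentences: list[str]) -> list[SentenceRecord]:
    """Give every sentence a fresh random UUID for this release."""
    return [
        SentenceRecord(
            release=release,
            uuid=str(uuid.uuid4()),
            text_hash=hash_text(sentence),
            text=sentence,
        )
        for sentence in sentences
    ]


# Two releases built from the same (already split) sentences.
sentences = ["Herr talman!", "Jag yrkar bifall till reservationen."]
v1 = build_release("v1", sentences)
v2 = build_release("v2", sentences)
```

In this sketch an unchanged sentence keeps its hash across releases while its UUID differs, so a reference to a UUID always points into exactly one frozen release.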

Features

Scope

This way of chopping up open data can be applied to any open data, provided that it is in a machine-readable form such as plain text, XML, JSON or HTML.
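As a rough sketch of what handling these formats could look like before sentenizing, the snippet below reduces each format to plain text using only the Python standard library. The `extract_text` function and its format labels are hypothetical and not part of this project. The JSON branch naively concatenates every string value, which a real pipeline would likely restrict to specific fields.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text from HTML, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def _json_strings(value):
    """Yield every string value found in a decoded JSON structure."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from _json_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from _json_strings(item)


def extract_text(raw: str, fmt: str) -> str:
    """Return whitespace-normalized plain text for one document."""
    fmt = fmt.lower()
    if fmt == "text":
        return " ".join(raw.split())
    if fmt == "html":
        collector = _TextCollector()
        collector.feed(raw)
        collector.close()
        return " ".join(collector.parts)
    if fmt == "xml":
        root = ET.fromstring(raw)
        return " ".join(" ".join(root.itertext()).split())
    if fmt == "json":
        strings = _json_strings(json.loads(raw))
        return " ".join(s.strip() for s in strings if s.strip())
    raise ValueError(f"unsupported format: {fmt}")
```

The resulting plain text is what would then be handed to the sentence splitter.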

Riksdagen has about 600k documents that can be downloaded as open data.

This project is a stepping stone to an even larger database of sentences and tokens that we can use to enrich the lexicographic data in Wikidata.

Statistics

See STATISTICS.md

Design

API design inspired by

Data model

Datamodel

UML source

Installation

Clone the repo

Run

$ pip install poetry && poetry install

Also download the spaCy model needed (about 250 MB)

$ python -m spacy download sv_core_news_lg

Now download some of the source datasets from Riksdagen and put them in a data/sv/ folder hierarchy.
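One way such a folder hierarchy could be traversed is sketched below. The assumption that the first directory level under `data/` is a language code (such as `sv`), the `find_documents` helper, and the set of file suffixes are illustrative guesses, not the layout the analyzer is confirmed to expect.

```python
from pathlib import Path

# Assumed set of file types; adjust to whatever the downloaded datasets contain.
SUPPORTED_SUFFIXES = {".txt", ".xml", ".json", ".html"}


def find_documents(data_root):
    """Yield (language_code, path) for every supported file under data_root.

    Assumes a layout like data/<language code>/<any nested folders>/<file>.
    """
    root = Path(data_root)
    for lang_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for path in sorted(lang_dir.rglob("*")):
            if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES:
                yield lang_dir.name, path
```

Calling `list(find_documents("data"))` after placing files under `data/sv/` would return pairs such as `("sv", Path("data/sv/..."))`, with the exact paths depending on which datasets were downloaded.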

Use

$ python riksdagen_analyzer --analyze

Sources

Mostly unilingual

Related corpora

Inspiration

Alice Zhao https://www.youtube.com/watch?v=8Fw1nh8lR54

Thanks

Thanks to Nicolas Vigneron and Asof Bartov for discussions about the needs of Luthor and how to make this project most suitable as a source of sentences used in usage examples on Wikidata lexemes.

License

GPLv3+

What I learned