JMSLab / LaroplanOCR

Swedish primary school curricula (Läroplaner för grundskolan) in digital format.
MIT License

We use optical character recognition (OCR) to transform curricula in image format into text. For each curriculum we construct datasets at the paragraph, sentence, and word levels.
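To make the three levels concrete, here is a minimal sketch of how OCR text could be split into paragraphs, sentences, and words. It is an illustration only and not the code used in this repository: the function name `split_levels`, the blank-line paragraph rule, and the punctuation-based sentence rule are all assumptions.

```python
import re


def split_levels(text):
    """Split raw text into one row per sentence, with paragraph and word info.

    Hypothetical helper: the splitting rules below are simplifying assumptions,
    not the rules used to build the datasets in this repository.
    """
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    rows = []
    for p_id, paragraph in enumerate(paragraphs, start=1):
        # Collapse line breaks and repeated spaces inside the paragraph.
        flat = " ".join(paragraph.split())
        # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]
        for s_id, sentence in enumerate(sentences, start=1):
            # \w matches Unicode letters, so Swedish å, ä, ö are kept.
            words = re.findall(r"\w+", sentence.lower())
            rows.append(
                {"paragraph": p_id, "sentence": s_id, "text": sentence, "words": words}
            )
    return rows
```

Each returned row carries its paragraph and sentence index, so the same list can be aggregated to any of the three levels.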

Using the Datasets

In ./analysis/output/ you will find the following for the Läroplan for year YYYY:

In the folder ./example/ we provide an illustration of how to use the data to search for a chosen set of words and plot their counts.
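As a rough idea of what such a word search involves (this is not the code in ./example/), the sketch below counts a set of target words in word lists grouped by curriculum year. The function names and the input layout, a dict mapping each year to a list of words, are assumptions; loading the actual word-level dataset is left out because its file names and columns are not described here.

```python
from collections import Counter


def count_target_words(words, targets):
    """Count case-insensitive occurrences of each target word in a word list."""
    targets = {t.lower() for t in targets}
    counts = Counter(w.lower() for w in words if w.lower() in targets)
    # Report every target, including those that never occur.
    return {t: counts.get(t, 0) for t in sorted(targets)}


def counts_by_year(words_by_year, targets):
    """Apply count_target_words to each year's word list, in year order.

    words_by_year is a hypothetical structure: {year: [word, word, ...]}.
    """
    return {
        year: count_target_words(words, targets)
        for year, words in sorted(words_by_year.items())
    }


if __name__ == "__main__":
    # Toy input for illustration only, not taken from the curricula.
    toy = {
        1: ["skolan", "ska", "ge", "kunskap"],
        2: ["kunskap", "och", "demokrati", "Kunskap"],
    }
    print(counts_by_year(toy, ["kunskap", "demokrati"]))
```

The resulting nested dict (year, then word, then count) can be passed to any plotting library to draw one line per word across years.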

Using the Code

Prerequisites

You may want to compile some or all of the code yourself. To do so, you need the following prerequisites.

Users may notice small differences in output across machines due to the way images are processed.

Repository structure

Each folder hosts an /output/ subfolder where output from each script is saved.

Quick start

  1. Clone the repository to your local machine.

    # Using SSH
    git lfs clone git@github.com:JMSLab/LaroplanOCR.git
    # Using HTTPS
    git lfs clone https://github.com/JMSLab/LaroplanOCR.git
  2. Install dependencies. From the root of the repo run:

    pip install -r requirements.txt
  3. To compile the entire project, open a command line at the root of the repo and run

    python run.py

    You may also run individual steps of the pipeline. For example, python derived/code/make_images.py converts the curricula from PDF files to JPG files.

Citations

Acknowledgments

We thank our dedicated research assistants for contributions to this project.