christianvedels / OccCANINE

A method for automatically converting occupational descriptions into HISCO codes
Apache License 2.0

Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE

Christian Møller Dahl, Torben Johansen, Christian Vedel, University of Southern Denmark


Welcome to the GitHub repository for OccCANINE, a tool that automatically converts occupational descriptions into standardized HISCO (Historical International Standard Classification of Occupations) codes. Developed by Christian Møller Dahl, Torben Johansen and Christian Vedel from the University of Southern Denmark, it uses a fine-tuned language model to classify occupational descriptions with high accuracy, precision, and recall.

Paper: https://arxiv.org/abs/2402.13604

Huggingface: Christianvel/OccCANINE

Slides: Breaking the HISCO Barrier

How to use OccCANINE: YouTube video

How to cite

> Dahl, C. M., Johansen, T., Vedel, C. (2024). Breaking the HISCO Barrier: Automatic Occupational Standardization with *OccCANINE*. [arxiv.org/abs/2402.13604](https://arxiv.org/abs/2402.13604)

```bibtex
@misc{OccC2024breaking,
  title={Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE},
  author={Christian Møller Dahl and Torben Johansen and Christian Vedel},
  year={2024},
  eprint={2402.13604},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2402.13604}
}
```

Getting started

Overview

This repository provides everything needed to generate automatic HISCO codes from occupational descriptions using OccCANINE. It also provides replication files for all steps from raw training data to the final trained model.

Structure

histocc folder

The histocc folder contains all the code used for training and application of OccCANINE.

Model_evaluation_scripts folder

The Model_evaluation_scripts folder contains all the code used to generate the model evaluation results shown in the paper.

Python scripts

R scripts

Data Cleaning

Scripts for data cleaning are located in 'Data_cleaning_scripts' and should be run in numeric order as indicated by the script names.
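
A minimal sketch of what "numeric order" means in practice, assuming the script names start with an integer prefix. The file names below are hypothetical and only illustrate why a plain alphabetical sort can give the wrong order when prefixes are not zero-padded.

```python
import re
from pathlib import Path


def numeric_order(script_names):
    """Sort script file names by their leading integer prefix."""
    def key(name):
        match = re.match(r"(\d+)", Path(name).name)
        # Names without a numeric prefix are placed last, alphabetically.
        if match:
            return (0, int(match.group(1)), name)
        return (1, 0, name)

    return sorted(script_names, key=key)


# Hypothetical file names, not the actual scripts in the repository.
scripts = ["10_Combine.R", "2_Extract.R", "1_Sanity_check.R"]
ordered = numeric_order(scripts)
print(ordered)
```

Sorting on the parsed integer rather than the raw string keeps a script numbered 10 after the one numbered 2.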

Data Cleaning Process

  1. Sanity Check: Manual verification of data content and consistency.
  2. Extracting Relevant Data: Extraction of relevant variables, keeping both 'raw' and 'cleaned' occupational descriptions as separate observations when available.
  3. Combinations: Synthetic creation of descriptions representing more than one occupation by combining descriptions with the respective language's word for 'and'.
  4. Filtering: Removal of observations with invalid HISCO codes based on the 'hisco' R library.
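
Steps 3 and 4 above can be sketched as follows. This is an illustration under stated assumptions, not the repository's cleaning code: the record layout, the `AND_WORDS` mapping, and the function names are hypothetical, and the set of valid codes is a small stand-in for the lookup that the 'hisco' R library provides.

```python
# Assumed mapping from a language code to that language's word for 'and'.
AND_WORDS = {"en": "and", "da": "og", "de": "und", "fr": "et"}


def combine(record_a, record_b):
    """Join two single-occupation records of the same language into one."""
    if record_a["lang"] != record_b["lang"]:
        raise ValueError("records must share a language")
    joiner = AND_WORDS[record_a["lang"]]
    return {
        "occ": f'{record_a["occ"]} {joiner} {record_b["occ"]}',
        # The combined description carries the codes of both occupations.
        "hisco": record_a["hisco"] + record_b["hisco"],
        "lang": record_a["lang"],
    }


def filter_valid(records, valid_codes):
    """Keep only records whose HISCO codes are all in the valid set."""
    return [r for r in records if all(c in valid_codes for c in r["hisco"])]


# Illustrative records; the codes are example values only.
farmer = {"occ": "farmer", "hisco": ["61110"], "lang": "en"}
fisherman = {"occ": "fisherman", "hisco": ["64100"], "lang": "en"}
broken = {"occ": "unknown", "hisco": ["not-a-code"], "lang": "en"}

combined = combine(farmer, fisherman)
valid_codes = {"61110", "64100"}  # stand-in for the full list of valid codes
kept = filter_valid([farmer, fisherman, combined, broken], valid_codes)
print(combined["occ"], len(kept))
```

Filtering after combining means a synthetic description is dropped whenever either of its source codes is invalid.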

Structure of Training Data

The training data is structured with variables including the year of observation, a unique ID for every observation, the occupational description string, HISCO codes, integer codes for HISCO, the language of the description, and a string indicating the data split (train, val1, val2, or test).
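
One way to picture a single observation is as a small typed record. The sketch below is an assumption-laden illustration: the field names and the example values are hypothetical and may not match the column names used in the actual data files. Only the list of variables and the four split labels come from the description above.

```python
from dataclasses import dataclass

# The four split labels named in the description above.
SPLITS = {"train", "val1", "val2", "test"}


@dataclass(frozen=True)
class TrainingRow:
    """One observation of the training data (field names are hypothetical)."""
    year: int      # year of observation
    row_id: str    # unique ID for the observation
    occ: str       # occupational description string
    hisco: tuple   # HISCO codes, kept as strings
    codes: tuple   # integer codes corresponding to the HISCO codes
    lang: str      # language of the description
    split: str     # one of SPLITS

    def __post_init__(self):
        if self.split not in SPLITS:
            raise ValueError(f"unknown split: {self.split!r}")
        if len(self.hisco) != len(self.codes):
            raise ValueError("hisco and codes must have the same length")


# Example values are placeholders, not taken from the real data.
row = TrainingRow(
    year=1880,
    row_id="example-0001",
    occ="farmer and fisherman",
    hisco=("61110", "64100"),
    codes=(101, 202),
    lang="en",
    split="train",
)
print(row.split, len(row.hisco))
```

Validating the split label and the code counts on construction catches malformed rows before they reach training.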