Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE

Christian Møller Dahl, Torben Johansen, Christian Vedel, University of Southern Denmark

Welcome to the GitHub repository for OccCANINE, a tool that automatically converts occupational descriptions into standardized HISCO (Historical International Standard Classification of Occupations) codes. Developed by Christian Møller Dahl, Torben Johansen and Christian Vedel from the University of Southern Denmark, it uses a finetuned language model to classify occupational descriptions with high accuracy, precision, and recall.

Paper: https://arxiv.org/abs/2402.13604

Huggingface: Christianvel/OccCANINE

Slides: Breaking the HISCO Barrier

How to use OccCANINE: YouTube video

How to cite

> Dahl, C. M., Johansen, T., Vedel, C. (2024). Breaking the HISCO Barrier: Automatic Occupational Standardization with *OccCANINE*. [arxiv.org/abs/2402.13604](https://arxiv.org/abs/2402.13604)

```bibtex
@misc{OccC2024breaking,
  title={Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE},
  author={Christian Møller Dahl and Torben Johansen and Christian Vedel},
  year={2024},
  eprint={2402.13604},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2402.13604}
}
```

Getting started

See the Colab notebook for a demonstration of OccCANINE.
A step-by-step installation guide can be found in GETTING_STARTED.md.
Run `python predict.py --fn-in path/to/input/data.csv --col occ1 --fn-out path/to/output/data.csv --language [lang]` in the command line to get HISCO codes for all the descriptions found in the occ1 column of the input data. See predict.py for details.
For a simple script that reads data and uses OccCANINE to obtain HISCO codes, see PREDICT_HISCOs.py.
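The command-line call above can also be assembled from a script, for example when several files need to be coded in a batch. The sketch below only builds the argument list from the four flags named above. The helper name `build_predict_command` and the file paths are hypothetical, and the language value "en" is an assumption about which values predict.py accepts.

```python
import sys

def build_predict_command(fn_in, col, fn_out, language, script="predict.py"):
    """Return the argument list for one predict.py run (hypothetical helper)."""
    return [
        sys.executable, script,
        "--fn-in", fn_in,
        "--col", col,
        "--fn-out", fn_out,
        "--language", language,  # "en" below is an assumed value
    ]

cmd = build_predict_command(
    "path/to/input/data.csv", "occ1", "path/to/output/data.csv", "en"
)
print(" ".join(cmd[1:]))

# To actually run it from the repository root:
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.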

Overview

This repository provides everything needed to generate automatic HISCO codes from occupational descriptions using OccCANINE. It also provides replication files for all steps from raw training data to the final trained model.

Structure

Data_cleaning_scripts: Contains R scripts for processing raw data from 'Data/Raw_data' into a format suitable for training, which is then stored in 'Data/Training_data', 'Data/Validation_data', and 'Data/Test_data'.
histocc: Contains Python scripts for training OccCANINE and using the already finetuned version of it.
Model_evaluation_scripts: Contains a mix of R and Python scripts that generate the model evaluation statistics and plots found in the associated paper.

histocc folder

The histocc folder contains all the code used for training and application of OccCANINE.

Data/: Contains 'key.csv', which maps integer codes (generated by OccCANINE) to HISCO codes based on the definitions from https://github.com/cedarfoundation/hisco. It also contains toy data to use when trying out OccCANINE for the first time.
model_assets.py: Defines the underlying PyTorch model.
attacker.py: Defines text attack procedure used for text augmentation in training.
trainer.py: Defines training procedures.
dataLoader.py: Defines how data is loaded and fed to the model in training.
prediction_assets.py: Functions and classes to use OccCANINE. This also contains the 'OccCANINE' class, which serves as the main user interface in most cases.
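To illustrate the role of 'key.csv', the sketch below maps integer codes back to HISCO codes with a lookup table. The column names `code` and `hisco` and the three rows are assumptions made for this example (toy values, not the contents of the real file), so check the header of 'key.csv' before adapting it.

```python
import csv
import io

# Toy stand-in for Data/key.csv; column names and rows are assumptions.
TOY_KEY_CSV = """code,hisco
0,-1
1,61110
2,79100
"""

def load_key(handle):
    """Read a key table into a dict {integer code: HISCO code string}."""
    return {int(row["code"]): row["hisco"] for row in csv.DictReader(handle)}

def decode(int_codes, key):
    """Translate integer codes to HISCO codes; unknown codes become None."""
    return [key.get(c) for c in int_codes]

key = load_key(io.StringIO(TOY_KEY_CSV))
print(decode([1, 2, 99], key))
```

HISCO codes are kept as strings here so that codes with leading zeros are not altered by integer conversion.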

Model_evaluation_scripts folder

The Model_evaluation_scripts folder contains all the code used to generate the model evaluation results shown in the paper.

Python scripts

n001_Predict_eval.py: Runs predictions on 1 million validation observations.
n002_Copenhagen_burial.py: Runs predictions on 200 observations from the Copenhagen Burial Records from Link Lives.
n003_Training_ship_data.py: Runs predictions on 200 observations of parents' occupations from the Indefatigable training ship.
n004_Dutch_familiegeld.py: Runs predictions on 200 observations of occupations in the Dutch familiegeld.
n005_Swedish_strikes.py: Runs predictions on 200 observations of professions from Swedish strikes.

R scripts

000_Functions.R: Contains functions used in evaluation.
001_Generate_eval_stats.R: Generates accuracy, precision, etc. for validation data across various subgroups.
002_Nature_of_mistakes.R: Produces plots and statistics that give insight into the nature of the mistakes made when OccCANINE disagrees with the validation data.
101_Eval_illustrations.R: Generates most of the illustrations and statistics shown in the paper.
102_Embeddings_visualisation.R: Generates the t-SNE illustrations of the embeddings.
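As a rough illustration of the kind of statistics named above, the sketch below computes observation-level accuracy, precision, and recall when each description may carry more than one HISCO code. The definitions used here (exact set match for accuracy, and per-observation set overlap averaged over observations for precision and recall) are assumptions for this example and may differ from those used in the R scripts and the paper. The codes are toy values.

```python
def evaluate(predicted, actual):
    """predicted/actual: lists of sets of HISCO code strings, one set per observation."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    n = len(actual)
    acc = prec = rec = 0.0
    for p, a in zip(predicted, actual):
        hits = len(p & a)  # codes that are both predicted and correct
        acc += float(p == a)
        # An empty prediction or label set scores 1 only if both are empty.
        prec += hits / len(p) if p else float(not a)
        rec += hits / len(a) if a else float(not p)
    return {"accuracy": acc / n, "precision": prec / n, "recall": rec / n}

pred = [{"61110"}, {"61110", "79100"}, {"79100"}]
true = [{"61110"}, {"61110"}, {"54010"}]
result = evaluate(pred, true)
print(result)
```

In this toy example the second observation is penalised on precision (one spurious code) but not on recall, which is the distinction the subgroup statistics are meant to expose.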

Data Cleaning

Scripts for data cleaning are located in 'Data_cleaning_scripts' and should be run in numeric order as indicated by the script names.

000_Function.R: Contains functions shared across all data cleaning scripts.
001_Assets_for_cleaning.R: Generates assets for data cleaning, such as the mapping from HISCO codes to a 1-to-N integer encoding.
00[x]_...R (where x>1): Cleans individual data sources, saving one file 'Clean_....Rdata' for each source.
101_Train_test_val_split.R: Ensures consistency and saves training, validation, and test data.
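The 1 to N encoding generated in 001_Assets_for_cleaning.R can be pictured as giving every distinct HISCO code a consecutive integer. A minimal Python sketch of that idea follows. The sort order and the function name are assumptions, and the repository's R script remains the authoritative implementation.

```python
def build_encoding(hisco_codes):
    """Assign integers 1..N to the distinct HISCO codes, in sorted order."""
    distinct = sorted(set(hisco_codes))
    return {code: i for i, code in enumerate(distinct, start=1)}

# Toy input with a repeated code; codes are strings to preserve leading zeros.
encoding = build_encoding(["79100", "61110", "61110", "02120"])
print(encoding)
```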

Data Cleaning Process

Sanity Check: Manual verification of data content and consistency.
Extracting Relevant Data: Extraction of relevant variables, keeping both 'raw' and 'cleaned' occupational descriptions as separate observations when available.
Combinations: Synthetic creation of descriptions representing more than one occupation by combining descriptions with the respective language's word for 'and'.
Filtering: Removal of observations with invalid HISCO codes based on the 'hisco' R library.
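The 'Combinations' and 'Filtering' steps can be sketched as follows. This is a Python illustration of the idea rather than the repository's R code: the language codes, the `AND_WORDS` table, and the `VALID_HISCO` set are assumptions standing in for the languages in the data and for the code list from the 'hisco' R library.

```python
# Assumed language codes mapped to each language's word for 'and'.
AND_WORDS = {"en": "and", "da": "og", "nl": "en", "se": "och"}
# Toy stand-in for the list of valid HISCO codes.
VALID_HISCO = {"61110", "79100"}

def combine(obs_a, obs_b, lang):
    """Join two (description, codes) observations into one multi-occupation observation."""
    text = f"{obs_a[0]} {AND_WORDS[lang]} {obs_b[0]}"
    return text, obs_a[1] + obs_b[1]

def has_valid_codes(obs):
    """Keep an observation only if every one of its HISCO codes is valid."""
    return all(code in VALID_HISCO for code in obs[1])

farmer = ("farmer", ["61110"])
tailor = ("tailor", ["79100"])
bogus = ("wizard", ["99999x"])  # deliberately invalid code

combined = combine(farmer, tailor, "en")
kept = [o for o in [farmer, tailor, bogus, combined] if has_valid_codes(o)]
print(combined)
print(len(kept))
```

The combined observation carries the codes of both source descriptions, so a model trained on it sees examples with more than one correct label.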

Structure of Training Data

The training data is structured with variables including the year of observation, a unique ID for every observation, the occupational description string, HISCO codes, integer codes for HISCO, the language of the description, and a string indicating the data split (train, val1, val2, or test).
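A record with the variables listed above might look like the following. The field names are hypothetical (the actual column names in the training files may differ), and the two rows are toy values. Only the set of variables and the four split labels come from the description above.

```python
from collections import Counter
from dataclasses import dataclass

SPLITS = {"train", "val1", "val2", "test"}

@dataclass
class Observation:
    year: int      # year of observation
    obs_id: str    # unique ID for the observation
    occ: str       # occupational description string
    hisco: list    # HISCO codes, as strings
    codes: list    # integer codes for HISCO
    lang: str      # language of the description
    split: str     # one of SPLITS

    def __post_init__(self):
        if self.split not in SPLITS:
            raise ValueError(f"unknown split: {self.split}")

rows = [
    Observation(1880, "a1", "farmer", ["61110"], [2], "en", "train"),
    Observation(1880, "a2", "tailor", ["79100"], [3], "en", "val1"),
]
print(Counter(r.split for r in rows))
```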
