Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE
Christian Møller Dahl, Torben Johansen, Christian Vedel
University of Southern Denmark
Welcome to the GitHub repository for OccCANINE, a tool designed to transform occupational descriptions into standardized HISCO (Historical International Standard Classification of Occupations) codes automatically. Developed by Christian Møller Dahl, Torben Johansen and Christian Vedel from the University of Southern Denmark, this tool leverages the power of a finetuned language model to process and classify occupational descriptions with high accuracy, precision, and recall.
Paper: https://arxiv.org/abs/2402.13604
Huggingface: Christianvel/OccCANINE
Slides: Breaking the HISCO Barrier
How to use OccCANINE: YouTube video
How to cite
> Dahl, C. M., Johansen, T., Vedel, C. (2024). Breaking the HISCO Barrier: Automatic Occupational Standardization with *OccCANINE*. [arxiv.org/abs/2402.13604](https://arxiv.org/abs/2402.13604)
```bibtex
@misc{OccC2024breaking,
title={Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE},
author={Christian Møller Dahl and Torben Johansen and Christian Vedel},
year={2024},
eprint={2402.13604},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2402.13604}
}
```
Getting started
- See the Colab notebook for a demonstration of OccCANINE.
- A step-by-step installation guide can be found in GETTING_STARTED.md
- Run
python predict.py --fn-in path/to/input/data.csv --col occ1 --fn-out path/to/output/data.csv --language [lang]
in the command line to get HISCO codes for all the descriptions found in the occ1
column of the input data. See predict.py for details.
- For a simple script that reads data and uses OccCANINE to obtain HISCO codes, see PREDICT_HISCOs.py.
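The command above can also be assembled from a Python script, for example when several files need to be processed in turn. The following is a minimal sketch, not part of the repository: the flag names come from the command above, while the helper name `build_predict_command`, the file names, and the language value `"en"` are illustrative assumptions. The sketch only builds the argument list; actually running it requires predict.py to be in the working directory.

```python
import sys


def build_predict_command(fn_in, fn_out, col="occ1", language="en"):
    """Return the argument list for predict.py (hypothetical helper)."""
    return [
        sys.executable, "predict.py",
        "--fn-in", fn_in,
        "--col", col,
        "--fn-out", fn_out,
        "--language", language,  # "en" is an assumed example value
    ]


cmd = build_predict_command("input.csv", "output.csv")
print(" ".join(cmd[1:]))

# To execute it from the repository root:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file paths contain spaces.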
Overview
This repository provides everything needed to generate automatic HISCO codes from occupational descriptions using OccCANINE. It also provides replication files for all steps from raw training data to the final trained model.
Structure
- Data_cleaning_scripts: Contains R scripts for processing raw data from 'Data/Raw_data' into a format suitable for training, which is then stored in 'Data/Training_data', 'Data/Validation_data', and 'Data/Test_data'.
- histocc: Contains Python scripts for training OccCANINE and using the already finetuned version of it.
- Model_evaluation_scripts: Contains a mix of R and Python scripts that generate the model evaluation statistics and plots found in the associated paper.
histocc folder
The histocc folder contains all the code used for training and application of OccCANINE.
- Data/: Contains 'key.csv', which maps integer codes (generated by OccCANINE) to HISCO codes based on the definitions in https://github.com/cedarfoundation/hisco. It also contains toy data to use when trying out OccCANINE for the first time.
- model_assets.py: Defines the underlying PyTorch model.
- attacker.py: Defines text attack procedure used for text augmentation in training.
- trainer.py: Defines training procedures.
- dataLoader.py: Defines how data is loaded and fed to the model in training.
- prediction_assets.py: Functions and classes to use OccCANINE. This also contains the 'OccCANINE' class, which serves as the main user interface in most cases.
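To illustrate the role of 'key.csv', the sketch below shows how integer codes could be translated back to HISCO codes with a plain lookup table. This is an assumption-laden example, not the repository's code: the column names `code` and `hisco`, the rows, and the helper `load_key` are all hypothetical stand-ins for the real file. HISCO codes are kept as strings here so that any leading zeros survive.

```python
import csv
import io

# Stand-in for Data/key.csv; the column names and rows are assumptions.
KEY_CSV = """code,hisco
0,61110
1,79100
2,02120
"""


def load_key(handle):
    """Map integer model codes to HISCO code strings (hypothetical helper)."""
    return {int(row["code"]): row["hisco"] for row in csv.DictReader(handle)}


key = load_key(io.StringIO(KEY_CSV))

# Pretend these integer codes came out of the model.
predicted = [2, 0]
hisco_codes = [key[c] for c in predicted]
print(hisco_codes)
```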
Model_evaluation_scripts folder
The Model_evaluation_scripts folder contains all the code used to generate the model evaluation results shown in the paper.
Python scripts
- n001_Predict_eval.py: Runs predictions on 1 million validation observations.
- n002_Copenhagen_burial.py: Runs predictions on 200 observations from the Copenhagen Burial Records from Link Lives.
- n003_Training_ship_data.py: Runs predictions on 200 observations of parents' occupations from the Indefatigable training ship.
- n004_Dutch_familiegeld.py: Runs predictions on 200 observations of occupations in the Dutch familiegeld.
- n005_Swedish_strikes.py: Runs predictions on 200 observations of professions recorded for Swedish strikes.
R scripts
- 000_Functions.R: Contains functions used in evaluation.
- 001_Generate_eval_stats.R: Generates accuracy, precision, etc. for validation data across various subgroups.
- 002_Nature_of_mistakes.R: Produces plots and statistics that give insight into the nature of the mistakes made when OccCANINE disagrees with the validation data.
- 101_Eval_illustrations.R: Generates most of the illustrations and statistics shown in the paper.
- 102_Embeddings_visualisation.R: Generates the t-SNE illustrations of the embeddings.
Data Cleaning
Scripts for data cleaning are located in 'Data_cleaning_scripts' and should be run in numeric order as indicated by the script names.
- 000_Function.R: Contains functions shared across all data cleaning scripts.
- 001_Assets_for_cleaning.R: Generates assets for data cleaning, such as the mapping from HISCO codes to integer codes 1 to N.
- 00[x]_...R (where x>1): Cleans individual data sources, saving one file 'Clean_....Rdata' for each source.
- 101_Train_test_val_split.R: Ensures consistency and saves training, validation, and test data.
Data Cleaning Process
- Sanity Check: Manual verification of data content and consistency.
- Extracting Relevant Data: Extraction of relevant variables, keeping both 'raw' and 'cleaned' occupational descriptions as separate observations when available.
- Combinations: Synthetic creation of descriptions representing more than one occupation by combining descriptions with the respective language's word for 'and'.
- Filtering: Removal of observations with invalid HISCO codes based on the 'hisco' R library.
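The combination and filtering steps can be sketched as follows. The repository implements the cleaning in R and validates codes against the 'hisco' R library; this Python sketch only illustrates the idea, and every name and value in it (the records, `VALID_CODES`, `AND_WORD`, and the choice to filter before combining) is an assumption made for the example.

```python
from itertools import permutations

# Hypothetical records: (description, HISCO code, language).
records = [
    ("farmer", "61110", "en"),
    ("tailor", "79100", "en"),
    ("smith", "99999x", "en"),  # deliberately invalid code
]

# Assumed stand-ins: a set of valid codes and each language's word for 'and'.
VALID_CODES = {"61110", "79100"}
AND_WORD = {"en": "and", "da": "og"}

# Filtering: keep only observations whose code is in the valid set.
clean = [r for r in records if r[1] in VALID_CODES]

# Combinations: join two same-language descriptions and keep both codes.
combined = [
    (f"{a[0]} {AND_WORD[a[2]]} {b[0]}", [a[1], b[1]], a[2])
    for a, b in permutations(clean, 2)
    if a[2] == b[2]
]

for description, codes, lang in combined:
    print(description, codes, lang)
```

Using ordered pairs means both "farmer and tailor" and "tailor and farmer" are generated, so a model trained on the result sees each occupation in either position.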
Structure of Training Data
The training data is structured with variables including the year of observation, a unique ID for every observation, the occupational description string, HISCO codes, integer codes for HISCO, the language of the description, and a string indicating the data split (train, val1, val2, or test).
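As a rough illustration of this layout, one row of the training data could be modelled like this. The field names and example values are hypothetical; only the list of variables and the four split labels come from the description above.

```python
from dataclasses import dataclass

# Split labels as listed in the description of the training data.
SPLITS = {"train", "val1", "val2", "test"}


@dataclass
class TrainingRow:
    """One observation; field names are assumptions, not the real columns."""
    year: int
    row_id: str
    occ_description: str
    hisco_codes: list
    int_codes: list
    lang: str
    split: str

    def __post_init__(self):
        # Reject rows whose split label is not one of the four known values.
        if self.split not in SPLITS:
            raise ValueError(f"unknown split: {self.split!r}")


row = TrainingRow(1880, "a1", "farmer", ["61110"], [0], "en", "train")
print(row)
```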