christianvedels / OccCANINE

A method for automatically converting occupational descriptions into HISCO codes
Apache License 2.0

Allocating wives and widows with their husbands' occupation (language: se) #111

Open suvihe opened 2 months ago

suvihe commented 2 months ago

OccCANINE matches titles containing the words "widow" (änka) and "wife" (fru) to the husband's occupation, if it is given. Depending on the research question, this can become an issue (e.g., it inflates employment in certain occupations and inflates female labor force participation).

The problem is specifically with the Swedish language model, as in the Swedish context it was relatively common for women to list themselves with titles such as "captain's wife". For example, "advokatsänka" (translation: lawyer's widow) is matched to 12110 both in IPUMS and by OccCANINE.

There are a large number of observations where a widow or a wife has also given her own occupation, though the order in which her own and her husband's occupations are listed is not consistent. Here OccCANINE performs better than the IPUMS coding, as OccCANINE returns two or more HISCO codes. Still, it requires the researcher to go through these cases manually.

The manual work left after using OccCANINE is still quite overwhelming. For example, I have over 12,000 unique observations with the word "änka" in the occupation string.

My question is: Is it possible for OccCANINE to account for titles with widow or wife within them?

Regarding the solution, there are some titles that, despite including the word "änka" or "fru", are legitimate occupations:

  - Kallskänka (garde manger)
  - Hjälpfru (housewife's maid)
  - Frukthandlare (fruit seller)
  - Jungfru (depending on the context, either a young unmarried woman or female servant)
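To illustrate why a plain keyword search is not enough, here is a minimal rule-based sketch that flags strings containing a word ending in "änka" or "fru" while skipping the legitimate titles named above. It is not part of OccCANINE; the function name, the allow-list and the regular expression are all assumptions made for illustration, and old spellings or other spouse terms are not covered.

```python
import re
import unicodedata

# Hypothetical allow-list: titles named above that end in "änka" or "fru"
# but are occupations in their own right. Extend it for your own data.
LEGITIMATE_TITLES = ("kallskänka", "hjälpfru", "jungfru")

# Matches a whole word ending in "änka" or "fru", e.g. "advokatsänka".
# "frukthandlare" never matches because "fru" is not at the end of the word.
SPOUSE_SUFFIX = re.compile(r"\b\w*(?:änka|fru)\b")


def is_candidate_spouse_title(occupation: str) -> bool:
    """Return True if any word looks like a 'wife of' / 'widow of' title."""
    # Normalise so that "ä" typed as "a" + combining diaeresis still matches.
    text = unicodedata.normalize("NFC", occupation).lower()
    for match in SPOUSE_SUFFIX.finditer(text):
        if match.group(0) not in LEGITIMATE_TITLES:
            return True
    return False
```

A filter like this only narrows down which rows need attention. Strings that list both the woman's own occupation and her husband's still need a decision about which one to keep.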

christianvedels commented 2 months ago

This is a data problem. OccCANINE can only classify occupations if the input strings describe the person's occupation, or the occupation that should be ascribed to them. However, OccCANINE might still be useful in solving this data cleaning problem, and we have included the OccCANINE.finetune() method for exactly this type of case.

The problem of what occupation to assign widows does not have a trivial answer. In some applications "captain's wife" should indeed be coded as a captain. In other applications, this leads to bias as described. The same applies to people who are retired: if you want to measure the social status of an individual, it might be sensible to give them their earlier occupation rather than bunching everyone into the label 'retired'.

The default behavior among researchers seems to be to include such occupational descriptions. As such, this is the behavior that OccCANINE has picked up in training. Nevertheless, OccCANINE can also help with solving the problem via fine-tuning. We describe how fine-tuning works in general in this colab notebook.

For this specific issue we suggest one of the two following options:

  1. Finetuning a version of OccCANINE to do the binary classification 'wife of' / 'real occupation' (instead of HISCO codes).
  2. General finetuning of a version of OccCANINE with your own data.

We believe that option 1 is likely to be the most efficient, which is why we describe it in full. Option 2 involves assigning '-1' to all the 'wife of' cases, and then running the finetuning procedure as described in the notebook.

Option 1: OccCANINE understands occupations, and as such is a natural starting point for a model that can solve this other classification problem of 'wife of' versus 'real occupation'.

Step 1 (finetuning data): Take N random observations from your intended data and manually label each one as 'wife of' or not. In practice this can be done with a binary (0 or 1) variable 'wife_of', which encodes whether the occupational description is a 'wife of' (or 'widow of') title or not.

How large should N be? It is hard to know for certain, but 1000 unique observations is a reasonable first step. This is around 3 hours of labelling work assuming 10 seconds per case. Depending on performance you can always add more.
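A minimal sketch of drawing the labelling sample and estimating the workload, using only the Python standard library. The function names, the fixed seed and the default values are illustrative assumptions, not part of OccCANINE.

```python
import random


def sample_for_labelling(occupations, n=1000, seed=0):
    """Draw up to n unique occupation strings at random for manual labelling."""
    # Deduplicate, then sort so the sample is reproducible for a given seed.
    unique = sorted(set(occupations))
    rng = random.Random(seed)
    return rng.sample(unique, min(n, len(unique)))


def labelling_hours(n, seconds_per_case=10):
    """Rough manual labelling time in hours."""
    return n * seconds_per_case / 3600
```

With the numbers above, labelling_hours(1000) gives 10,000 seconds, or about 2.8 hours, which matches the "around 3 hours" estimate.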

Step 2 (running finetuning):

Loading data:

import pandas as pd

df = pd.read_csv("path/to/data") # Must contain the columns 'occ1', 'lang' and 'wife_of'
label_cols = ["wife_of"] # List of columns holding the labels

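Running finetuning:

from histocc import OccCANINE # Assumes the package is installed as 'histocc', as in the repository README

model = OccCANINE()

model.finetune( # See more options in the documentation of .finetune()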
    df,
    label_cols = label_cols,
    batch_size = 32, # Increase to largest number possible on your computer
    save_name = "OccCANINE_wife_of_Swedish", # Choose suitable name
    new_labels = True, # Instructs to not assume HISCO codes
    verbose_extra = True, # Give updates while running
    epochs = 10 # How long should it run for?
)

Step 3 (Using the finetuned OccCANINE):

# Load model
model = OccCANINE("Finetuned/OccCANINE_wife_of_Swedish")

# Run it on your data (use the column that holds the occupation strings; 'occ1' in the data loaded above)
result = model.predict(df["occ1"])
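Whether more labelled data is needed depends on performance, so it can help to hold back some labelled cases and compare them with the predictions. The sketch below assumes you have already converted the model output into a list of 0/1 values, with 1 meaning 'wife of'. How to do that depends on the output format of the finetuned model, which is not shown here. The function name is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for 0/1 labels (1 = 'wife of', by assumption)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly left alone
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else nan,
        "precision": tp / (tp + fp) if (tp + fp) else nan,
        "recall": tp / (tp + fn) if (tp + fn) else nan,
    }
```

Low recall means 'wife of' titles are still slipping through as real occupations. Low precision means real occupations such as "kallskänka" are being flagged by mistake.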