Open suvihe opened 2 months ago
This is a data problem. OccCANINE can only classify occupations if the input strings describe the person's occupation, or the occupation that should be ascribed to them. However, OccCANINE might still be useful in solving this data cleaning problem, and we have included the OccCANINE.finetune() method for exactly this type of case.
The question of which occupation to assign to widows does not have a trivial answer. In some applications "captain's wife" should indeed be coded as a captain. In other applications, this leads to bias as described. The same applies to people who are retired: if you want to measure the social status of an individual, it might be sensible to give them their earlier occupation rather than bunching everyone into the label 'retired'.
The default behavior among researchers seems to be to include such occupational descriptions. As such, this is the behavior that OccCANINE has picked up in training. Nevertheless, OccCANINE can also help solve the problem via fine-tuning. We describe how fine-tuning works in general in this colab notebook.
For this specific issue we suggest one of the two following options:
We believe that option 1 is likely to be the most efficient, so we describe it in full. Option 2 involves assigning '-1' to all such cases and then running the fine-tuning procedure as described in the notebook.
Option 1: OccCANINE understands occupations and is therefore a natural starting point for a model that solves this other classification problem of 'wife of' versus 'real occupation'.
Step 1 (finetuning data): Take N random observations from your intended data and manually label each one as a 'wife of' / 'widow of' title or not. In practice this can be a binary (0 or 1) variable 'wife_of', which encodes whether the occupational description is a 'wife of' title or not.
How large should N be? It is hard to know for certain, but 1000 unique observations is a reasonable first step. This is around 3 hours of labelling work assuming 10 seconds per case. Depending on performance you can always add more.
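A minimal sketch of how the labelling sample could be drawn, using only the Python standard library. The function name `draw_labelling_sample` and the fixed seed are illustrative assumptions, not part of OccCANINE; the column names 'occ1' and 'wife_of' follow the fine-tuning data described in this thread, and the toy strings are taken from the examples in the issue.

```python
import random

def draw_labelling_sample(occ_strings, n=1000, seed=0):
    """Pick up to n distinct occupation strings at random for manual labelling."""
    unique = sorted(set(occ_strings))  # de-duplicate; sort so the seed gives a stable draw
    rng = random.Random(seed)
    chosen = rng.sample(unique, min(n, len(unique)))
    # 'wife_of' is left empty here and is filled in by hand with 0 or 1
    return [{"occ1": s, "wife_of": None} for s in chosen]

# Toy input for illustration only
toy = ["advokatsänka", "kallskänka", "advokatsänka", "jungfru", "hjälpfru"]
rows = draw_labelling_sample(toy, n=3)

# Labelling time for 1000 cases at 10 seconds per case, in hours
hours = 1000 * 10 / 3600
```

The rows can then be written to a CSV file, labelled by hand, and loaded for fine-tuning. The time estimate works out to a little under 3 hours, in line with the figure above.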
Step 2 (running finetuning):
Loading data:
```python
import pandas as pd
df = pd.read_csv("path/to/data")  # Containing 'occ1', 'lang' and 'wife_of'
label_cols = ["wife_of"]  # List columns with labels
```
Running finetuning:
```python
from histocc import OccCANINE  # import path assumed from the OccCANINE README

model = OccCANINE()
model.finetune(  # See more options in the documentation of .finetune()
    df,
    label_cols = label_cols,
    batch_size = 32,  # Increase to the largest number possible on your computer
    save_name = "OccCANINE_wife_of_Swedish",  # Choose a suitable name
    new_labels = True,  # Instructs the model not to assume HISCO codes
    verbose_extra = True,  # Give updates while running
    epochs = 10  # How long should it run for?
)
```
Step 3 (Using the finetuned OccCANINE):
```python
# Load model
model = OccCANINE("Finetuned/OccCANINE_wife_of_Swedish")
# Run it on your data ('occstr' stands for the column holding your occupation strings)
result = model.predict(df['occstr'])
```
OccCANINE matches titles containing the word "widow" (änka) or "wife" (fru) to the husband's occupation, if it is given. Depending on the research question, this can become an issue (e.g., it inflates employment in certain occupations and inflates female labor force participation).
The problem arises specifically with the Swedish language model, because in the Swedish context it was relatively common for women to list themselves with titles such as "captain's wife". For example, "advokatsänka" (translation: lawyer's widow) is matched to 12110 both in IPUMS and by OccCANINE.
There are a large number of observations where a widow or a wife has also given her own occupation, though the order in which her own and her husband's occupation are listed is not consistent. Here OccCANINE performs better than IPUMS coding, as OccCANINE returns two or more HISCOs. Still, it requires the researcher to go through these cases manually.
The manual work left after using OccCANINE is still quite overwhelming. For example, I have over 12,000 unique observations with the word "änka" in the occupation string.
My question is: Is it possible for OccCANINE to account for titles with widow or wife within them?
Regarding the solution, there are some titles that, despite including the word "änka" or "fru", are legitimate occupations:
- Kallskänka (garde manger)
- Hjälpfru (housewife's maid)
- Frukthandlare (fruit seller)
- Jungfru (depending on the context, either a young unmarried woman or a female servant)
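To illustrate why plain substring matching is not enough here, the following is a rough sketch of a rule-based pre-filter that flags strings containing "änka" or "fru" while skipping the legitimate occupations listed above. The names `MARKERS`, `EXCEPTIONS` and `flag_for_review` are hypothetical, and this is not how OccCANINE works internally; it is only one way to shortlist cases for manual review or labelling.

```python
MARKERS = ("änka", "fru")
# Legitimate occupations from the list above that happen to contain a marker
EXCEPTIONS = {"kallskänka", "hjälpfru", "frukthandlare", "jungfru"}

def contains_marker(occ):
    """Naive check: does the string contain 'änka' or 'fru' anywhere?"""
    s = occ.strip().lower()
    return any(m in s for m in MARKERS)

def flag_for_review(occ):
    """Flag strings with a marker unless the whole string is a known exception."""
    s = occ.strip().lower()
    if s in EXCEPTIONS:
        return False
    return contains_marker(s)

examples = ["advokatsänka", "Kallskänka", "Frukthandlare", "Jungfru", "Hjälpfru"]
naive = {e: contains_marker(e) for e in examples}
flags = {e: flag_for_review(e) for e in examples}
```

The naive check marks all five examples, whereas the exception list leaves only "advokatsänka" flagged. The exception list only handles exact matches, so compound or multi-occupation strings, and context-dependent titles such as Jungfru, would still need a closer look.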