There are several possible ways to formulate an evaluation:
Human extrinsic evaluation: Choose two or three clear test cases of semantic shift and have an expert manually annotate the concepts related to a target term and how those related concepts vary across years or decades. Then compare the terms chosen by the automatic methods against the human-labeled examples: how well do the clusters match expectations? This is not scalable, but it gives us full control over the domains and cases we care about (a minimal sketch of such a comparison follows after this list).
Lexical Semantic Shifts: Generate labeled test sets or re-use datasets created for semantic shift shared tasks in other languages. A good source of inspiration is the AXOLOTL-24 Shared Task on Explainable Semantic Change Modeling. This is evaluation at scale, but an out-of-domain evaluation might not reflect how well the models will perform on our data (a sketch of scoring against such a labeled set is also shown below).
Framing: Framing detection is a popular task in NLP for the social sciences. In a way, we are detecting how a target term is framed in different time epochs. Therefore, we could also look at that related work and imitate the ways its results have been evaluated.
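The sketch below illustrates the human extrinsic evaluation: it compares the related terms an automatic method proposes for a target term in each decade against an expert-annotated list, using simple set overlap. The target term, decades, and term lists are hypothetical placeholders, not real annotations.

```python
# Minimal sketch: compare the related terms an automatic method proposes per
# decade against an expert-annotated gold list. All data here is hypothetical.

def jaccard(predicted: set[str], gold: set[str]) -> float:
    """Set overlap between predicted and expert-annotated related terms."""
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)

# Expert annotations: related concepts for the target term in each decade (hypothetical).
gold_by_decade = {
    "1990s": {"floppy", "modem", "desktop"},
    "2000s": {"broadband", "laptop", "email"},
    "2010s": {"smartphone", "cloud", "streaming"},
}

# Terms produced by an automatic method, e.g. nearest neighbours of the
# target term's decade-specific embedding (also hypothetical).
predicted_by_decade = {
    "1990s": {"floppy", "desktop", "printer"},
    "2000s": {"laptop", "email", "wifi"},
    "2010s": {"cloud", "streaming", "app"},
}

for decade, gold in gold_by_decade.items():
    score = jaccard(predicted_by_decade.get(decade, set()), gold)
    print(f"{decade}: overlap with expert annotation = {score:.2f}")
```

Precision and recall over the same sets would work equally well; the point is simply that a handful of carefully curated cases is enough to sanity-check each method in our own domain.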
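For the labeled-test-set route, the following sketch assumes a simplified gold standard with a binary change label and a graded change score per target word; AXOLOTL-24 defines its own data format and metrics, so this only illustrates the scoring side. All words, labels, and model scores are hypothetical.

```python
# Minimal sketch of scoring a semantic-shift model against a labeled test set,
# assuming a simplified gold standard with binary and graded change annotations.

from scipy.stats import spearmanr

# Gold standard: (binary change label, graded change score) per target word (hypothetical).
gold = {
    "cell":  (1, 0.85),
    "plane": (0, 0.10),
    "tape":  (1, 0.60),
    "apple": (1, 0.75),
}

# Scores produced by our model, e.g. cosine distance between epoch-specific
# embeddings of each word (hypothetical values).
model_scores = {"cell": 0.9, "plane": 0.2, "tape": 0.4, "apple": 0.7}

# Binary evaluation: threshold the model score and measure accuracy.
threshold = 0.5
correct = sum(
    (model_scores[w] >= threshold) == bool(label)
    for w, (label, _) in gold.items()
)
print(f"Binary accuracy: {correct / len(gold):.2f}")

# Graded evaluation: rank correlation between model scores and gold scores.
words = list(gold)
rho, _ = spearmanr([model_scores[w] for w in words],
                   [gold[w][1] for w in words])
print(f"Spearman correlation: {rho:.2f}")
```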
How can we compare different approaches?