There are several possible ways to formulate an evaluation:
Human extrinsic evaluation: Choose two or three clear test cases of semantic shift and have an expert manually annotate the concepts related to a target term and how those related concepts vary across years or decades. Then compare the terms chosen by the automatic methods against the human-labeled examples: how well do the clusters match expectations? This is not scalable, but it gives us full control over the domains and cases we care about (a minimal sketch of such a comparison follows after this list).
Lexical Semantic Shifts: Generate labeled test sets or re-use datasets created for semantic shift shared tasks in other languages. A good source of inspiration is the AXOLOTL-24 Shared Task on Explainable Semantic Change Modeling. This is evaluation at scale, but an out-of-domain evaluation might not reflect how well the models will perform on our data (a sketch of scoring against such a labeled set is also shown below).
Framing: Framing detection is a popular task in NLP for the social sciences. In a way, we are detecting how a target term is framed in different time epochs. Therefore, we could also look at that related work and imitate the ways its results have been evaluated.
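The sketch below illustrates the human extrinsic evaluation: it compares the related terms an automatic method proposes for a target term in each decade against an expert-annotated list, using simple set overlap. The target term, decades, and term lists are hypothetical placeholders, not real annotations.

```python
# Minimal sketch: compare the related terms an automatic method proposes per
# decade against an expert-annotated gold list. All data here is hypothetical.

def jaccard(predicted: set[str], gold: set[str]) -> float:
    """Set overlap between predicted and expert-annotated related terms."""
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)

# Expert annotations: related concepts for the target term in each decade (hypothetical).
gold_by_decade = {
    "1990s": {"floppy", "modem", "desktop"},
    "2000s": {"broadband", "laptop", "email"},
    "2010s": {"smartphone", "cloud", "streaming"},
}

# Terms produced by an automatic method, e.g. nearest neighbours of the
# target term's decade-specific embedding (also hypothetical).
predicted_by_decade = {
    "1990s": {"floppy", "desktop", "printer"},
    "2000s": {"laptop", "email", "wifi"},
    "2010s": {"cloud", "streaming", "app"},
}

for decade, gold in gold_by_decade.items():
    score = jaccard(predicted_by_decade.get(decade, set()), gold)
    print(f"{decade}: overlap with expert annotation = {score:.2f}")
```

Precision and recall over the same sets would work equally well; the point is simply that a handful of carefully curated cases is enough to sanity-check each method in our own domain.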
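For the labeled-test-set route, the following sketch assumes a simplified gold standard with a binary change label and a graded change score per target word; AXOLOTL-24 defines its own data format and metrics, so this only illustrates the scoring side. All words, labels, and model scores are hypothetical.

```python
# Minimal sketch of scoring a semantic-shift model against a labeled test set,
# assuming a simplified gold standard with binary and graded change annotations.

from scipy.stats import spearmanr

# Gold standard: (binary change label, graded change score) per target word (hypothetical).
gold = {
    "cell":  (1, 0.85),
    "plane": (0, 0.10),
    "tape":  (1, 0.60),
    "apple": (1, 0.75),
}

# Scores produced by our model, e.g. cosine distance between epoch-specific
# embeddings of each word (hypothetical values).
model_scores = {"cell": 0.9, "plane": 0.2, "tape": 0.4, "apple": 0.7}

# Binary evaluation: threshold the model score and measure accuracy.
threshold = 0.5
correct = sum(
    (model_scores[w] >= threshold) == bool(label)
    for w, (label, _) in gold.items()
)
print(f"Binary accuracy: {correct / len(gold):.2f}")

# Graded evaluation: rank correlation between model scores and gold scores.
words = list(gold)
rho, _ = spearmanr([model_scores[w] for w in words],
                   [gold[w][1] for w in words])
print(f"Spearman correlation: {rho:.2f}")
```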
How can we compare different approaches?