shifucun closed 1 day ago
The svindex diff shows a few good improvements and some borderline losses. A major behavior change is that stat vars (with "number of") are now boosted higher than the topics!
Updated dc/topic/Disabilities to reflect that it is about population. Updated the golden files based on all the latest changes.
The larger embeddings model understands semantics better, rather than relying on token matching, so we should try to preserve stop words. Including all stop words would be a big change, so we should do this in a controlled way.
This introduces an exception list of terms that are related to stop words and should be kept.
With this, the query sentence will be "number of asian" instead of "number asian". It turns out this boosts the matching and the score a lot!
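The exception-list behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `STOP_WORDS`, `STOP_WORD_EXCEPTIONS`, and `remove_stop_words` are hypothetical, and the word lists are placeholder values chosen only to reproduce the "number of asian" example.

```python
# Hypothetical values for illustration only; the real lists live in the codebase.
STOP_WORDS = {"of", "the", "in", "a", "an"}
STOP_WORD_EXCEPTIONS = ["number of"]


def remove_stop_words(query, stop_words=STOP_WORDS, exceptions=STOP_WORD_EXCEPTIONS):
    """Drop stop words from a query, keeping any that are part of an exception phrase."""
    tokens = query.lower().split()
    # keep[i] is True when token i belongs to a matched exception phrase.
    keep = [False] * len(tokens)
    for phrase in exceptions:
        phrase_tokens = phrase.lower().split()
        n = len(phrase_tokens)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == phrase_tokens:
                for j in range(i, i + n):
                    keep[j] = True
    # A token survives if it is protected or is not a stop word.
    return " ".join(t for t, k in zip(tokens, keep) if k or t not in stop_words)


print(remove_stop_words("number of asian"))                 # exception list applied
print(remove_stop_words("number of asian", exceptions=[]))  # previous behavior
```

Protecting whole phrases, rather than removing "of" from the stop-word list outright, keeps the change controlled: "of" is still stripped everywhere except inside a listed phrase.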
https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_07_01_13_41_51.html