datacommonsorg / website

Code for the Data Commons website
https://datacommons.org
Apache License 2.0

[Stop Words] Introduce stop words exception list and add "how many", "number of" to this list #4416

Closed shifucun closed 1 day ago

shifucun commented 4 days ago

The larger embeddings model understands semantics better (rather than relying on token matching), so we should try to preserve stop words. Including all stop words would be a big change, so we should do this in a controlled way.

This introduces an exception list of stop-word phrases that should be kept.

With this, the query sentence will be "number of asian" instead of "number asian". Turns out this boosts the matching and the score a lot!

https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_07_01_13_41_51.html
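
For readers following along, the exception-list idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `remove_stop_words`, the `STOP_WORDS` set, and the tokenization regex are all assumptions made for the example. Only the two exception phrases ("how many", "number of") come from the PR title.

```python
import re

# Hypothetical stop word list, for illustration only (not the list used in the repo).
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "are", "what", "how", "many"}

# Phrases named in this PR that should survive stop word removal.
STOP_WORD_EXCEPTIONS = ["how many", "number of"]


def remove_stop_words(query, stop_words=STOP_WORDS, exceptions=STOP_WORD_EXCEPTIONS):
    """Drop stop words from a query, keeping tokens that belong to an exception phrase."""
    tokens = re.findall(r"[a-z0-9']+", query.lower())

    # Mark the positions of every token covered by an exception phrase.
    protected = set()
    for phrase in exceptions:
        phrase_tokens = phrase.split()
        n = len(phrase_tokens)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == phrase_tokens:
                protected.update(range(i, i + n))

    # Keep a token if it is protected or is not a stop word.
    kept = [t for i, t in enumerate(tokens) if i in protected or t not in stop_words]
    return " ".join(kept)


print(remove_stop_words("number of asian"))                  # number of asian
print(remove_stop_words("number of asian", exceptions=[]))   # number asian
```

Matching on token positions rather than on raw substrings means a stop word such as "of" is kept only where it is part of a listed phrase, and is still dropped elsewhere in the query.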

shifucun commented 1 day ago

This is the diff https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_07_01_13_41_51.html

shifucun commented 1 day ago

The svindex diff shows a few good improvements and some loss(ish). A major behavior change is that stat vars (with "number of") are now boosted higher than the topics!

shifucun commented 1 day ago

Updated dc/topic/Disabilities to reflect that it's about population. Updated the goldens based on all the latest changes.