datacommonsorg / website

Code for the Data Commons website
https://datacommons.org
Apache License 2.0
24 stars, 82 forks

[NL] Did not recognize place correctly! #2367

Open pradh opened 1 year ago

pradh commented 1 year ago

[what are the biggest cities in mexico] -- didn't recognize "mexico", which caused weird SVs to be returned.

[what are the biggest cities in new england] -- picks out "england". But with 'New England' (capitalized) in the query, the recognition works.

pradh commented 1 year ago

[US states with the most highly educated rural areas] -- recognized "statesville" as the place.

pradh commented 1 year ago

The "US states" issue has been fixed.

pradh commented 1 year ago

Given the upcoming place recognition changes, there is nothing to fix here, unless it's a very common pattern like "US states".

CC @jehangiramjad

jehangiramjad commented 1 year ago

Another example: "earthquakes in toronto" does not recognize Toronto. Similarly, "earthquakes in canada" does not recognize Canada. Surprisingly, the offline colabs using the same NER library do detect the location strings, so this needs further analysis.

Analysis for this issue:

Consider how our current place detection heuristics handle an example query, "median income in new york city":

The heuristics can detect BOTH "new york" and "new york city" as possible place strings. Right now, we prefer the longer string over the shorter one when the shorter is entirely contained inside the longer.

The two queries above hit the same rule, but for them it results in no place being detected:

[earthquakes in canada] => finds "canada", but one heuristic somehow also finds "earthquakes canada" as a place. So we prefer the longer one (since "canada" is contained inside "earthquakes canada"), and that string does not resolve to any place. If we remove this restriction, both "canada" and "earthquakes canada" will be passed through Maps and DC recon, and eventually only "canada" will be resolved.

[earthquakes in toronto] => same issue.
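To make the failure mode concrete, here is a minimal sketch of the "prefer longer over shorter" rule as described in this comment. It is not the actual website code: the function name `prefer_longer` and the plain substring check are assumptions for illustration only.

```python
def prefer_longer(candidates):
    """Drop any candidate string that is entirely contained in a different candidate.

    Hypothetical illustration of the rule described in this thread,
    not the real implementation.
    """
    kept = []
    for cand in candidates:
        # A candidate is dropped if some other candidate contains it as a substring.
        contained = any(cand != other and cand in other for other in candidates)
        if not contained:
            kept.append(cand)
    return kept


# Intended case: the more specific place string survives.
print(prefer_longer(["new york", "new york city"]))

# Failure case: the bogus longer string survives and the real place is dropped.
print(prefer_longer(["canada", "earthquakes canada"]))
```

In the second call only "earthquakes canada" is kept, so the one string that could have been resolved never reaches the resolution step.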

Potential fix: This rule (prefer long vs. short) was added when we supported only one main place. Now that we support multiple places, it might make sense to get rid of this restriction.
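A rough sketch of what dropping the restriction could look like: pass every candidate to resolution and keep only those that resolve. The `resolve` function and the `KNOWN_PLACES` table below are hypothetical stand-ins for the Maps and DC recon step, and the place IDs are placeholders, not real identifiers.

```python
# Hypothetical stand-in for the Maps + DC recon step; IDs are placeholders.
KNOWN_PLACES = {
    "canada": "place-id-canada",
    "toronto": "place-id-toronto",
}


def resolve(candidate):
    """Return a place ID for the candidate string, or None if it does not resolve."""
    return KNOWN_PLACES.get(candidate)


def detect_places(candidates):
    """Keep every candidate that resolves, without filtering by string containment."""
    resolved = {}
    for cand in candidates:
        place_id = resolve(cand)
        if place_id is not None:
            resolved[cand] = place_id
    return resolved


print(detect_places(["canada", "earthquakes canada"]))
print(detect_places(["toronto", "earthquakes toronto"]))
```

One thing that would likely need checking with this approach: for a query like "median income in new york city", both "new york" and "new york city" might resolve, so some later step may still have to choose between overlapping places.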