FHNW-IVGI / Geoharvester

NDGI Project Geoharvester

[NLP, Backend] Extract location information from search query to narrow down results #23

Open FStriewski opened 1 year ago

FStriewski commented 1 year ago
Requested as feature by Pasquale/David 21/3

User story:

As a user I want to get results based on location keywords in my query (using Swissnames).

Description:

Swissnames is the largest available collection of geographic names for Switzerland and allows their translation into coordinates (LV03 LN02, LV95 LN02). The dataset comes in various formats (.gdb, .shp, .csv) for point, line and polygon features. If a geographic name appears in the search query, its coordinates could be used to filter and narrow down the results (using Redis geospatial indexing).
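Redis geospatial commands take longitude/latitude in degrees, so LV95 coordinates from Swissnames would need converting first. A minimal sketch using the approximate formulas published by swisstopo (the function name is hypothetical, and a rigorous transformation would use a proper geodesy library):

```python
def lv95_to_wgs84(east: float, north: float) -> tuple[float, float]:
    """Approximate LV95 (E, N) in metres -> (longitude, latitude) in degrees."""
    # Auxiliary values relative to the projection centre in Bern, in 1000 km.
    y = (east - 2_600_000) / 1_000_000
    x = (north - 1_200_000) / 1_000_000
    lon = (2.6779094 + 4.728982 * y + 0.791484 * y * x
           + 0.1306 * y * x**2 - 0.0436 * y**3)
    lat = (16.9023892 + 3.238272 * x - 0.270978 * y**2
           - 0.002528 * x**2 - 0.0447 * y**2 * x - 0.0140 * x**3)
    # The series yields units of 10000 arc seconds; convert to degrees.
    return lon * 100 / 36, lat * 100 / 36
```

For LV03 input, the same series applies with offsets of 600000 and 200000 instead of 2600000 and 1200000.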

Cases:

  1. No geographic name is used or found in the query: In this case, datasets from "BUND" and "geodienste.ch" should get highest priority in the result ranking
  2. A geographic name is used ("Raumkonzepte in Lenzburg"). Extract the name for the lookup, then favor the results from the Kanton it belongs to.
  3. Multiple matches are found in Swissnames
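The first two cases could be sketched as a sort key. This is only an illustration: the `provider` and `score` fields and the `KT_AG` style provider codes are assumptions about how results might be stored, not the project's actual schema.

```python
FEDERAL_PROVIDERS = {"BUND", "geodienste.ch"}


def rank_by_provider(results, canton=None):
    """Order results by provider priority, then by descending text score.

    canton: provider code of the Kanton resolved from the query, or None
    when no geographic name was found (case 1).
    """
    def priority(item):
        provider = item["provider"]
        if canton is not None and provider == canton:
            return 0  # case 2: the Kanton the place belongs to comes first
        if provider in FEDERAL_PROVIDERS:
            return 1  # case 1: BUND and geodienste.ch lead when no Kanton applies
        return 2

    return sorted(results, key=lambda r: (priority(r), -r["score"]))
```

Case 3 (multiple Swissnames matches) is not covered here; it would need a rule for choosing between candidate Kantons.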

Considerations:

Resources:

p1d1d1 commented 1 year ago

@FStriewski you don't need to ingest a copy of the dataset. You can use the API: https://api3.geo.admin.ch/services/sdiservices.html#search

FStriewski commented 1 year ago

Thanks! So how does this scale for a production system? Is there any limit on the number of requests per time interval that might cause issues down the line? Or won't that apply because both services will be hosted by Swisstopo?

p1d1d1 commented 1 year ago

@davidoesch can you answer here?

davidoesch commented 1 year ago

geo.admin.ch/terms-of-use

20 requests per minute

But: I don't think the request peaks of a normal usage day will strain the infrastructure.

So: under normal usage there is no danger. Just don't scrape the data via the services.
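If the limit ever became a concern, the backend could cache lookups and throttle itself. A minimal sketch, assuming the 20 requests per minute figure above; the class names and the `fetch` callable are hypothetical.

```python
import time
from collections import deque


class RateLimiter:
    """Sliding-window limiter: at most max_calls per period (seconds)."""

    def __init__(self, max_calls=20, period=60.0, clock=time.monotonic):
        self._max_calls = max_calls
        self._period = period
        self._clock = clock
        self._calls = deque()

    def allow(self):
        now = self._clock()
        # Forget calls that have left the window.
        while self._calls and now - self._calls[0] >= self._period:
            self._calls.popleft()
        if len(self._calls) < self._max_calls:
            self._calls.append(now)
            return True
        return False


class CachedLookup:
    """Caches place-name lookups so repeated queries cost no API request."""

    def __init__(self, fetch, limiter):
        self._fetch = fetch  # hypothetical callable: name -> result
        self._limiter = limiter
        self._cache = {}

    def get(self, name):
        key = name.strip().lower()
        if key in self._cache:
            return self._cache[key]
        if not self._limiter.allow():
            return None  # over budget: caller falls back to plain text search
        self._cache[key] = self._fetch(key)
        return self._cache[key]
```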

FStriewski commented 1 year ago

Thanks David.

I checked the Redis documentation yesterday. Its geospatial indexing / geosearch features do not support bounding boxes as stored geometries; they work with point coordinate tuples and query them by distance within a radius or box (https://redis.io/commands/geosearch/).
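For reference, a small sketch of what the point-based commands look like. The helpers only build argument tuples; the key name `datasets:geo`, the member id and the rough Lenzburg coordinates in the usage note are illustrative assumptions.

```python
def geoadd_args(key, lon, lat, member):
    """GEOADD stores a single point (longitude first) per member."""
    return ("GEOADD", key, lon, lat, member)


def geosearch_radius_args(key, lon, lat, radius_km):
    """GEOSEARCH for members within a radius, nearest first, with distances."""
    return ("GEOSEARCH", key, "FROMLONLAT", lon, lat,
            "BYRADIUS", radius_km, "km", "ASC", "WITHDIST")


def geosearch_box_args(key, lon, lat, width_km, height_km):
    """GEOSEARCH for members inside a box centred on the given point."""
    return ("GEOSEARCH", key, "FROMLONLAT", lon, lat,
            "BYBOX", width_km, height_km, "km", "ASC")
```

With redis-py, such a tuple can be sent via `client.execute_command(*geosearch_radius_args("datasets:geo", 8.18, 47.39, 10))`. Note that the box is a query shape around a centre point; the stored members are still points.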

While there is probably a way to put this to use, I was wondering if we are about to overengineer things. If I got you right, the goal is to get the Kanton (or Bund + Geodienste) of the location via Swissnames. Why not have the user provide that information in the first place, e.g. by a dropdown in the UI and a parameter in the API? We planned to add some filters anyway. This way we also don't have to worry about the issue that some names are used in multiple Kantons.

Am I missing something?

davidoesch commented 1 year ago

With a geosearch in Redis in combination with the location search of api.geo.admin.ch, like https://api3.geo.admin.ch/rest/services/api/SearchServer?searchText=Lenzburg&type=locations, for the following use case: I search for "Raumkonzepte in Lenzburg". The search first checks whether the strings are just fill words like "in", "at", etc., and then checks whether the nouns in the search are either in the Redis DB itself or give a result in the location search of api.geo.admin.ch. If the latter is the case, use the result of api.geo.admin.ch to extract lat/lon from the response and then pass it to the Redis geosearch to provide a list of possible results.

So: have a user interface to enter "Raumkonzepte in Lenzburg" and then get a list of results with links, similar to today's POC Geoharvester, just with an additional field containing the location name (if there are multiple results).
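A hedged sketch of the lookup step: building the request URL shown above and pulling coordinates out of a parsed JSON response. The response shape (`results[0]["attrs"]["lat"]` and `["lon"]`) is an assumption here and should be checked against the API documentation; no network call is made.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api3.geo.admin.ch/rest/services/api/SearchServer"


def build_location_url(search_text):
    """Return the location-search URL for a place name."""
    return f"{SEARCH_URL}?{urlencode({'searchText': search_text, 'type': 'locations'})}"


def extract_lat_lon(response):
    """Return (lat, lon) of the first result, or None if nothing usable.

    Assumes each result carries an 'attrs' dict with 'lat' and 'lon'.
    """
    results = response.get("results", [])
    if not results:
        return None
    attrs = results[0].get("attrs", {})
    if "lat" not in attrs or "lon" not in attrs:
        return None
    return float(attrs["lat"]), float(attrs["lon"])
```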

To achieve this use case, you could implement a multi-step process:

1. Preprocess the search query by removing any stop words like "in", "at", etc. This helps to identify the important keywords in the search query that can be used for further processing.

2. Identify the nouns in the search query and check if they are present in the Redis database. You could use a natural language processing (NLP) library like spaCy or NLTK to extract the nouns from the search query.

3. If the nouns are not present in the Redis database, use the location search of api.geo.admin.ch to obtain the lat/lon coordinates for the location. You could make an HTTP GET request to the API endpoint with the search query as a parameter, and then parse the JSON response to extract the coordinates.

4. Once you have the lat/lon coordinates, you can use Redis geosearch to find all the possible results near that location. Redis supports geospatial queries through the GEOADD and GEOSEARCH commands (GEOSEARCH replaces the older GEORADIUS as of Redis 6.2). You could add the dataset locations to Redis using GEOADD and then query for nearby entries using GEOSEARCH with the coordinates and a search radius.

5. Finally, you can return the list of possible results to the user. You could sort the results by distance from the search location to provide a ranked list of results.
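The first two steps can be illustrated without an NLP library. This is a crude stand-in, not what spaCy or NLTK would do: it relies on German capitalizing nouns, and the stop-word list is a tiny hypothetical sample.

```python
STOP_WORDS = {"in", "at", "im", "bei", "von", "der", "die", "das"}


def remove_stop_words(query):
    """Split on whitespace and drop stop words (case-insensitive)."""
    return [token for token in query.split() if token.lower() not in STOP_WORDS]


def candidate_nouns(tokens):
    """Keep capitalized tokens, stripped of surrounding punctuation."""
    return [token.strip(".,;:!?") for token in tokens if token[:1].isupper()]
```

The heuristic misfires on sentence-initial words and lowercase input, which is where a part-of-speech tagger earns its keep.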

Some pseudocode:

    search_query = "Raumkonzepte in Lenzburg"

    # Step 1: Preprocess the search query by removing stop words
    preprocessed_query = remove_stop_words(search_query)

    # Step 2: Check if nouns in the search query are in Redis database
    nouns = extract_nouns(preprocessed_query)
    redis_results = query_redis(nouns)

    # Step 3: If no results found in Redis, use location search API to get lat/lon
    if not redis_results:
        location_results = query_location_api(preprocessed_query)
        lat, lon = extract_lat_lon(location_results)
        redis_results = query_redis_by_location(lat, lon)

    # Step 4: Use Redis geosearch to find all possible results near location
    geosearch_results = query_redis_by_geosearch(lat, lon, radius)

    # Step 5: Return a list of ranked results to the user
    ranked_results = rank_results(geosearch_results)
    return ranked_results

Note that the pseudocode assumes the implementation of several helper functions, such as remove_stop_words, extract_nouns, query_redis, query_location_api, extract_lat_lon, query_redis_by_location, query_redis_by_geosearch, and rank_results. You would need to implement these functions using appropriate libraries and APIs, depending on your specific programming language and environment.
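One way to wire those helpers together as runnable Python, with the helpers passed in as callables so that the control flow can be tested in isolation. This is one reading of the pseudocode, not a reference implementation: it returns keyword hits directly when Redis has them, and only falls back to the location lookup otherwise. The default radius is an arbitrary placeholder.

```python
def search(query, *, remove_stop_words, extract_nouns, query_redis,
           query_location_api, extract_lat_lon, query_redis_by_geosearch,
           rank_results, radius_km=10.0):
    """Keyword search first; fall back to a location-based geosearch."""
    nouns = extract_nouns(remove_stop_words(query))
    hits = query_redis(nouns)
    if hits:
        return rank_results(hits)

    # No keyword hits: try to resolve the nouns to a place.
    location = extract_lat_lon(query_location_api(nouns))
    if location is None:
        return []
    lat, lon = location
    return rank_results(query_redis_by_geosearch(lat, lon, radius_km))
```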

Is it overengineered? Maybe

FStriewski commented 1 year ago

Yes, I had something similar in mind. Stopword removal and noun detection are (afaik) already implemented.

I see a couple of potential problems:

So that's why I was thinking "have the user provide more context". I am not too familiar with the spatial extent of the various datasets, so thinking only on Kanton level might indeed not be good enough.

Then again, what is the problem we really want to solve here? Better matching of query and results? Couldn't we achieve this more easily with

p1d1d1 commented 1 year ago

To keep things simple, IMHO the "search by location" is just a filter. We can add a kind of check box "Search/Filter by location": if checked, we present the user a little map where we ask him/her to identify an area of interest by drawing a point/rectangle. The "text" search and the "Search by location" can work together or independently. E.g:

davidoesch commented 1 year ago

Search by location via a filter is an easy solution, already available on geocat.ch (selecting the provider on the left side) or via a bbox drawn in the GUI, as in e.g. https://suche.kartenportal.ch/#bbox=7.959899902343747,46.27529514370323,9.756774902343748,47.16790406422331&q=&date_from=0&date_to=9999&scale_from=&scale_to=&libraries=&scanned_only=0&series=&map=road .

Caveat: it is another GUI element you have to deal with: a map and toggle buttons to activate or deactivate it. Implementing a map-based viewer seems to be the easiest way.

However: the user usually has a question, and the state of the art these days (see the rise of LLMs like ChatGPT) is: you state your question, with no more fiddling around with user interfaces. Solving it all from one single textbox would be innovative.

Addressing @FStriewski's points
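A bbox filter of that kind reduces to a rectangle overlap test. A minimal sketch, assuming the `bbox` parameter is ordered (min_lon, min_lat, max_lon, max_lat), which the values in the URL above suggest but which is not confirmed here; the function names are hypothetical.

```python
from urllib.parse import parse_qs


def parse_bbox(fragment):
    """Read 'bbox=a,b,c,d' from a URL fragment; None if absent or malformed."""
    values = parse_qs(fragment.lstrip("#")).get("bbox")
    if not values:
        return None
    try:
        parts = tuple(float(p) for p in values[0].split(","))
    except ValueError:
        return None
    return parts if len(parts) == 4 else None


def intersects(a, b):
    """True if two (min_lon, min_lat, max_lon, max_lat) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]
```

A dataset would pass the filter when its own extent intersects the box the user drew.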

-> Take the first ten results you get from https://api.geo.admin.ch/rest/services/api/SearchServer?searchText=Bern&origins=kantone,district,gg25&type=locations . What we do in the map.geo.admin.ch search: we first show the results for "Kantone", then "Bezirk" (district), then "Gemeinde" (gg25). The results within each group are ranked by "rank" and "weight".

Rank the layers so that you first list the datasets from BUND and GEODIENSTE, if they cover the topic, and then the other datasets.

- Having the user define a target Kanton (optional argument) -> this can be done in the search itself IMHO, with a BOOL parameter like in my POC: "Raumkonzept" "KT_AG"

- Extract geographic names within preprocessing from the abstract, and store them in a keyword field, to check for exact matches? -> Yeah, thought about that as well: use OSM Overpass to get the 10 biggest cities / places and store them in keywords. But Redis will explode because it will have a lot of duplicates.
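The grouping used for location results could be sketched as a sort key. Where `origin`, `rank` and `weight` sit in the response, and the direction of each (lower rank first, higher weight first), are assumptions made for illustration only.

```python
ORIGIN_ORDER = {"kantone": 0, "district": 1, "gg25": 2}


def order_locations(results, limit=10):
    """Take the first `limit` results, then group by origin and sort within."""
    def key(item):
        attrs = item["attrs"]
        group = ORIGIN_ORDER.get(attrs["origin"], len(ORIGIN_ORDER))
        # Assumed: lower rank is better, higher weight is better.
        return (group, attrs["rank"], -item["weight"])

    return sorted(results[:limit], key=key)
```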
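The optional Kanton argument inside the search string could be split off with a small parser. A sketch assuming tokens of the form `KT_` plus a two-letter code, as in the `KT_AG` example above; the function name is hypothetical and the code is not validated against the list of real Kantons.

```python
import re

# Matches KT_XX, optionally wrapped in double quotes.
CANTON_TOKEN = re.compile(r'"?\bKT_([A-Z]{2})\b"?')


def split_canton_filter(query):
    """Return (query without the KT_ token, canton code or None)."""
    match = CANTON_TOKEN.search(query)
    if match is None:
        return query.strip(), None
    remaining = CANTON_TOKEN.sub(" ", query)
    return " ".join(remaining.split()), match.group(1)
```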