WorldHistoricalGazetteer / whgazetteer

World Historical Gazetteer platform
http://whgazetteer.org
BSD 3-Clause "New" or "Revised" License

Search algorithms for Elasticsearch index and PostgreSQL database #70

Open kgeographer opened 2 years ago

kgeographer commented 2 years ago

The WHG search function offers two options - search of the "union index", and search of the public records in the relational database. Pre- and post-search filters allow for narrowing search results by area or region, by broad place category and/or narrow type, and by timespan.

The most glaring shortcoming in WHG search concerns the matching of place name search terms. Currently, the name lookup attempts to match the exact string entered against any name variant found in the WHG index (or in the database, for a database search). Because existing records may not include a name variant with the exact spelling entered, good potential matches are often missed. The name search needs to find similar names that fall within the bounds entered in the "SPATIAL" filter.
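As a rough illustration of the index-side requirement, here is a minimal sketch of an Elasticsearch query body that combines a fuzzy name match with a bounding-box filter. The field names `names` and `location` are assumptions for illustration, not the actual WHG index mapping, and `build_name_query` is a hypothetical helper. It only builds the query dict; it does not contact an index.

```python
import json


def build_name_query(name, bbox=None, fuzziness="AUTO"):
    """Build an Elasticsearch query body for a fuzzy place-name search.

    name: the search term entered by the user.
    bbox: optional (west, south, east, north) bounds in decimal degrees.
    The field names "names" and "location" are hypothetical.
    """
    # Fuzzy match tolerates small spelling differences (edit distance).
    must = [{"match": {"names": {"query": name, "fuzziness": fuzziness}}}]

    # Optional spatial filter: keep only hits inside the bounding box.
    filters = []
    if bbox is not None:
        west, south, east, north = bbox
        filters.append({
            "geo_bounding_box": {
                "location": {
                    "top_left": {"lat": north, "lon": west},
                    "bottom_right": {"lat": south, "lon": east},
                }
            }
        })

    return {"query": {"bool": {"must": must, "filter": filters}}}


if __name__ == "__main__":
    body = build_name_query("Alexandria", bbox=(25.0, 28.0, 35.0, 33.0))
    print(json.dumps(body, indent=2))
```

The resulting dict could be passed as the query body to the Elasticsearch Python client. Whether `fuzziness` alone is enough for historical name variants is an open question; phonetic or n-gram analyzers may be worth testing as well.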

This requirement overlaps with Issue #68, which deals with name matching in the Wikidata reconciliation process using Python-wrapped Elasticsearch query language. However, it also requires a similar solution for searches against the relational database, which are currently performed with a simple Django filter function. More options are possible by using SQL directly, with PostgreSQL's 'fuzzy string matching' functionality and spatial filters.
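For the database side, one possible direction is sketched below: a parameterized SQL query that ranks names by trigram similarity and restricts results to a bounding box. It assumes the `pg_trgm` and PostGIS extensions are installed, and the table and column names (`place_name`, `place_geom`, `place_id`, `name`, `geom`) are hypothetical, not the WHG schema. `build_fuzzy_place_sql` is an illustrative helper that returns the SQL and its parameters without touching a database.

```python
def build_fuzzy_place_sql(name, bbox, min_similarity=0.3, limit=20):
    """Return (sql, params) for a fuzzy, spatially filtered name search.

    name: the search term entered by the user.
    bbox: (west, south, east, north) bounds in decimal degrees (SRID 4326).
    Table and column names below are hypothetical.
    """
    west, south, east, north = bbox
    sql = """
        SELECT n.place_id, n.name, similarity(n.name, %s) AS score
        FROM place_name AS n
        JOIN place_geom AS g ON g.place_id = n.place_id
        WHERE similarity(n.name, %s) >= %s
          AND ST_Intersects(g.geom, ST_MakeEnvelope(%s, %s, %s, %s, 4326))
        ORDER BY score DESC
        LIMIT %s
    """
    # Placeholders are filled by the database driver, never by string
    # formatting, so user input cannot alter the SQL.
    params = [name, name, min_similarity, west, south, east, north, limit]
    return sql, params


if __name__ == "__main__":
    sql, params = build_fuzzy_place_sql("Alexandria", (25.0, 28.0, 35.0, 33.0))
    print(sql)
    print(params)
```

In Django, the pair could be run with `connection.cursor().execute(sql, params)`. For large tables, the `%` operator form of the trigram comparison is the one that can use a trigram index, so the `WHERE` clause may need adjusting once performance matters.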

A sandbox environment for the WHG Elasticsearch index instance is available, so knowledge of Python/Django or the WHG codebase generally is not essential. That said, WHG wraps ES queries in Python using the "official" Elasticsearch Python client, so familiarity with that client would be a further help.

For the database requirement, knowledge of Django and/or PostgreSQL is needed. A sandbox environment could be set up in fairly short order.