UCL / frrant

2 stars 0 forks source link

Clarify why we need ElasticSearch #8

Open raquelalegre opened 3 years ago

raquelalegre commented 3 years ago

Estimate: 3 days

Most of the search requirements can be satisfied by Postgres (partial search, search by similarity, etc.). Perhaps we can get rid of ElasticSearch altogether?

From the Reqs doc:

Search mechanism for ancient texts, summaries and comments, to accept terms in English, Latin and Greek.
Search should account for different spelling variations, symbols and grammatical forms, whether by use of wildcards ("*"), automatic substitution (e.g. v ↔ u) or more sophisticated rules
Search should not be affected by editorial symbols and other non-textual characters.
Option to restrict search to text in particular fields (e.g. fragment text, testimonia text, antiquarian essay), or to search across all content
raquelalegre commented 3 years ago
raquelalegre commented 3 years ago

We'll wait for some feedback once the project team starts using the application's search functionality but it doesn't sound like they'll have very complex search requirements. See #1.

raquelalegre commented 3 years ago

Looking further into ElasticSearch vs PostgreSQL, whereas the functionality requested by the team is not very complex, it will be less performant in PostgreSQL than Elastic as more text is added by the team. However, maintenance of Elastic is much more costly, and we'd have to change the current data ingestion functionality since so far everything has been done only in Postgres.

We could keep it simple using only Postgres in a way that is as performant as Elastic, as explained on this post. This requires some database tuning as well as some tweaking of the queries, which poses a problem since as this is a Django application and all SQL is autogenerated. It is possible to write bespoke SQL statements, but would be harder to maintain and less understandable.

Next steps would be:

raquelalegre commented 3 years ago

It seems like the text search is getting complicated with Postgres, so we are considering using ElasticSearch. We could:

  1. Replace Postgres completely with Elastic - not a very good idea as we might lose data from the production server
  2. Duplicate the data and save it both in Postgres and Elastic. We've considered two approaches:
    • Overload the save method that writes in the PostgresDB so it also writes to Elastic - this might also not be a good idea as we might miss places in the codebase where the database is modified and eventually end up with two different versions of the data.
    • Have a process (e.g. something configured with celery and rabbitmq, for example) that does mirroring between the 2 databases, always keeping Postgres as the main DB.

We'd only use the Elastic version of the database for the search functionality of the website.

Some questions we need to address:

acholyn commented 4 months ago

might be worth using in regards to #433 #454 , otherwise, we should get rid of it