UCL / frrant

2 stars 0 forks source link

Search substitutions #20

Closed raquelalegre closed 2 years ago

raquelalegre commented 3 years ago

Specifically:

Antonino will provide a full list

raquelalegre commented 3 years ago

More added by Antonino:

raquelalegre commented 3 years ago

@peterbispham This is not needed until later on in the project. If you could document here or give us some indication on how we can do it later on using the current ProstgreSQL DB before you leave, that'd be great.

tim-band commented 3 years ago

I have investigated using collation for this, and it won't work. Firstly, Django uses UPPER(text) LIKE UPPER("%" || keyword || "%") to implement icontains. Moreover, LIKE or ILIKE are not affected by collation in Postgres, only ORDER BY, sadly, so we will not go to space today.

Possibilities are all a bit grim: (1) Create new shadow fields in the models that might contain Latin (or create new tables). Every time the field changes, change the shadow field to match, but with all 'v's changed to 'u's, 'um est's to 'umst's and so on. Then similarly fold the search keywords and search these shadow fields. or (2) Search keywords for either side of these folds (like, search for 'um est', 'umst', 'u' or 'v'). If you find an 'um est', say, replace it with 'umst' and make a query that uses UPPER(REPLACE(field,'um est','umst')) LIKE UPPER("%keyword%"). If you find an 'umst' in the keywords you don't need to replace it with anything but you still need to do the REPLACE('um est','umst') in the SQL query. Yucky, but at least you don't have to denormalize the database.

Maybe I dislike (2) less. It wouldn't be too hard to add the user alteration of Latin folds then.

raquelalegre commented 3 years ago

We've just had a chat about how to proceed.

@tim-band is going to try to implement a simple solution so we can test its performance. He'll add to the search string that gets sent to Postgres all the possible replacements given by a shorter version of the list of substitutions (e.g. u<->v, y<->i, adr<->arr, adt<->att). We'll then see how it performs and if we need to do any database tuning. There might be some ideas in the blog post linked in #8. It might also be possible to explore how to implement #64.

I'll look into using ElasticSearch in a way that doesn't have a big impact in the current functioning of the database, but can still be used for the search, in case what Tim tests has poor performance. We might get away with a mirror of the postgres DB that we only use for the search functionality, which gets refreshed with a cronjob that runs a few times a day. That might make it still usable and performant, without us having to review all the data input and edition functionality to make sure we keep the ElasticSearch mirror up to date.

We still need to explore how this is going to work beyond the end of our contribution to the project, as it needs to be a solution with near 0 maintenance costs.

raquelalegre commented 3 years ago

@apitta0188 we have a question for you. From the list of substitutions above, the words to the right of the : are they examples or the actual list of substitutions you want? For example, if you just search for u do we need to return everything that contains both a u and a v, or should we focus specifically on finding texts containing uua or uva when you enter uua (and same with ouis/ovis, auus/avus)? Further example: if someone enters uarro on the text search box, do we also return texts containing varro?