Closed uf0 closed 7 years ago
Should we add a parameter to specify which part of the collection the ngrams are created from?
E.g. on the 'map' page, the suggestions should refer only to the documents that have coordinates.
/api/services/suggest?q=blibli&filter={'data__coordinates__isnull':False}
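A minimal sketch of how such a request could be built and decoded, assuming the filter travels as a URL-encoded JSON object. The endpoint and the filter parameter are only the proposal above; build_suggest_url and parse_suggest_filter are hypothetical helper names, not existing code.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def build_suggest_url(query, filters=None, base="/api/services/suggest"):
    """Build the suggest URL; `filters` is a dict of Django-style lookups."""
    params = {"q": query}
    if filters:
        # JSON keeps booleans unambiguous (Python False becomes JSON false).
        params["filter"] = json.dumps(filters)
    return base + "?" + urlencode(params)


def parse_suggest_filter(url):
    """Hypothetical server-side counterpart: recover the lookups dict."""
    raw = parse_qs(urlsplit(url).query).get("filter", ["{}"])[0]
    return json.loads(raw)


url = build_suggest_url("blibli", {"data__coordinates__isnull": False})
print(url)
print(parse_suggest_filter(url))
```

Sending JSON rather than a Python dict literal means the server can decode the filter with a standard parser instead of evaluating the string.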
It seems that we can't use whoosh for this specific purpose.
I tried this solution on Stack Overflow, and the suggestion mechanism works on single words only. E.g. with this schema: Schema(typeahead = TEXT(spelling=True, stored=True, phrase=False))
and with this stored TEXT value: par M. Sergent : [photographie de presse] / Agence Meurisse, 1926
then, by typing: photographie de p
we obtain one list of suggestions per word, rather than suggestions for the whole phrase:
[u'photographie']
[u'de', u'des']
[u'de', u'par']
Apparently we can use PostgreSQL, once it's correctly configured 👍 cf. http://rachbelaid.com/postgres-full-text-search-is-good-enough/#1 I'd also suggest replacing the whoosh search engine with PostgreSQL for this specific goal, using trigrams: https://www.sitepoint.com/awesome-autocomplete-trigram-search-in-rails-and-postgresql/ where the author says:
One way of improving the speed of the search is by having a separate column that will hold all the trigram sequences of the title column. Then, we can perform the search against the pre-populated column. Or we can make use of ts_vector, but that would become useless with fuzzy words.
More on word and trigrams for postgres: https://www.postgresql.org/docs/9.1/static/pgtrgm.html and https://stackoverflow.com/questions/10622021/suggest-like-google-with-postgresql-trigrams-and-full-text-search
I suggest we fill a proper table containing all the possible (and valid) trigram combinations, like https://www.sitepoint.com/awesome-autocomplete-trigram-search-in-rails-and-postgresql/
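For anyone not familiar with trigram matching, here is a rough pure-Python sketch of the idea behind pg_trgm: split each word into overlapping three-character sequences, then score two strings by how many trigrams they share. The padding and the shared/union score follow the pg_trgm documentation as far as I understand it; the function names are made up for illustration and this is not what PostgreSQL actually runs.

```python
import re


def trigrams(text):
    """Character trigrams per word, padded roughly the way pg_trgm does
    (two spaces before each word, one after)."""
    grams = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# A misspelt query still scores well against the right word.
print(similarity("photografie", "photographie"))
print(similarity("presse", "photographie"))
```

This is why trigrams tolerate typos and partial words, which plain full-text vectors do not.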
What to do about multiple languages? Should we use config=simple?
Of course, this should match the words stored in search_vector.
Is the language known at search time? Or should the engine return results in all available languages?
@bianchimro I had the same doubt, since the language is known; but in the end the best option would be to get results no matter the language, so that things stay simple and consistent with what you get now on the document and story search endpoints. Moreover, it will be a lot easier for "multi-language people" ;)
How do we deal with @uf0's suggestion to filter the typeahead? If we use the document.search_vector field, we get this out of the box; on the other hand, if we use a trigram table, we can somehow point to the related document. This way we apply the doc filters and limit the list of trigrams to the related ones, and the trigram table stays simple... Any ideas?
Hi @bianchimro, I've added an Ngrams model in the feature/typeahead branch. In the Ngram model you can find a very basic ngram tokenizer. Run the migration as SUPERUSER, as I've added the GIN index extension to the ngrams table.
Fill the ngram table (to test, use one document) with:
python manage.py task update_ngrams_table --model=document --pk=547
This adds bigrams and trigrams to the ngrams table from the indexable fields in the document model.
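To give an idea of what ends up in the table, here is a minimal sketch of a word-level bigram/trigram tokenizer. It is an illustration only, assuming plain whitespace/punctuation splitting; word_ngrams is a hypothetical name and the tokenizer in the branch may well differ.

```python
import re


def word_ngrams(text, sizes=(2, 3)):
    """Return all word n-grams of the given sizes, in reading order."""
    words = re.findall(r"\w+", text)
    ngrams = []
    for n in sizes:
        # Slide a window of n words over the text.
        for i in range(len(words) - n + 1):
            ngrams.append(" ".join(words[i:i + n]))
    return ngrams


for ngram in word_ngrams("Agence Meurisse, 1926"):
    print(ngram)
```

Each returned string would become one candidate suggestion row for the document it came from.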
Among the latest updates, I've added the endpoint for documents:
GET /api/document/suggest/?q=Affaires%20social
{
  "cached": false,
  "results": [
    "Social Affairs",
    "Social Affairs and",
    "Affaires",
    "fairer social",
    "Affaires trangres :",
    "social rules",
    "and social issues"
  ]
}
For the moment, up to 20 results are returned, and the ngrams table contains bi-grams and tri-grams.
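As a rough illustration of "fuzzy match against stored n-grams, capped at 20", here is a sketch that uses Python's difflib as a stand-in for the database ranking. This is not how the endpoint is implemented (that goes through PostgreSQL); suggest, MAX_RESULTS and the cutoff value are assumptions made for the example.

```python
from difflib import get_close_matches

MAX_RESULTS = 20  # the cap mentioned above


def suggest(query, ngrams, limit=MAX_RESULTS, cutoff=0.3):
    """Return up to `limit` stored n-grams that resemble `query`, best first."""
    # Compare case-insensitively but return the stored spelling.
    by_lower = {}
    for ngram in ngrams:
        by_lower.setdefault(ngram.lower(), ngram)
    matches = get_close_matches(query.lower(), list(by_lower), n=limit, cutoff=cutoff)
    return [by_lower[m] for m in matches]


stored = ["Social Affairs", "Affaires sociales", "social rules", "and social issues"]
print(suggest("Affaires social", stored))
```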
Check http://shisaa.jp/postset/postgresql-full-text-search-part-2.html for more info
Closed, ref to master branch.
A very simple endpoint could be:
/api/services/suggest?q=
and it will use whoosh NGRAMWORDS. The results: