add typeahead endpoint api

danieleguido commented 7 years ago

A very simple endpoint could be /api/services/suggest?q=, which would use whoosh NGRAMWORDS. The results:

{
  "query": "blibli",
  "results": [
    "bliblibli"
  ]
}

uf0 commented 7 years ago

Should we add a parameter to specify which part of the collection the ngrams are created from?

E.g., on the 'map' page the suggestions should refer only to documents that have coordinates.

/api/services/suggest?q=blibli&filter={'data__coordinates__isnull':False}
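The filter value above is written as a Python literal rather than JSON, so the endpoint would need to parse it without executing anything. A minimal sketch of one way to do that, assuming a whitelist of permitted lookups (parse_filter and ALLOWED_LOOKUPS are hypothetical names, not part of the project):

```python
import ast

# Hypothetical whitelist: only these lookups may be passed on to the ORM.
ALLOWED_LOOKUPS = {"data__coordinates__isnull"}


def parse_filter(raw):
    """Turn the raw `filter` query-string value into a dict of safe lookups."""
    if not raw:
        return {}
    try:
        # literal_eval only accepts Python literals, so no code is executed.
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        raise ValueError("filter is not a valid literal")
    if not isinstance(parsed, dict):
        raise ValueError("filter must be a dict")
    unknown = set(parsed) - ALLOWED_LOOKUPS
    if unknown:
        names = ", ".join(sorted(map(str, unknown)))
        raise ValueError("unsupported lookups: %s" % names)
    return parsed
```

The resulting dict could then be unpacked into a queryset filter; rejecting unknown keys keeps arbitrary field lookups out of the query.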

danieleguido commented 7 years ago

It seems that we can't use whoosh for this specific purpose. I tried this solution on stackoverflow, and the suggestion mechanism works on single words only. E.g., with the schema Schema(typeahead = TEXT(spelling=True, stored=True, phrase=False)) and the stored TEXT value "par M. Sergent : [photographie de presse] / Agence Meurisse, 1926", typing "photographie de p" gives:

[u'photographie']
[u'de', u'des']
[u'de', u'par']

Apparently we can use postgresql, once it's correctly configured 👍 cf. http://rachbelaid.com/postgres-full-text-search-is-good-enough/#1 I'd also suggest replacing the whoosh search engine with postgresql for this specific goal, using trigrams: https://www.sitepoint.com/awesome-autocomplete-trigram-search-in-rails-and-postgresql/ where the author says:

One way of improving the speed of the search is by having a separate column that will hold all the trigram sequences of the title column. Then, we can perform the search against the pre-populated column. Or we can make use of ts_vector, but that would become useless with fuzzy words.

More on word and trigrams for postgres: https://www.postgresql.org/docs/9.1/static/pgtrgm.html and https://stackoverflow.com/questions/10622021/suggest-like-google-with-postgresql-trigrams-and-full-text-search
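To make the trigram idea concrete, here is a rough Python sketch of how pg_trgm-style matching scores two strings: each word is lowercased, padded with two leading spaces and one trailing space, split into 3-character sequences, and similarity is the number of shared trigrams divided by the number of distinct trigrams in both. This only approximates the extension for illustration; the function names are mine.

```python
import re


def trigrams(text):
    """Return the set of trigrams of `text`, padded the way pg_trgm pads words."""
    grams = set()
    # Anything that is not a letter or digit separates words.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

For example, trigrams("cat") yields the four trigrams "  c", " ca", "cat" and "at ", and a misspelling such as "photografie" still shares most of its trigrams with "photographie", which is what makes the approach tolerant of fuzzy input.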

danieleguido commented 7 years ago

I suggest we fill a proper table containing all the possible (and valid) trigram combinations, as in https://www.sitepoint.com/awesome-autocomplete-trigram-search-in-rails-and-postgresql/ What do we do about multilanguage support? Should we use config=simple?

Of course this should match the words stored in search_vector.

bianchimro commented 7 years ago

is the language known at search time? or should the engine return results in all available languages?

danieleguido commented 7 years ago

@bianchimro I had the same doubt, since the language is known; but in the end the best option would be to return results regardless of language, so that things stay simple and consistent with what you get now on the document and story search endpoints. Moreover it will be a lot easier for "multi-language people" ;)

How do we deal with @uf0's suggestion to filter the typeahead? If we use the document.search_vector field, we get this out of the box; on the other hand, if we use a trigram table, we can somehow point each trigram to its related document. This way we apply the doc filters and limit the list of trigrams to the related ones, and the trigram table stays simple... Any ideas?
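A rough sketch of the second option, with each ngram row pointing at its document so that document filters can be applied through a join. It uses in-memory SQLite as a stand-in for PostgreSQL, and the table names, the has_coordinates column and the sample rows are all invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document (id INTEGER PRIMARY KEY, has_coordinates INTEGER);
    CREATE TABLE ngram (segment TEXT, document_id INTEGER REFERENCES document(id));
    INSERT INTO document VALUES (1, 1), (2, 0);
    INSERT INTO ngram VALUES
        ('social affairs', 1),
        ('social affairs', 2),
        ('social rules', 2);
""")


def suggest(prefix, only_with_coordinates=False, limit=20):
    """Return distinct ngrams starting with `prefix`, optionally filtered by document."""
    # Escape LIKE wildcards so user input is matched literally.
    escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    sql = (
        "SELECT DISTINCT n.segment FROM ngram n "
        "JOIN document d ON d.id = n.document_id "
        "WHERE n.segment LIKE ? ESCAPE '\\' "
    )
    params = [escaped + "%"]
    if only_with_coordinates:
        # The document filter narrows which ngrams are eligible.
        sql += "AND d.has_coordinates = 1 "
    sql += "ORDER BY n.segment LIMIT ?"
    params.append(limit)
    return [row[0] for row in conn.execute(sql, params)]
```

The ngram table itself stays a plain (segment, document) pair; all filtering logic lives on the document side of the join.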

danieleguido commented 7 years ago

Hi @bianchimro, I've added a Ngrams model in the feature/typeahead branch. In the Ngram model you can find a very basic ngram tokenizer. Run the migration as SUPERUSER, since I've added the GIN index extension to the ngrams table. To fill the ngram table (to test, use one document), run:

python manage.py task update_ngrams_table --model=document --pk=547

This adds bigrams and trigrams from the indexable fields of the document model to the ngrams table.
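For readers who have not opened the branch, a word-level bigram/trigram tokenizer can be as small as the sketch below. This is not the code from the Ngram model, only an illustration of the idea; word_ngrams is a hypothetical name.

```python
import re


def word_ngrams(text, sizes=(2, 3)):
    """Return word-level n-grams of `text` for each size in `sizes`."""
    # \w matches Unicode letters in Python 3, so accented words stay whole.
    words = re.findall(r"\w+", text)
    grams = []
    for n in sizes:
        # Slide a window of n words across the text; too-short texts yield nothing.
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams
```

Calling word_ngrams("and social issues") gives the two bigrams "and social" and "social issues", followed by the single trigram "and social issues".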

Among the latest updates, I've added the endpoint for documents:

GET /api/document/suggest/?q=Affaires%20social

{ 
  "cached": false,
  "results": [
    "Social Affairs",
    "Social Affairs and",
    "Affaires",
    "fairer social",
    "Affaires trangres :",
    "social rules",
    "and social issues"
  ]
}

For the moment, up to 20 results are returned, and the ngrams table contains bigrams and trigrams.

Check http://shisaa.jp/postset/postgresql-full-text-search-part-2.html for more info

danieleguido commented 7 years ago

Closed, ref to master branch.

C2DH / miller

add typeahead endpoint api #4