deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Sparse retriever lemmatizer #1475

Closed flozi00 closed 7 months ago

flozi00 commented 3 years ago

Is your feature request related to a problem? Please describe.
More semantic-like search and better performance using the sparse retriever.

Describe the solution you'd like
The spaCy lemmatizer is available for multiple languages and returns good results most of the time. So each document could also be stored in its base form. German example:
Text: Ich gehe jeden zweiten Tag Fussball spielen
Base: Ich gehen jeden zwei Tag Fussball spielen

With a query like:
Original: wann gehst du Fussball spielen ?
Base: wann gehen Ich Fussball spielen

The lemmatized version would get a higher score. At the same time, I'd like to reopen the idea of a query expander.
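
A minimal sketch of the scoring effect described above, under loud assumptions: the lemma table is a hypothetical lookup copied by hand from the German example in this issue (a real setup would call a lemmatizer such as spaCy), and the score is a plain count of shared tokens, not BM25 or any Haystack retriever.

```python
# Hypothetical lemma table, hand-copied from the example in this issue.
# Lowercased for matching; the issue maps "du" to "Ich".
LEMMAS = {
    "gehe": "gehen",
    "gehst": "gehen",
    "zweiten": "zwei",
    "du": "ich",
}


def tokenize(text):
    """Lowercase, split on whitespace, and drop tokens that are only punctuation."""
    return [t for t in text.lower().split() if any(c.isalnum() for c in t)]


def lemmatize(tokens):
    """Map each token to its base form; tokens missing from the table pass through."""
    return [LEMMAS.get(t, t) for t in tokens]


def overlap(query_tokens, doc_tokens):
    """Count query tokens that also occur in the document (crude stand-in for a sparse score)."""
    doc_vocab = set(doc_tokens)
    return sum(1 for t in query_tokens if t in doc_vocab)


doc = "Ich gehe jeden zweiten Tag Fussball spielen"
query = "wann gehst du Fussball spielen ?"

raw_score = overlap(tokenize(query), tokenize(doc))
lemma_score = overlap(lemmatize(tokenize(query)), lemmatize(tokenize(doc)))

print("raw overlap:", raw_score)
print("lemmatized overlap:", lemma_score)
```

With this toy table, the raw query shares only "fussball" and "spielen" with the document, while the lemmatized query also matches on "gehen" and "ich".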

What do you think ?

Describe alternatives you've considered
I don't know a good alternative for it.


bogdankostic commented 3 years ago

Hey @flozi00! This seems to be an interesting feature. An alternative to this could be to make use of Elasticsearch's stemming. However, it seems that Elasticsearch's stemming does not always produce the same stem for words with the same root (see for example here).

@tholor What do you think? Would this be something that we see as part of haystack?

tholor commented 3 years ago

Interesting idea, but I agree with @bogdankostic that leveraging elastic's existing components (e.g. stemmer, synonyms, analyzer ...) will probably be more scalable and meaningful. This has the advantage that everything happens on the index side and we don't need to duplicate the documents in an index (the "original" and the "lemmatized" one).

However, I see quite some potential to improve the handling of these elastic options in Haystack. There could be options to automatically generate lists of synonyms (see also https://github.com/deepset-ai/haystack/issues/841), configure the stemmer, create lists of questions that can be answered from a doc, or generate a list of "keywords" for a doc ...
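
As a rough illustration of the index-side approach, here is a sketch of Elasticsearch index settings that apply a German stemmer at analysis time, so only one copy of each document is stored. The analyzer name `german_stemmed`, the filter name `german_stemmer`, and the field name `content` are assumptions chosen for this example, and how (or whether) Haystack would pass such a body to Elasticsearch is not shown here.

```python
import json

# Index body for Elasticsearch. The "stemmer" token filter and the built-in
# "lowercase" and "german_normalization" filters are standard Elasticsearch
# analysis components; every name chosen below is hypothetical.
index_config = {
    "settings": {
        "analysis": {
            "filter": {
                # Hypothetical filter name; "light_german" is one of the
                # German variants offered by the stemmer token filter.
                "german_stemmer": {"type": "stemmer", "language": "light_german"},
            },
            "analyzer": {
                # Hypothetical analyzer name.
                "german_stemmed": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "german_normalization", "german_stemmer"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            # Assumed field name for the document text.
            "content": {"type": "text", "analyzer": "german_stemmed"},
        }
    },
}

# Serialize to the JSON body that would accompany an index-creation request.
index_body = json.dumps(index_config, indent=2)
print(index_body)
```

Because the analyzer runs both at index time and at query time for that field, queries and documents are reduced to the same stems without a second "lemmatized" copy of the text.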

nickchomey commented 2 years ago

For whatever it is worth, I also think it would be very useful to be able to incorporate spaCy into Haystack pipelines, particularly for lemmatization. It is my understanding that stemming/lemmatization is undesirable for full semantic/transformer capabilities, but in the event that someone wants to do just keyword searching, lemmatization is vastly superior to stemming.

Also, spaCy seems to have a very similar ethos/focus to Haystack: consolidating state-of-the-art techniques and tools into one package that is accessible to practitioners. Beyond top-notch NLP capabilities, they offer immense multilingual support and, since v3.0, also have an entire transformer mechanism that integrates with Hugging Face models. So, there really must be a lot of overlap/synergy with Haystack, and surely it could be added in some meaningful way into your stack!

Also, it is largely written in Cython, which Haystack doesn't appear to use (but I could be wrong), and that makes it immensely more performant.