deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Introduce QueryClassifier #611

Closed: tholor closed this issue 3 years ago

tholor commented 3 years ago

Is your feature request related to a problem? Please describe. With the new flexible Pipelines introduced in https://github.com/deepset-ai/haystack/pull/596, we can build far more flexible and complex search routes. One common challenge that we saw in deployments: we need to distinguish between real questions and keyword queries that come in. We only want to route questions to the Reader branch in order to maximize the accuracy of results and minimize computation effort/costs.

Describe the solution you'd like

New class QueryClassifier that takes a query as input and determines if it is a question or a keyword query. We could start with a very basic version (maybe even rule-based) here and later extend it to use a classification model.
The run method would need to return (query, "output_1") for a question and (query, "output_2") for a keyword query in order to allow branching in the DAG.

Describe alternatives you've considered Later it might also make sense to distinguish between more types (e.g. a full sentence that is not a question)

Additional context We could use it like this in a pipeline (see the attached pipeline diagram)

stefanondisponibile commented 3 years ago

I like the idea, @tholor. I'd be willing to help.

tholor commented 3 years ago

Great! Do you want to create a first draft PR @stefanondisponibile ? Happy to guide you then and give early feedback.

The core will of course be the classification of the string into "question" or "keyword", either via a small rule set or alternatively via a small, lightweight ML model.

The requirements for running this class as a node in the Pipeline are then relatively straightforward:
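As a minimal sketch (based on the Pipeline node contract shown later in this thread, not code from this comment): such a node essentially needs an outgoing_edges class attribute and a run() method that returns the data together with the name of the outgoing edge.

class QueryClassifier:
    outgoing_edges = 2

    def run(self, **kwargs):
        # placeholder rule; a real implementation would classify the query properly
        if kwargs["query"].strip().endswith("?"):
            return kwargs, "output_1"  # question branch -> Reader
        return kwargs, "output_2"      # keyword branch -> Retriever only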

Just let me know if you need any help to get started!

stefanondisponibile commented 3 years ago

Thanks for the hint @tholor , I guess I'll do a little analysis and come back with some ideas, or directly a draft PR.

Threepointone4 commented 3 years ago

@tholor I have implemented a simple method based on my understanding; let me know whether this is a proper approach. If yes, I will go ahead, create a branch, and push the changes.

(screenshot of the proposed QueryClassifier implementation)

question_check can be an ML-based or rule-based function which checks whether the query is a question or a keyword query. Right now I have done this using some rules.
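A hypothetical sketch of such a rule-based question_check (the actual code from the screenshot above is not visible here, so the question words and logic are illustrative assumptions only):

QUESTION_WORDS = ("who", "what", "when", "where", "why", "which", "how",
                  "is", "are", "do", "does", "can", "should")

def question_check(query: str) -> bool:
    # treat the query as a question if it ends with "?" or starts with a typical question word
    tokens = query.strip().lower().split()
    if not tokens:
        return False
    return query.strip().endswith("?") or tokens[0] in QUESTION_WORDS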

Thanks

tholor commented 3 years ago

Hey @Threepointone4, this looks good! You also need a class attribute to define the number of outgoing edges (two in this case):

class QueryClassifier():
    outgoing_edges = 2

    def ...

The interesting part will be question_check(), I guess. It would be great if you could create an early draft PR. Maybe you can then collaborate with @stefanondisponibile on finalizing it together. I am sure merging your two approaches will yield something great!

I will of course give feedback and support you along the way.

Threepointone4 commented 3 years ago

@tholor Yes, the class attribute is there, it's just not showing in the screenshot. I am testing some flows and will create a draft PR soon and share it ASAP.

stefanondisponibile commented 3 years ago

Good job @Threepointone4. Have you implemented the classifier itself yet? I think that should also be something that any user can pass or tune on their own. The node that receives an input and routes to one output node or another should be independent of the specific needs of a query classifier (I may want to go to different nodes depending on some other logic). What do you think, @tholor?

lalitpagaria commented 3 years ago

Sorry for asking a question a bit late.

I am just curious about the use of a classification model for query identification. Isn't it a bit heavy to determine whether a query is a keyword query or a question with a model? I might not be getting the bigger picture.

For simple purposes I would rather go by how Elasticsearch solves it. We could have a simple DSL or borrow concepts from Lucene's DSL, as follows:

It would also support different types of filters (range, bool, in, not_in, greater, lower, matched, not_matched, etc.). Ultimately the DSL parser would decide which part of the query goes where and then join the results from these parts.

Again, apologies if my understanding is wrong here.

tholor commented 3 years ago

The use case is straightforward: imagine you have a search bar somewhere on a website. Naturally, some people will type keyword queries (e.g. "address deepset"), others will ask natural questions ("What is the address of deepset?"). In the case of keyword queries we just want to call the retriever (as the reader would be a waste of resources); in the case of natural questions we want to go the full mile and include the reader / generator. With the above DSL, you would need two different search bars or something like a checkbox in the UI to determine what type of query to trigger. That is not feasible from a UX perspective in most cases. That's why we want a very simple, lightweight query classifier (no BERT).

Besides that use case, we have plans to determine in more detail what type of question it is (e.g. one directed at text documents, tables, or a SQL DB) and route it accordingly in our DAG.

Does this make sense for you?

lalitpagaria commented 3 years ago

With the above DSL, you would need two different search bars or something like a checkbox in the UI to determine what type of query to trigger. That is not feasible from a UX perspective in most cases.

I was suggesting to use a single search bar but with keyword support, like searching in Kibana.

Naturally, some people will do keyword queries (e.g. "address deepset"), others will ask natural questions ("What is the address of deepset?").

What if the user wants to use the reader / generator but types queries in non-natural-question ways?

Yes, I understand the need for a QueryClassifier, but it should have the capability to override the default behaviour, i.e. the user should be able to call the reader/generator with non-natural queries and call only the retriever for natural-question-type queries.

tholor commented 3 years ago

I was suggesting to use a single search bar but with keyword support, like searching in Kibana.

How would you separate a keyword query from a natural language one then?

What if the user wants to use the reader / generator but types queries in non-natural-question ways?

I think we have a misunderstanding here. The QueryClassifier will just be an optional node that you can put in your pipeline DAG. Every user is of course still free (and encouraged!) to build their own custom pipeline. The standard ExtractiveQAPipeline won't include this node and I personally believe that we will see many variants of QueryClassifier nodes and people will start to implement their own depending on their "categories of queries".

lalitpagaria commented 3 years ago

How would you separate a keyword query from a natural language one then?

We could have reserved keywords (keyword, question, generate, summarise, classifier, etc.) which are then understood by a QueryParser. The UI would pass the search string to the parser, which would also have filter support like Kibana. By default the parser would treat all queries as keyword queries.

Ultimately this would create one DSL (Lucene's could be tweaked for this), which the QueryParser would understand, parse, and execute. Generator, Summariser, Reader, Classifier, etc. would be treated as different optional components of Haystack, and in production a user/company could add or remove them at runtime based on need via the new Pipeline APIs (similar to Elasticsearch plugins/add-ons).

Anyway, I understand this might not be aligned with the current goals. But yes, I do understand the need for a QueryClassifier.

stefanondisponibile commented 3 years ago

hello @lalitpagaria and thanks for your support on the issue. I don't completely understand the use case of your last comment, but I think what you're trying to say is what you actually said: "I was suggesting to use a single search bar but with keyword support, like searching in Kibana." / "What if the user wants to use the reader / generator but types queries in non-natural-question ways?". This last one is quite interesting by the way, and a nice feature for the classifier.

However, to understand why we feel the need for a QueryClassifier, imagine the following situation:

Our index may look something like this:

[
  {
    "id": 1,
    "text": "Haystack is an end-to-end framework for Question Answering & Neural search that enables you to do a lot of fancy things."
  },
  {
    "id": 2,
    "text": "Haystack is cool."
  }
]

Now, let's discuss the problem before rushing to the obvious solution. And of course, let's assume we don't have a QueryClassifier yet. Let's start by pretending we don't even have Haystack at all, just a .json file in our filesystem.

Some user reaches our search bar and types in the following query:

What can I do with Haystack?

Alright, we have transformers or some model to extract the answer to the query. So we load it and run the query against each of the contexts in our index:

[
{
    "id": 1,
    "context": "Haystack is an end-to-end framework for Question Answering & Neural search that enables you to do a lot of fancy things.",
    "question": "What can I do with Haystack?" ,
    "answer": "do a lot of fancy things",
    "accuracy": 0.548
  },
{
    "id": 2,
    "context": "Haystack is cool.",
    "question": "What can I do with Haystack?" ,
    "answer": "cool",
    "accuracy": 0.326
  }
]

So we're happy, we keep all the records with a satisfying accuracy (let's assume 50% here), order them accordingly, and give back the best one to the user:

What can I do with Haystack? => do a lot of fancy things


However, the user types in another query:

Haystack

Alright, let's run this weird query against our docs:

[
{
    "id": 1,
    "context": "Haystack is an end-to-end framework for Question Answering & Neural search that enables you to do a lot of fancy things.",
    "question": "Haystack" ,
    "answer": "Question Answering & Neural search",
    "accuracy": 0.326
  },
{
    "id": 2,
    "context": "Haystack is cool.",
    "question": "What can I do with Haystack?" ,
    "answer": "cool",
    "accuracy": 0.319
  }
]

Well, 🤔 I don't know, what should I do with this? If I had to judge, I feel that the document with id: 2 would be more relevant here, because it's shorter and the "Haystack" token has a major impact on it (i.e., to me the second document should have a bigger score).

Haystack => ????


So, we have 2 problems at least:

  1. We can't scale: we get slower and slower each time we index a new document, because we have to pass all the documents to our model (i.e. our Reader).
  2. The user is not requesting something in a semantic fashion, so we're wasting computation (and time) trying to figure out how to extract an answer from our context, using something that's not even a question. Odd, isn't it? 🤯

Elasticsearch to the rescue. 🦸

We're having a coffee with our colleague, complaining about our scaling problem, and he comes up with a nice solution.

Why not put some search filter on top of your current flow, so that each time you pass just a few docs to your Reader?

Elasticsearch sounds like the perfect solution for that! We throw away our json database and index everything inside Elasticsearch. This also lets us separate two concerns that were living in our model before: searching vs. extracting.

So the new plan is to use a Retriever to gather some documents related to the user query (and rely just on that for our search) and then pass those results to a Reader that will extract the proper answer from that subset of our index.

Cool! We solved problem 1, and our structure is more robust and lean now! Yet problem 2 is still quite an issue 🤔 I mean, yeah, we can do something at the end of our flow, like having a look at our results and, if all the Reader scores are bad, maybe just returning the document with the best Elasticsearch score, but again... odd 🤷‍♂️

So we take another coffee with our colleague, and he goes like:

Well, in those cases, just skip the Reader.

So that's not a bad idea, pretty obvious even: we could actually "skip the Reader", but our problem is identifying "those cases".

So @tholor opens his browser, goes to GitHub and says:

"Hey, we need a QueryClassifier to identify those cases in which the search style could (optionally) be influenced by the fact that the user is (or is not) expressing semantically. This may influence the search Pipeline in many ways, surely by performing dynamically the answer extraction done by the Reader."

Some issues around this feature:

  1. The classifier itself is the core of this issue. (let's forget about Pipelines for a second)
  2. The classifier must be fast and not computationally intensive. (that's probably the main problem to face)
  3. The classifier should be multilingual.

I've been trying to create a dataset by taking the "Quora Question Pairs" dataset and using spaCy to craft a "keyword" version of each question. To give you an example from it:

What is Morse code? 1
Morse code  0
What is the best backend for my app?    1
best backend app    0

Not bad as a start, but I still haven't come to a lean solution. Moreover, unless some tricks are applied to the classifier, this wouldn't be multilingual.
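To make the keyword-crafting step above concrete, here is a rough sketch of how such "keyword" versions could be derived with spaCy. This is an illustration only, not the exact script used for the dataset; it assumes the en_core_web_sm model is installed and the real filtering rules may differ.

import spacy

nlp = spacy.load("en_core_web_sm")

def keywordize(question: str) -> str:
    doc = nlp(question)
    # keep content-bearing tokens, drop stopwords and punctuation
    keywords = [t.text.lower() for t in doc if not t.is_stop and not t.is_punct]
    return " ".join(keywords)

print(keywordize("What is the best backend for my app?"))  # expected: something like "best backend app"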

Back to Pipelines: Binary/Dynamic Node.

Whatever you may call it, even if we had a classifier, there's another feature we would need: a Pipeline node to handle the result of the ~~classifier~~ callable and use it to point the Pipeline in the right direction of the graph. I think we could already submit a PR for this; @Threepointone4 kinda proposed it already, so along the lines of that idea, here's a sketch:

from typing import Callable

class BinaryNode:
    def __init__(self, node_a: str, node_b: str, evaluate: Callable):
        self.node_a = node_a
        self.node_b = node_b
        self.evaluate = evaluate
        self.outgoing_edges = 2
        ...

    def run(self, **kwargs):
        a = self.evaluate(**kwargs)  # the injected evaluate function returns a bool
        return kwargs.get("query"), self.node_a if a else self.node_b

This opens some ideas about having the "evaluate" function itself define the next node id, but that approach would have some drawbacks to consider. However, the point is that the evaluating function (or classifier) is injected, and that's it. If you want to skip the reader (or whatever else) based on the time of day, the Star Wars API, or whatever function you desire, you can handle that yourself; a hypothetical usage is sketched below. That could also work as a middle-ground solution until a proper QueryClassifier is implemented. When it is, it will just inherit from BinaryNode as a specialized version of it.
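For instance, a hypothetical usage of the sketch above (the node names and the lambda are illustrative only):

looks_like_question = lambda **kwargs: kwargs["query"].strip().endswith("?")
node = BinaryNode(node_a="QAReader", node_b="ESRetriever", evaluate=looks_like_question)

# routes to "QAReader" because the query ends with a question mark
print(node.run(query="What can I do with Haystack?"))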


What do you think?

lalitpagaria commented 3 years ago

@stefanondisponibile thank you for the detailed analysis. I agree with you. 👍

You have brought up many interesting points.

tholor commented 3 years ago

@stefanondisponibile Awesome analysis / explanation! Very nice!

A few comments:

  1. The classifier itself is the core of this issue. (let's forget about Pipelines for a second)
  2. The classifier must be fast and not computationally intensive. (that's probably the main problem to face)
  3. The classifier should be multilingual.

I totally agree with all of these points :+1:

I've been trying to create a dataset taking the "Quora Question Pairs" and using Spacy to craft a "keyword" version

I like the direction of this as it could be easily extended to other QA datasets out there. Another potential source for your training data might be information retrieval datasets containing (mostly) keyword queries.

Whatever you may call it, even if we had a classifier, there's another feature we would need: a Pipeline node to handle the result of the ~~classifier~~ callable and use it to point the Pipeline in the right direction of the graph. I think we could already submit a PR for this, [...]

Maybe I am missing something here, but we already have such functionality in Haystack. @tanaysoni and I already discussed the initial design quite intensively, as there are a lot of options to implement such "branching" in a DAG. What we settled on for now is somewhat similar to what you sketched above, maybe a bit simpler though. This node would already work as of today in the Pipeline class:

    class QueryClassifier():
        outgoing_edges = 2

        def run(self, **kwargs):
            if "?" in kwargs["query"]:
                return (kwargs, "output_1")
            else:
                return (kwargs, "output_2")

    pipe = Pipeline()
    pipe.add_node(component=QueryClassifier(), name="QueryClassifier", inputs=["Query"])
    pipe.add_node(component=es_retriever, name="ESRetriever", inputs=["QueryClassifier.output_1"])
    pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_2"])
    pipe.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults",
                  inputs=["ESRetriever", "DPRRetriever"])
    pipe.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
    res = pipe.run(query="What did Einstein work on?", top_k_retriever=1)
    print(res)

(diagram of the resulting branching pipeline)

The key is really the classifier that would replace the naive if-condition from above.
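As an illustration only (not an existing Haystack class), a trained model could be plugged into the same node contract; the class name, model path, and label convention below are assumptions:

import pickle

class MLQueryClassifier:
    outgoing_edges = 2

    def __init__(self, model_path: str = "query_classifier.pkl"):
        # e.g. a pickled sklearn pipeline (vectorizer + classifier) trained on question-vs-keyword data
        with open(model_path, "rb") as f:
            self.model = pickle.load(f)

    def run(self, **kwargs):
        is_question = bool(self.model.predict([kwargs["query"]])[0])
        return (kwargs, "output_1" if is_question else "output_2")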

stefanondisponibile commented 3 years ago

Thank you @tholor, and thanks for the tip on those datasets. Do you have any particular link or resource I could check?


About your doubt: I'm probably overengineering with that BinaryNode thing, then. I'll maybe make a PR to discuss the issue directly there without polluting this thread :)

Hopefully I'll make a PR for the QueryClassifier too, as soon as I can find a solution that's lean enough.

tholor commented 3 years ago

do you have any particular link or resource I could check?

I did not have anything particular in mind, but maybe one of those is helpful:

In any case, I think it would be great to get a "first version" out there even if the accuracy is not perfect in the beginning. We can always iterate on the datasets (and potentially invest some labeling power from our side). The design of the implementation and the choice of model are currently more important, I believe. I don't know if you already had any thoughts here, but besides fast inference speed it would also be great not to introduce crazy new dependencies in Haystack. The usual suspects that I see: a lightweight NN (PyTorch) or some tree-based method (sklearn). My gut feeling is that a very simple model can already give us a pretty decent solution if it picks up on some standard words/patterns that indicate a question (What, where, How much, Is, Are ...).

Threepointone4 commented 3 years ago

Hi, sorry for the late response. I have implemented a simple solution as a first pass: a simple comparison that treats the query as a question if it starts with "wh" or certain other words.

(screenshot of the rule-based implementation)

The question_check can be extended to a model in a second pass. I am training a simple model and will share it soon.

@tholor what do you think ?

tholor commented 3 years ago

@Threepointone4 Yes, such a rule-based system could be a first, simple starting point. I can also see us having multiple classes in the end. Maybe one "RuleQueryClassifier" and one "ModelQueryClassifier", so people can choose.

Regarding your implementation: you are currently only returning a string in run(). You would need to make this a tuple of (query, output).

tholor commented 3 years ago

Hey @stefanondisponibile, just a quick follow-up on this topic: did you already do any experiments on a "lightweight" classifier or on datasets that could be useful for training? Happy to support you on this topic and push it forward together if you want ...

stefanondisponibile commented 3 years ago

hey @tholor, I've experimented a bit during the holidays, but I still don't have something ready. I could upload the dataset somewhere so others can test against it, if that could speed up different solutions 🙂

tholor commented 3 years ago

Yes, I think this would be very helpful :) Maybe you could also quickly describe what models you already tried and what the results were (so that we can avoid double work)...

tholor commented 3 years ago

@stefanondisponibile would you mind uploading the dataset and describing very roughly what you already tried? Thanks :)

stefanondisponibile commented 3 years ago

I'm very sorry for being late @tholor, but at the moment I don't have too much time to wrap my mind around this.

Dataset.

However, I've uploaded the dataset(s) here (I will update the meta as soon as possible, but it should be pretty straightforward: target is 1 for questions, 0 for keywords).

Tests.

I've been goofing around a bit with scikit-learn, and also tried a feed-forward neural network with one-hot encodings. Those didn't really satisfy me, but maybe someone could try a little more. After that I switched to an attention-based solution, which roughly is this:

DEBUG:question_keyword_classifier.model:TextVectorizationLayerAdapted.
INFO:__main__:Compiling the model.
INFO:__main__:Creating datasets.
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
text_vectorization (TextVect multiple                  0         
_________________________________________________________________
encoder (Encoder)            multiple                  1360584   
_________________________________________________________________
global_max_pooling1d (Global multiple                  0         
_________________________________________________________________
dense_6 (Dense)              multiple                  262656    
_________________________________________________________________
dropout_3 (Dropout)          multiple                  0         
_________________________________________________________________
layer_normalization_2 (Layer multiple                  1024      
_________________________________________________________________
dense_7 (Dense)              multiple                  513       
=================================================================
Total params: 1,624,777
Trainable params: 1,624,777
Non-trainable params: 0
_________________________________________________________________

The Encoder layer is composed of an Embedding layer, to which positional encodings are added, followed by a (possible stack of, though so far I've only tested with 1) MultiHeadAttention => FeedForward block, just like in the "classic" transformer architecture. This gives good performance (a BinaryAccuracy of roughly 0.98-0.99) on the test set, even if I've only tested it on subsets of the huge train/test/dev datasets.

Even if this could be too large, it's hard to avoid loading "something" into memory at runtime, be it this model or another one.

Perhaps a leaner, rule-based alternative (like the one by @Threepointone4, or maybe something leveraging spaCy NER/POS rules) could be tested against the data so that we could compare the accuracy of multiple solutions.

I'm planning to get back to this, but probably not for a few weeks, so again, excuse me for being a bit late. I hope what's done so far can help.

tholor commented 3 years ago

@stefanondisponibile Awesome! Thanks for the update :+1:

I can totally understand that you are limited on time and we really appreciate any help and contributions - also unfinished ones :)

We will do some experiments on our end and post an update here. If you have at some point more time and want to jump in again, just let us know.

shahrukhx01 commented 3 years ago

@tholor @stefanondisponibile I have trained a simple gradient-boosting classifier using scikit-learn on the dataset by @stefanondisponibile. I'm getting an F1 score of 0.99 on the same test set given on Kaggle, and the nice thing is that the model is roughly 420 KB in size. Please review and let me know: if I improve this model, can it be used as a baseline for query classification? Here's the link to the Kaggle Notebook.
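For readers without access to the notebook, a minimal sketch of what such a setup might look like (this is not the actual notebook; the file path, column names, and feature choices are assumptions):

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline as SkPipeline

# hypothetical CSV with a "text" column and a "target" column (1 = question, 0 = keywords)
df = pd.read_csv("question_vs_keyword.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["target"], test_size=0.2, random_state=42
)

clf = SkPipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), lowercase=True)),
    ("gb", GradientBoostingClassifier()),
])
clf.fit(X_train, y_train)
print(f1_score(y_test, clf.predict(X_test)))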

stefanondisponibile commented 3 years ago

hey @shahrukhx01 that sounds promising! I'll have a closer look at the notebook later, but good job! And let's hear @tholor's feedback :)

tholor commented 3 years ago

@shahrukhx01 super nice! I only had a quick look at the notebook, but it seems promising. I am wondering a bit what the model actually learns. As it's using n-gram features, it might mainly learn strong keywords like "what, how ..." and maybe the query length. However, I think even that could be a helpful first baseline. We can check how well it transfers to other datasets and domains. I think it would still be valuable to eventually have two options here and compare them: your gradient boosting approach + a small attention model like the one from @stefanondisponibile.

Would you be interested in raising a PR for your model? We would require some simple functionality to load the model from remote. We could host it in our S3 bucket if you want.

shahrukhx01 commented 3 years ago

@tholor Sure! I will raise the PR this coming weekend. Also, I can train a plain (Bi)LSTM, which would be more robust, and then we can compare results with this and @stefanondisponibile's model. Do we have a threshold for how much memory a model can take without consuming too many resources? Would a model of 5-10 MB be okay provided it gives significantly more robustness? Please let me know. Again, I will raise the PR this weekend with the current model.

tholor commented 3 years ago

Sure! I will raise the PR this coming weekend, also, I can also train a plain (Bi)LSTM, which would be more robust, and then we can compare results with this and @stefanondisponibile's model.

Awesome!

Do we have a threshold for how much memory can a model take without considerably taking too many resources. Would a model of size 5-10 MBs

I am more concerned about latency than memory. I think even a model of ~50-100 MB would still be okay if the latency is not slowing down the whole pipeline. A forward pass of a distilbert model takes roughly 0.05 seconds on a CPU (batch size 1, seq_len=256, see here) and I would probably consider this as an upper bound for an acceptable latency.

If you want to train a NN model - how about a very small transformer (e.g. TinyBERT or similar)? The advantage would be that the integration in Haystack would be pretty straightforward as most related components are already in place (model loading, tokenization, saving/loading, inference, (future) fine-tuning ...).
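To illustrate how lightweight that integration could be, here is a rough sketch using the Hugging Face pipeline API; the model name is a placeholder (not a published checkpoint) and the label names are assumptions:

from transformers import pipeline

# placeholder model name; a real checkpoint would be a small fine-tuned sequence classifier
classifier = pipeline("text-classification", model="some-org/tiny-bert-query-classifier")

print(classifier("What did Einstein work on?"))  # e.g. [{"label": "question", "score": ...}]
print(classifier("einstein birthplace"))         # e.g. [{"label": "keywords", "score": ...}]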

shahrukhx01 commented 3 years ago

Ah, I see. In that case, I will fine-tune a small transformer model, push it to the Hugging Face model hub, and subsequently create a separate PR for that once done.

shahrukhx01 commented 3 years ago

I have fine-tuned a mini-BERT classifier using the first 100K train samples and am getting a test accuracy of 0.997. However, when I test the model on random phrases (not taken from the dataset) it doesn't perform as intended, although if I take any random phrase from the test set it is classified correctly. I suspect there could be a data leak in the way the data is split, but I could be wrong. Please take a look at the model and notebook and let me know how we can proceed.

- Model on Huggingface
- Kaggle Notebook: Mini Bert for Query Classification
- Mini Bert Query classifier code base

shahrukhx01 commented 3 years ago

@tholor @stefanondisponibile since we were generating keyword samples by removing some tokens from questions and keeping the rest of the text the same, the ML models are learning that mapping rather than the actual goal we are trying to achieve. I have compiled another dataset curated with human knowledge: I combined the SPAADIA dataset, which provides ~81K declarative sentences, and the SQuAD dataset, which has 131K questions. Our existing model gives an accuracy of 64% (77% on questions and 17% on statements) on this new dataset, since the statements in the training Quora dataset were artificially generated. Model results: Kaggle Notebook. New dataset: Kaggle Dataset. Should I proceed with this dataset and retrain the above transformer? Your thoughts?

stefanondisponibile commented 3 years ago

good point @shahrukhx01. I guess with this dataset we could see it more like "is this a question?".

shahrukhx01 commented 3 years ago

Baseline model update: I have trained the same gradient-boosted model, this time on the newer data. Here are the scores:

- SPAADIA/SQuAD test set: 95% (Kaggle Notebook)
- Quora test set (this dataset is completely unseen by this model; it is the dataset Stefano created): 83% (Kaggle Notebook)

@tholor could you please take a look and comment? I think we now have a stable baseline model. Please share your thoughts when you get time.

stefanondisponibile commented 3 years ago

hi @shahrukhx01, I'd wait for @tholor's opinion on this, too. Just wanted to say: great job. And it's an interesting idea to use one dataset for training and a totally different one for testing.

tholor commented 3 years ago

Hey @shahrukhx01

Great work! From your description it sounds like your first transformer was somewhat overfitting on dataset 1 / the pattern that was used for creating the keyword samples in there. I will have a look at your second dataset and play around with the models on Monday. What version of the SPAADIA dataset did you use? Can you maybe just post 3-5 samples here or a direct download link? From your description it seems that dataset 2 has natural sentences / statements rather than keyword queries? This is also an interesting distinction, but the most common use case that I have seen for this classifier is to distinguish between natural sentences (questions + statements) and classic keyword queries. Such a distinction allows routing to a dense retriever vs. BM25.

I will come back to you with more detailed feedback once I had a chance to investigate models and dataset.

shahrukhx01 commented 3 years ago

@tholor thanks for your response. I used the SPAADIA dataset version 2; you can download the raw dataset here and the parsed dataset here. I was under the impression that we want to distinguish statements vs. questions. If the goal is to distinguish keyword queries vs. (statements + questions), the existing transformer model that I fine-tuned on Stefano's dataset does exactly that job. Please do have a look at it; if it satisfies the requirement, I can move ahead and start working on the pull request for that.

tholor commented 3 years ago

Yes, the main goal is keywords vs. questions/statements. However, the other case is also useful. I will have a quick look at the models on Monday and test them on a few random samples. If they all look promising we can add all of them, as there are other use cases for the statement classifier as well (e.g. to decide whether we run QA or not).

tholor commented 3 years ago

Hey @shahrukhx01, I had a look at the models and played around with some sample queries.

**Models on Dataset 1**

gboost errors:

"How much was revenue last year?", "debug kubernetes aws on in", "on in out to"

transformer errors:

"debug kubernetes aws on in", "on in out to"



=> It seems that they are both slightly overfitting on certain stopwords rather than on the grammatical correctness / structure of a sentence, i.e. if words like "on, it, with, for" are present they likely classify the query as a "question". However, I think this is okay for the first model version and it could be further improved in the future by adding such adversarial samples to the training data (just a bunch of random stopwords with the gold label "keywords").
=> So let's move forward and offer both models in Haystack

**Model from Dataset 2 (spaadia)**
If I didn't miss a comment from you, there was so far only the gboost model trained on this dataset, right? I had a quick look there as well and it seems to work for the (slightly different) classification task of questions vs. statements. I believe this task is a bit more difficult, and it seems that the classifier is very greedy on question words like "what, how, when ...". This means it can easily be tricked by queries like "show me what works well", which should rather be classified as a statement. I believe a transformer could do better here as it (hopefully) learns some additional order-based features.

**Way forward** 
- Let's add all three models. I will comment in the PR to have all relevant comments about the actual implementation in one place.
- If you want, you could also train a mini BERT on the SPAADIA dataset and upload it to HF. Then we have both model variants for both cases. Now that you have everything in place, it should be rather simple, and there might be people out there using it.

shahrukhx01 commented 3 years ago

@tholor thanks for the detailed feedback. Yes, I will also train a mini BERT on the SPAADIA dataset soon, before raising the next PR. I also look forward to your comments on the current PR and the way forward there.

tholor commented 3 years ago

Implemented in #1099

KadriMufti commented 1 year ago

Ladies and Gents, inspired by this thread I made a similar query-question classifier model for the Arabic language.

It can be found here: KadriMufti/arabic-finetuned-question-detection

I hope it is useful to the Haystack community.

The details of the model and testing results are in the model card on HuggingFace.

tholor commented 1 year ago

Hey @KadriMufti, That's awesome! Thanks for sharing it with the community :tada: