Adding a new question to a QnA item can change response to an existing question

Expected Behavior

Adding new questions to an item should not adversely affect any answers that were previously correctly matched.

Actual Behavior

Adding a new question can actually weaken the score of an existing good match, sometimes causing another item to have a stronger score. See example below.

Steps to Reproduce the Problem

Import two items in attached file: test.txt

Switch to 'test' tab, and test question "tell me about snorkeling" - observe that the expected item is 'test.001' has the higher score (though by a slim margin)

Now edit 'test.001' and add a second question: "what should I know about snorkeling"
Rerun the test with the same question.. Now the other answer has the higher score.

It is counter intuitive, and undesirable, that adding the second question would change the answer.

Analysis

The QnABot uses elastic search ‘full text search’ capability to create ‘relevance scores’ for each QnA item. Relevance scores are computed by weighting a number of different factors in an effort to get the best match – see What is relevance

There are three factors to the scoring a) term frequency, b) inverse document frequency, and 3) field length norm.. I believe it is this third factor that is biting us here.

This is because the effect of adding the second question to the first item was to make the whole ‘question’ field longer, which reduced the relevance score of the match of item 1 (die to the ‘field length norm’ behavior aforementioned). The score was reduced to the point where it was slightly lower than the first item.

NOTE: This situation really only arises when the question results in very similar scores on multiple items.. Ie when similarilties between the questions prevent a strong unique match.

Options

A shortterm workaround can be to duplicate the question in order to increase the 'terms frequency' part of the scoring equation. ie adding a 3rd question to 'test.001' duplicating the initial question "tell me about snorkelling" once again results in this item having the highest score. Despite the fact that the 3rd question also lengthened the field further, the fact that it was a strong match to the question had the net effect of strengthening the overall score. However, while this technique might be useful for avoiding this specific problem, I do worry that it could introduce new problems by weakening the score of other question variants. Could become a game of 'whack-a-mole'!

Better if we can fix a fix in the code 1) (preferred) find a way to construct the doctype mapping or the query to negate the problematic 'field length norm' factor when matching on the question lists. The total number of or length of the questions should ideally not affect the scoring of a match. See disable field-length norm in mapping

UPDATE: disabling 'field-length norm' did resolve this issue but created a new issue.. namely with it disabled, the question "tell me about snorkeling" returned identical relevance scores for the two test items "tell me about snorkeling' and 'tell me about snorkel prices' since after stemming (snorkeling=snorkel) the matches were identical. So back to the drawing board. Next up - see if mapping the question array as a nested datatype will help.
2) alternatively, enhance the elastic search document structure to model each question independently, either by duplicating answers where there are multiple questions, or using a parent/child nested mapping to next questions as separate documents under the parent answer.

Using a nested datatype for questions seems to be the way to go. With this approach, the terms are matched separately against each question.

Mapping for nested questions:

curl -H'Content-Type: application/json' -XPUT "$ESURL/qna-index" -d '{
  "mappings": {
    "qna": {
                   "properties":{
                        "qid":{"type":"keyword"},
                        "question":{
                            "type":"nested"
                            },
                        "a":{
                            "type":"text",
                            "analyzer":"english"
                        },
                        "r":{"properties":{
                            "attachmentLinkUrl":{"type":"keyword"},
                            "buttons":{"properties":{
                                "text":{"type":"text"},
                                "value":{"type":"keyword"}
                            }},
                            "imageUrl":{"type":"keyword"},
                            "subTitle":{"type":"text"},
                            "title":{"type":"text"}
                        }}
                    }
                }
            }
}'

Test Data:

curl -H'Content-Type: application/json' -XPUT "$ESURL/qna-index/qna/test.001" -d '{
         "question": [
            {"q":"tell me about snorkeling"}
         ],
         "a": "Snorkeling is cool!",
         "r": {
            "title": "",
            "imageUrl": ""
         },
         "qid": "test.001"
}'

curl -H'Content-Type: application/json' -XPUT "$ESURL/qna-index/qna/test.002" -d '{
         "question": [
            {"q":"tell me about snorkel prices"}
         ],
         "a": "Snorkels are not expensive",
         "r": {
            "title": "",
            "imageUrl": ""
         },
         "qid": "test.002"
}'

Add more questions to test.001 to check that we still get correct answer:

curl -H'Content-Type: application/json' -XPUT "$ESURL/qna-index/qna/test.001" -d '{
         "question": [
            {"q":"tell me about snorkeling"},
            {"q":"nothing in common"},
            {"q":"something else again"}
         ],
         "a": "Snorkeling is cool!",
         "r": {
            "title": "",
            "imageUrl": ""
         },
         "qid": "test.001"
}'

Test query:

Notes:

Important to use ?search_type=dfs_query_then_fetch with small numbers of objects to combine idf across shards. (QnABot handler already does this)
score_mode:max used to return the score of the strongest question match, and avoid diluting a strong match with other weaker matches.
boost used to give double weighting to matches on question, compared to matches on answer (previously implemented with multi-match 'fields' syntax.
try adding more questions to either test.001 or test.002 to ensure that adding more questions doesn't ever result in the wrong answer being ranked higher.

curl -H'Content-Type: application/json' -XPOST "$ESURL/qna-index/qna/_search?search_type=dfs_query_then_fetch" -d '{  
   "query":{  
      "bool":{  
         "should":[  
            {  
               "nested":{  
                  "path":"question",
                  "score_mode":"max",
                  "boost":2,
                  "query":{  
                     "match":{  
                        "question.q":"tell me about snorkeling"
                     }
                  }
               }
            },
            {
                "match":{
                    "a":"tell me about snorkeling"
                }
            },
            {
                 "match":{
                    "t":"topicvalue"
                }
            }     
         ]
      }
   }
}'

This change will modify the JSON structure for documents. The content designer 'Import' function should support previous JSON structure for backward compatibility and to allow content migration, however i think we can migrate export and import to new nested structure going forward.

We should probably also take this opportunity to rename the fields in the document JSON to replace fields "a", "q", "t", "r" with more explicit longform names that better reflect the field meaning.

aws-solutions / qnabot-on-aws