MicrosoftDocs / azure-docs

Open source documentation of Microsoft Azure
https://docs.microsoft.com/azure
Creative Commons Attribution 4.0 International

Confidence score differences section #41885

Closed v-kydela closed 4 years ago

v-kydela commented 4 years ago

I'm very confused by this section and I have many questions.

> the content of the test and the published knowledge base are located in different Azure Search indexes

How can you tell what these Azure Search indexes are? Are both indexes in the Search Service that gets created with your QnA Maker resource? And why would different Azure Search indexes give different confidence scores? Shouldn't they operate the same way?

> If you have a knowledge base in different regions, each region uses its own Azure Search index

What would it mean for one knowledge base to be in different regions? Would that be if you export the knowledge base and then import it to make a copy?



YutongTie-MSFT commented 4 years ago

@v-kydela Thanks for the feedback! We are currently investigating and will update you shortly.

YutongTie-MSFT commented 4 years ago

@diberry Hi Dina, do you have any clue about this issue? Please share if you have any. Thanks.

diberry commented 4 years ago

@v-kydela Hi Kyle - You've pointed out a genuine mismatch of information spread across multiple doc pages. For a QnA Maker service's associated search service, all the test KBs are in the test index. When you publish a KB, a version of it is moved to its own index. A query can behave differently between test and publish because in the interactive test panel, top is set for you to a specific value, while you have to build your own query params against the published endpoint, and one choice is not to set top at all. A best practice is to set top to 30 on the publish query - however, if you are using Bot Framework, I'm less sure of the details of setting top there.
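
That said, the Node botbuilder-ai SDK does appear to expose a top field on the options you pass to QnAMaker.getAnswers, so a rough sketch from a bot would look like this (the endpoint values are placeholders for your own resource):

```typescript
// Sketch: setting `top` when querying a published KB through botbuilder-ai.
// The endpoint values below are placeholders for your own resource.
import { TurnContext } from 'botbuilder';
import { QnAMaker } from 'botbuilder-ai';

const qnaMaker = new QnAMaker({
    knowledgeBaseId: '<your-kb-id>',
    endpointKey: '<your-endpoint-key>',
    host: 'https://<your-app-service>.azurewebsites.net/qnamaker'
});

async function answerQuestion(context: TurnContext): Promise<void> {
    // Ask the search index for up to 30 candidates so the QnA Maker
    // ranker has a full set to choose from (the suggested best practice).
    const results = await qnaMaker.getAnswers(context, { top: 30 });
    if (results.length > 0) {
        await context.sendActivity(results[0].answer);
    }
}
```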

In terms of a knowledge base per region, yes, you would create the knowledge base in one region then export/import into any other regions you need.

Does this answer all your questions?


v-kydela commented 4 years ago

Thank you for your help, @diberry

I'm still not understanding why any of this would generate different confidence scores. It sounds like you're saying the operative difference between the test index and a published index is that the test index sets a top value automatically. But why would the top value influence confidence scores? If a search query returns answers with scores of 75, 50, and 25, and you select the top 2, shouldn't that give you 2 answers with scores of 75 and 50, meaning the confidence scores of the top 2 results are the same in both cases? And even if the confidence scores are influenced by the value of top, shouldn't that mean the published index would yield the same result as the test index if you manually set top to the same value that the test index sets it to? And how would this explanation account for confidence score differences between regions?

I'm also not understanding the relationship between a search "index" and an Azure Search service resource. Do the test index and your published indexes all go in the same Search service? Is there some way to see the indexes in the Azure portal?

diberry commented 4 years ago

@v-kydela If you create your QnA Maker resource in a new resource group, that resource group will also have an Azure Search service and an Azure web app. All KBs created on the service will also use those other two resources. All test KBs will be in the test index - so there are more rows in the test index than in a published KB's index.
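
To see those indexes yourself, you can open the Search service in the Azure portal and look at its Indexes blade, or list them over the Azure Cognitive Search REST API. A minimal sketch, assuming an admin key for the service (the service name and key are placeholders):

```typescript
// Sketch: listing the indexes (the shared test index plus one index per
// published KB) on the Azure Search service behind a QnA Maker resource.
// The service name and admin key are placeholders.
const service = '<your-search-service>';
const adminKey = '<your-admin-key>';

async function listIndexes(): Promise<void> {
    const res = await fetch(
        `https://${service}.search.windows.net/indexes?api-version=2020-06-30`,
        { headers: { 'api-key': adminKey } }
    );
    const body = await res.json();
    for (const index of body.value) {
        // Expect something like 'testkb' plus one index per published KB.
        console.log(index.name);
    }
}
```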

There is a two-part filtration for the query, implemented with the two rankers. The search service ranker returns at most top results. So if you send a query with a top of 30, then up to 30 results are returned from the search index and passed to the QnA Maker NLP ranker. If, however, you don't set top, or you set it to a low number like 1, then only the top 1 result found by the search service can be passed to the QnA Maker NLP ranker.

So if you ask for a top of 30, but only 3 items match, QnA Maker's ranker has only those 3 items to choose from for the best answer, or matching answer, or whatever criteria you have placed on the query (multi-turn, tags).

If you don't set top, or you set it to a low number, then the search index only returns that number of results, and QnA Maker has only those answers from which to find an answer.

If you don't set top for a query to the published KB, you are, from the very beginning, not sending the same query, because the interactive test panel sets top by default.
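
To make that concrete, here is a purely illustrative sketch of the two-stage flow. All rows and scores are made up, and this is not QnA Maker's actual implementation; it just shows why a small top can hide the best answer from the second-stage ranker:

```typescript
// Illustrative only: why `top` changes which answer wins.
// All rows and scores are hypothetical.
interface Row { question: string; search: number; nlp: number; }

const kb: Row[] = [
    { question: 'What is the purpose of QnA Maker?', search: 0.9, nlp: 0.50 },
    { question: 'How does QnA Maker work?',          search: 0.8, nlp: 0.40 },
    { question: 'What does QnA Maker do?',           search: 0.3, nlp: 0.95 },
];

function bestAnswer(top: number): string {
    // Stage 1: the search index keeps only the `top` highest search scores.
    const survivors = [...kb]
        .sort((a, b) => b.search - a.search)
        .slice(0, top);
    // Stage 2: the NLP ranker picks the best of whatever survived stage 1.
    survivors.sort((a, b) => b.nlp - a.nlp);
    return survivors[0].question;
}

console.log(bestAnswer(1)); // 'What is the purpose of QnA Maker?' - best row never seen
console.log(bestAnswer(3)); // 'What does QnA Maker do?' - NLP ranker sees every row
```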

Does this make sense?

v-kydela commented 4 years ago

@diberry - Okay, let me see if I understand correctly. There are two rankers: the search service ranker and the QnA Maker NLP ranker. The search service ranker goes first, and that's where top is applied. The query that's passed to the search service ranker is the user's "question" that QnA Maker is supposed to answer. If top is 10, for example, then the search service ranker will search the KB (where each row contains a different question/answer pair) and return the 10 closest matches it can find. So "What does QnA Maker do?" might return rows for "What is the purpose of QnA Maker?" and "How does QnA Maker work?" etc.

Those top 10 rows are then passed in as the input to the QnA Maker NLP ranker. While the search service ranker has its own sort of AI for determining which rows come close to the question, the QnA Maker NLP ranker is responsible for determining just how close each row actually is. Since the QnA Maker NLP ranker's AI is more sophisticated than the search service ranker's, it might end up ordering the rows differently. So if the search service ranker returns rows A, B, and C as the top 3 in that order, the QnA Maker NLP ranker might give the scores 60 to C, 50 to A, and 40 to B. While the search service ranker has to judge every row in the KB to know what to return, the QnA Maker NLP ranker only has to look at as many rows as specified by top, so it's like taking a closer look at the results.

So I'm guessing there are two reasons the top parameter influences the confidence scores:

  1. The search service ranker might misjudge the rows in such a way that the actual best row doesn't get passed on to the QnA Maker NLP ranker, and so the best answer would end up missing from the results because the more powerful AI never got a chance to examine it.
  2. The QnA Maker NLP ranker generates confidence scores partly by comparing the rows to each other, so the scores could be considered "relative" in a way. This would mean that even if the search service ranker passes on the same top 10 rows that the QnA Maker NLP ranker would select as the top 10 no matter what the top value was, it would give those 10 rows different confidence scores if it had more rows to compare them to. So if C is 60, A is 50, and B is 40 when top is 10, then if top were 30 instead, B might be 70, C might be 65, and A might be 55. The same rows might be given different scores just because of the other rows being compared with them. (I have since discovered that the QnA Maker NLP ranker does not compare the rows to each other.)

Is my understanding correct?

diberry commented 4 years ago

@v-kydela - This all sounds plausible but may not be exactly correct in the particulars. Is this question related to Bot Framework somehow?

v-kydela commented 4 years ago

@diberry - Not necessarily, but I want to make sure I have a clear understanding of confidence score differences because customers often report this kind of problem.

diberry commented 4 years ago

@v-kydela Based on yesterday's meeting, what issues do you think the doc has with the confidence score explanations?

v-kydela commented 4 years ago

@diberry - The documentation seems to be misleading because it gives users the impression that the primary reason for score differences is different indexes/regions, even though there's nothing inherent about using a different index that would generate a different score. The underlying reasons are not given, and so customers won't be able to come up with solutions on their own. However, the documentation could be left alone if you want to continue to leave out implementation details.

diberry commented 4 years ago

@v-kydela - What concepts do you feel would help the customer, without diving into implementation details that may change?

v-kydela commented 4 years ago

@diberry - It would help the customer to know that test KBs share an index whereas production KBs don't, and that this affects confidence scores by affecting word frequencies. It would help customers to know that they can fix that particular problem by having only one KB in their resource, or by using a staging KB in its own production index for testing.

It's also important to let customers know that they need to use the same parameters (especially top) when they query the test and production indexes if they want to get comparable results.
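
For example, here is a minimal sketch of pinning top on a call to the published endpoint's generateAnswer route (the host, key, and KB id are placeholders, and top should match whatever value the test panel uses):

```typescript
// Sketch: querying the published KB with an explicit `top`, so the query
// matches what the interactive test panel sends. All values below are
// placeholders for your own resource.
const host = 'https://<your-app-service>.azurewebsites.net/qnamaker';
const kbId = '<your-kb-id>';
const endpointKey = '<your-endpoint-key>';

async function generateAnswer(question: string, top: number) {
    const res = await fetch(`${host}/knowledgebases/${kbId}/generateAnswer`, {
        method: 'POST',
        headers: {
            'Authorization': `EndpointKey ${endpointKey}`,
            'Content-Type': 'application/json'
        },
        // Keep `top` (and any other parameters) identical across test and
        // production queries if you want comparable confidence scores.
        body: JSON.stringify({ question, top })
    });
    return res.json();
}
```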

diberry commented 4 years ago

Fix will be published in the next 24 hours. #please-close