deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0
17.86k stars, 1.93k forks

`TransformersDocumentClassifier`: inconsistent output between ordinary and zero-shot classification #3167

Closed anakin87 closed 2 years ago

anakin87 commented 2 years ago

Describe the bug

When using TransformersDocumentClassifier, the output structure differs between ordinary and zero-shot classification. For example, for the same document, doc.meta['classification'] contains these different outputs:

Ordinary classification:

{'label': 'joy', 'score': 0.9433773756027222}

Zero-shot classification:

{'sequence': "The soundtrack album for the second season of HBO series ''Game of Thrones'', titled '''''Game of Thrones: Season 2''''', was published on June 19, 2012. The instrumental music by Ramin Djawadi was performed by the Czech Film Orchestra and Choir and recorded at the Rudolfinum concert hall in Prague.",
 'labels': ['music', 'history', 'natural language processing'],
 'scores': [0.953018069267273, 0.023982945829629898, 0.02299901656806469],
 'label': 'music'}

Expected behavior

I think that the output structure should always be the same.

To Reproduce

from haystack import Document
from haystack.nodes import TransformersDocumentClassifier

# A single document, classified by both classifiers below
doc = Document.from_dict(
    {
        'content': "The soundtrack album for the second season of HBO series ''Game of Thrones'', titled '''''Game of Thrones: Season 2''''', was published on June 19, 2012. The instrumental music by Ramin Djawadi was performed by the Czech Film Orchestra and Choir and recorded at the Rudolfinum concert hall in Prague.",
        'content_type': 'text',
        'meta': {'name': '25_Game_of_Thrones__Season_2__soundtrack_.txt'},
    }
)

# Ordinary classification: the labels come from the fine-tuned model
doc_classifier = TransformersDocumentClassifier(
    model_name_or_path="bhadresh-savani/distilbert-base-uncased-emotion"
)

# Zero-shot classification: the candidate labels are supplied by the caller
doc_classifier_zero_shot = TransformersDocumentClassifier(
    model_name_or_path="cross-encoder/nli-distilroberta-base",
    task="zero-shot-classification",
    labels=["music", "natural language processing", "history"],
)

# The two printed dicts have different structures
print(doc_classifier.predict([doc])[0].meta['classification'])
print(doc_classifier_zero_shot.predict([doc])[0].meta['classification'])

How do we want to tackle this issue? Can the output structure for ordinary classification (similar to that reported in docstrings/documentation) be considered the correct one?

anakin87 commented 2 years ago

@ZanSara any thoughts on this? If you agree with my approach (adapt the output of the zero-shot classification to the ordinary output), I can take charge of this issue.

ZanSara commented 2 years ago

Hey @anakin87! Actually I believe the zero-shot output to be more informative! We can still generalize by adapting the regular classifier's output to the zero-shot one. Something like:

Regular:

{
  'labels': ['music'],
  'scores': [0.953018069267273],
}

and zero-shot:

{
  'labels': ['music', 'history', 'natural language processing'],
  'scores': [0.953018069267273, 0.023982945829629898, 0.02299901656806469],
}

What do you think? (We should update the docstring in this case ofc).

If you think there's a risk of making this dict huge and noisy due to the presence of thousands of labels, we can also add a filtering param, like a top_k or a threshold, to the class and reduce the labels list that way.
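A minimal sketch of what such a filter could look like, assuming the 'labels'/'scores' structure proposed above. The function name filter_labels and its top_k and threshold parameters are hypothetical and are not part of the Haystack API:

```python
from typing import Dict, List, Optional


def filter_labels(
    classification: Dict[str, List],
    top_k: Optional[int] = None,
    threshold: Optional[float] = None,
) -> Dict[str, List]:
    """Hypothetical helper: shrink a {'labels': [...], 'scores': [...]} dict.

    Labels are sorted by descending score, then filtered by `threshold`
    (minimum score to keep) and truncated to the `top_k` best entries.
    """
    pairs = sorted(
        zip(classification["labels"], classification["scores"]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if threshold is not None:
        pairs = [(label, score) for label, score in pairs if score >= threshold]
    if top_k is not None:
        pairs = pairs[:top_k]
    return {
        "labels": [label for label, _ in pairs],
        "scores": [score for _, score in pairs],
    }


# Scores taken from the zero-shot output reported in this issue
example = {
    "labels": ["music", "history", "natural language processing"],
    "scores": [0.953018069267273, 0.023982945829629898, 0.02299901656806469],
}
print(filter_labels(example, top_k=2))
print(filter_labels(example, threshold=0.5))
```

Applying the threshold before top_k means that top_k acts as an upper bound on the number of labels that pass the threshold; the opposite order would be an equally valid design choice.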

anakin87 commented 2 years ago

I'm a bit unsure 🤔

I agree that the zero-shot output conveys more information. Nonetheless, for filtering purposes, the regular output seems more suitable (see docs).

I'm also thinking of how to adapt this possible output to fix #3019...

@ZanSara If you come up with a conclusive idea, please let me know...

ZanSara commented 2 years ago

After all, I think this is a case where both options are good enough. It won't be an issue to stick with your idea, but I'll leave the last word to @masci to make sure.

anakin87 commented 2 years ago

Another option for doc.meta['classification'] could be like the following:

{
  'label': 'music',
  'details': {
    'music': 0.953018069267273,
    'history': 0.023982945829629898,
    'natural language processing': 0.02299901656806469
  }
}

The label field would allow straightforward filtering/routing, while details would contain the complete information on the classification.
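A rough sketch of how the two raw outputs from this issue could be mapped onto that structure. The helper name to_unified_classification is hypothetical, and this is not necessarily how the fix was implemented in Haystack:

```python
from typing import Any, Dict


def to_unified_classification(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: convert either raw output into
    {'label': <top label>, 'details': {<label>: <score>, ...}}.
    """
    if "labels" in raw and "scores" in raw:
        # Zero-shot style output: parallel lists of labels and scores
        details = dict(zip(raw["labels"], raw["scores"]))
    else:
        # Ordinary style output: a single label with its score
        details = {raw["label"]: raw["score"]}
    # The top-level label is the one with the highest score
    top_label = max(details, key=details.get)
    return {"label": top_label, "details": details}


# The two outputs reported in this issue (the 'sequence' key is omitted)
ordinary = {"label": "joy", "score": 0.9433773756027222}
zero_shot = {
    "labels": ["music", "history", "natural language processing"],
    "scores": [0.953018069267273, 0.023982945829629898, 0.02299901656806469],
    "label": "music",
}

print(to_unified_classification(ordinary))
print(to_unified_classification(zero_shot))
```

With this shape, routing or filtering code only needs to read classification['label'], while anything that needs the full score distribution reads classification['details'].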

ZanSara commented 2 years ago

I like this last approach! Let's use that 👍