Closed juks closed 1 year ago
Hi, this is not working because you don't pass a list of examples for each class. Try the following.
data = {
    # "Eat some more of these soft French buns and drink some tea."
    'right': ['Съешь ещё этих мягких французских булок да выпей чаю.'],
    # "The quick brown fox jumps over the lazy dog."
    'wrong': ['Быстрая бурая лиса перепрыгивает через ленивую собаку.']
}
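As a quick sanity check (a minimal sketch, not part of the library's API), you can verify the training data has the expected shape before handing it to the pipeline: each label must map to a list of example strings.

```python
data = {
    'right': ['Съешь ещё этих мягких французских булок да выпей чаю.'],
    'wrong': ['Быстрая бурая лиса перепрыгивает через ленивую собаку.']
}

# Every label should map to a non-empty list of strings,
# not to a bare string.
for label, examples in data.items():
    assert isinstance(examples, list) and examples, f"{label!r} needs a list of examples"
    assert all(isinstance(s, str) for s in examples), f"{label!r} must contain strings"
```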
I am sorry for that. I just oversimplified the case from what it was in the actual project.
I will try to make it more representative and catch the point where it goes wrong.
Looks like it happens on adding the third label ('other'). In the full version there are about 5 labels with 50 samples each.
With two labels it gets the right category even with really short sentences.
import spacy
import classy_classification
nlp = spacy.load("ru_core_news_lg")
data = {
    # "Eat some more of these soft French buns and drink some tea."
    'right': ['Съешь ещё этих мягких французских булок да выпей чаю.', 'Съешь мягких французских булок да выпей чаю.'],
    # "The quick brown fox jumps over the lazy dog."
    'wrong': ['Быстрая бурая лиса перепрыгивает через ленивую собаку.', 'Быстрая бурая лиса перепрыгивает через собаку.'],
    # "This is a misunderstanding." / "This is some kind of misunderstanding."
    'other': ['Это недоразумение', 'Это какое-то недоразумение']
}
nlp.add_pipe(
    'text_categorizer',
    config={
        'data': data,
        'model': 'spacy'
    }
)
test = nlp('Бурая лиса перепрыгивает собаку')  # "The brown fox jumps over the dog"
print(test._.cats)
Result:
{'right': 0.18472916718366955, 'wrong': 0.16386449966253117, 'other': 0.6514063331537993}
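For clarity (a small sketch, assuming the score dict printed above), the label the pipeline effectively predicts is simply the highest-scoring key:

```python
cats = {'right': 0.18472916718366955,
        'wrong': 0.16386449966253117,
        'other': 0.6514063331537993}

# The predicted label is the key with the maximum score.
predicted = max(cats, key=cats.get)
print(predicted)  # prints 'other', although 'wrong' was the expected category
```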
Exactly the same sample as was used for training also gives the wrong result:
test = nlp('Съешь ещё этих мягких французских булок да выпей чаю.')  # identical to a 'right' training sample
print(test._.cats)
Result (expecting 'right'):
{'right': 0.22359858914585207, 'wrong': 0.6049628215142842, 'other': 0.17143858933986364}
Hi @juks, I think it is caused by this issue; I will release a fix today. https://github.com/davidberenstein1957/classy-classification/issues/28
This was fixed in version 0.6.3, cheers @juks
Thanks, David!
Will try it later.
Great, you might also want to check https://github.com/davidberenstein1957/spacy-setfit
That makes a lot of sense for me: while this ticket was open I had to solve my problem somehow, so I went with SetFit.
Thanks!
Hi!
Thanks for this library, it really makes complex things simple.
I wonder if some extra effort is required to use non-English models for classification? (English works great out of the box.)
Here is a very basic example of wrong classification. Is there a way to debug this and make it work?
Result is:
{'right': 0.45482638521673985, 'wrong': 0.5451736147832601}
It should take the 'right' label, the same way it does for English.