Different language models

juks commented 1 year ago

Hi!

Thanks for this library, it really makes complex things simple.

Is any extra effort required to use non-English models for classification? (English works great out of the box.)

Here is a very basic example of a wrong classification. Is there a way to debug it and make it work?

import spacy
import classy_classification

nlp = spacy.load("ru_core_news_lg")

data = {
    'right': 'Съешь ещё этих мягких французских булок да выпей чаю.',
    'wrong': 'Быстрая бурая лиса перепрыгивает через ленивую собаку.'
}

nlp.add_pipe(
    'text_categorizer',
    config={
        'data': data,
        'model': 'spacy'
    }
)

test = nlp('Съешь мягких булок')
print(test._.cats)

Result is: {'right': 0.45482638521673985, 'wrong': 0.5451736147832601}

It should take the 'right' label, the same way it does for English.

davidberenstein1957 commented 1 year ago

Hi, this is not working because each class needs a list of examples, and you are passing a single string per class. Try the following.

data = {
    'right': ['Съешь ещё этих мягких французских булок да выпей чаю.'],
    'wrong': ['Быстрая бурая лиса перепрыгивает через ленивую собаку.']
}
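A small guard can catch this shape mistake before the data reaches the pipeline. The following is a minimal sketch, not part of classy-classification: `normalize_training_data` is a hypothetical helper that wraps a bare string into a one-element list and rejects empty or non-string examples. A plain string is itself iterable in Python, so code that expects a list of examples may accept it without raising an error.

```python
def normalize_training_data(data):
    """Return a copy of `data` in which every label maps to a list of strings.

    Hypothetical helper for illustration; not part of classy-classification.
    """
    normalized = {}
    for label, examples in data.items():
        # Wrap a bare string so it counts as one example
        # instead of being treated as a sequence of characters.
        if isinstance(examples, str):
            examples = [examples]
        examples = list(examples)
        if not examples or not all(isinstance(e, str) for e in examples):
            raise ValueError(f"label {label!r} needs a non-empty list of strings")
        normalized[label] = examples
    return normalized


data = normalize_training_data({
    'right': 'Съешь ещё этих мягких французских булок да выпей чаю.',
    'wrong': ['Быстрая бурая лиса перепрыгивает через ленивую собаку.'],
})
```

The resulting `data` can then be passed in the `config` of `nlp.add_pipe` as in the original example.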

juks commented 1 year ago

Sorry about that. I oversimplified the case compared to what it was in the actual project.

I will try to make the example closer to the real one and find the point where it goes wrong.

juks commented 1 year ago

It looks like it happens when the third label ('other') is added. In the full version there are about 5 labels with 50 samples each.

With two labels it gets the right category even with really short sentences.

import spacy
import classy_classification

nlp = spacy.load("ru_core_news_lg")

data = {
    'right': ['Съешь ещё этих мягких французских булок да выпей чаю.', 'Съешь мягких французских булок да выпей чаю.'],
    'wrong': ['Быстрая бурая лиса перепрыгивает через ленивую собаку.', 'Быстрая бурая лиса перепрыгивает через собаку.'],
    'other': ['Это недоразумение', 'Это какое-то недоразумение']
}

nlp.add_pipe(
    'text_categorizer',
    config={
        'data': data,
        'model': 'spacy'
    }
)

test = nlp('Бурая лиса перепрыгивает собаку')
print(test._.cats)

Result: {'right': 0.18472916718366955, 'wrong': 0.16386449966253117, 'other': 0.6514063331537993}

juks commented 1 year ago

Exactly the same sample that was used for training also gives the wrong result:

test = nlp('Съешь ещё этих мягких французских булок да выпей чаю.')
print(test._.cats)

Result (expecting 'right'):

{'right': 0.22359858914585207, 'wrong': 0.6049628215142842, 'other': 0.17143858933986364}
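One way to debug this kind of problem is to run every training sample back through the classifier and list the ones that do not get their own label. The sketch below is an illustration under stated assumptions: `top_label` and `misclassified_training_samples` are hypothetical helpers, and `predict` stands for any callable that maps a text to a score dict, for instance `lambda text: nlp(text)._.cats` with the pipeline above.

```python
def top_label(cats):
    """Return the label with the highest score in a {label: score} dict."""
    return max(cats, key=cats.get)


def misclassified_training_samples(data, predict):
    """Yield (expected, predicted, text) for each training text that gets the wrong label.

    `data` maps each label to a list of example strings.
    `predict` is any callable mapping a text to a {label: score} dict.
    """
    for expected, examples in data.items():
        for text in examples:
            predicted = top_label(predict(text))
            if predicted != expected:
                yield expected, predicted, text


# Scores reported above for the exact training sentence (expected 'right').
reported = {
    'right': 0.22359858914585207,
    'wrong': 0.6049628215142842,
    'other': 0.17143858933986364,
}
print(top_label(reported))  # prints: wrong
```

If a classifier cannot recover the label of a sentence it was trained on, the problem is more likely in how the examples are embedded or fitted than in the wording of the test sentence.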

davidberenstein1957 commented 1 year ago

Hi @juks, I think this is caused by the issue linked below. I will release a fix today. https://github.com/davidberenstein1957/classy-classification/issues/28

davidberenstein1957 commented 1 year ago

This was fixed in version 0.6.3, cheers @juks
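To confirm that an environment already has the fix, the installed version can be compared against 0.6.3. This is a rough sketch that assumes the distribution is installed under the name `classy-classification`; `version_tuple` is a hypothetical helper that only handles plain dotted version numbers.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(v):
    """Turn a dotted version such as '0.6.3' into (0, 6, 3).

    Only the leading digits of the first three components are used.
    """
    parts = []
    for piece in v.split('.')[:3]:
        digits = ''
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


try:
    # Assumed distribution name; adjust if the package is installed differently.
    installed = version('classy-classification')
except PackageNotFoundError:
    installed = None

if installed is None:
    print('classy-classification is not installed')
elif version_tuple(installed) < version_tuple('0.6.3'):
    print(f'{installed} predates the fix; upgrade to 0.6.3 or later')
else:
    print(f'{installed} includes the fix')
```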

juks commented 1 year ago

Thanks, David!

Will try it later.

davidberenstein1957 commented 1 year ago

Great, you might also want to check https://github.com/davidberenstein1957/spacy-setfit

juks commented 1 year ago

That makes a lot of sense for me: while this ticket was open I had to solve my problem somehow, so I went with SetFit.

Thanks!

davidberenstein1957 / classy-classification

Different language models #38