davidberenstein1957 / spacy-setfit

This repository contains an easy and intuitive approach to use SetFit in combination with spaCy.
Apache License 2.0

Example Configuration for Multi-Label? #10

Closed Insighttful closed 5 months ago

Insighttful commented 11 months ago

Wondering about multilabel...

TLDR: Can you provide any pointers on how to do multilabel properly using spacy-setfit?

Without explicitly specifying:

The training log shows that multi_label is set to False automatically, even though 3 labels are provided:

The datasets have been created:    schemas.py:119
    labels: ['CAT_A', 'CAT_B', 'CAT_C']
    multi_label: False
    train_dataset: 1458
    eval_dataset: 162
config.json not found in HuggingFace Hub.

However, the model appears to train successfully: [14:46:49] INFO evaluation: {'accuracy': 0.9265432098765432}

Attempting to explicitly specify:

The Setfit repo docs show the setup for multilabel: Training on multilabel datasets

from setfit import SetFitModel

model = SetFitModel.from_pretrained(
    model_id,
    multi_target_strategy="one-vs-rest",
)

Attempting to implement this similarly with spacy-setfit results in a config validation error:

    nlp.add_pipe(
        "text_categorizer",
        config={
            "pretrained_model_name_or_path": "paraphrase-MiniLM-L3-v2",
            "multi_target_strategy": "one-vs-rest",
            "setfit_trainer_args": {
                "train_dataset": train_dataset,
                "eval_dataset": eval_dataset,
                "metric": "accuracy",
                "num_iterations": 20,
                "epochs": 1,
                "samples_per_label": 2,
                "column_mapping": {"text": "text", "label": "label"},
            },
        },
    )
Config validation error
text_categorizer -> multi_target_strategy       extra fields not permitted
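
The "extra fields not permitted" message suggests the component's config schema rejects any key it does not declare. Below is a minimal sketch of a pre-flight check that flags such keys before calling `nlp.add_pipe`. `ALLOWED_KEYS` and `find_unexpected_keys` are hypothetical names, and the allowed set contains only the two top-level keys that appear in this issue, which is an assumption and not the library's actual schema.

```python
# Hypothetical helper, not part of spacy-setfit.
# ASSUMPTION: only the two top-level keys seen in this issue are accepted.
ALLOWED_KEYS = {"pretrained_model_name_or_path", "setfit_trainer_args"}


def find_unexpected_keys(config, allowed=ALLOWED_KEYS):
    """Return the top-level config keys that are not in the allowed set, sorted."""
    return sorted(key for key in config if key not in allowed)


config = {
    "pretrained_model_name_or_path": "paraphrase-MiniLM-L3-v2",
    "multi_target_strategy": "one-vs-rest",
    "setfit_trainer_args": {"metric": "accuracy"},
}

unexpected = find_unexpected_keys(config)
if unexpected:
    print(f"Keys the schema would likely reject: {unexpected}")
```

Checking only the top level matches where the error was reported (`text_categorizer -> multi_target_strategy`); nested keys under `setfit_trainer_args` are not inspected in this sketch.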
Insighttful commented 11 months ago

Updated for V0.1.5

Note that v0.1.5 has a boolean flag called 'multi_label' in setfit_trainer_args, and the factory name should now be 'spacy_setfit' instead of 'text_categorizer'.

    # Add the "spacy_setfit" spaCy pipeline component, and configure it with SetFit parameters:
    nlp.add_pipe(
        "spacy_setfit",  # 'text_categorizer' becomes 'spacy_setfit'
        config={
            "pretrained_model_name_or_path": "paraphrase-MiniLM-L3-v2",
            "setfit_trainer_args": {
                "train_dataset": train_dataset,
                "eval_dataset": eval_dataset,
                "multi_label": True,  # <- Using multi_label here!
                "metric": "accuracy",
                "num_iterations": 20,
                "num_epochs": 1,
                "batch_size": 16,
                "samples_per_label": 3,
                "column_mapping": {"text": "text", "label": "label"},
            },
        },
    )

Error when multi_label is True

However, when multi_label is True, the trainer loses access to the training data.

<details>
<summary>Expand for details</summary>

ValueError: num_samples should be a positive integer value, but got num_samples=0

```python
  File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/spacy/language.py", line 814, in add_pipe
    pipe_component = self.create_pipe(
                     ^^^^^^^^^^^^^^^^^
  File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/spacy/language.py", line 702, in create_pipe
    resolved = registry.resolve(cfg, validate=validate)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/confection/__init__.py", line 756, in resolve
    resolved, _ = cls._make(
                  ^^^^^^^^^^
  File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/confection/__init__.py", line 805, in _make
    filled, _, resolved = cls._fill(
                          ^^^^^^^^^^
  File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/confection/__init__.py", line 877, in _fill
    getter_result = getter(*args, **kwargs)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/spacy_setfit/__init__.py", line 42, in create_setfit_model
    return SpacySetFit.from_trained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/spacy_setfit/models.py", line 73, in from_trained
    trainer.train()
  File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/setfit/trainer.py", line 384, in train
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 351, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/torch/utils/data/sampler.py", line 107, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0
```

</details>

Specifically, [the setfit package's trainer.py reaches a point where the DataLoader attempts](https://github.com/huggingface/setfit/blob/cbc01ec402e86ca04e5e40e9bce7f618f3c2946c/src/setfit/trainer.py#L371) to load data in a multi-label context, but it either lacks the parameters it expects or does not load the data properly. The load yields 0 records, even though debugging shows that the object has access to the labels and features, as evidenced here:

![image](https://github.com/davidberenstein1957/spacy-setfit/assets/116014557/d0e5f458-1d21-451d-9fbf-fc36cb184df9)
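
One thing that may be worth ruling out is the label format. As far as I can tell, the SetFit multilabel example encodes each record's label as a multi-hot list (one 0/1 entry per class) rather than a single class name. Below is a minimal sketch of that encoding in plain Python, using the three labels from the log above. `to_multi_hot` is a hypothetical helper, and whether spacy-setfit expects this format when `multi_label` is True is an assumption, not something confirmed in this thread.

```python
# Labels as reported in the training log.
LABELS = ["CAT_A", "CAT_B", "CAT_C"]


def to_multi_hot(active, labels=LABELS):
    """Encode the active label names as a 0/1 list aligned with `labels`.

    Hypothetical helper for illustration; not part of setfit or spacy-setfit.
    """
    unknown = set(active) - set(labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if label in active else 0 for label in labels]


# ASSUMPTION: rows shaped like this match the {"text": ..., "label": ...}
# column mapping used in the config above.
rows = [
    {"text": "example one", "label": to_multi_hot(["CAT_A", "CAT_C"])},
    {"text": "example two", "label": to_multi_hot(["CAT_B"])},
]

for row in rows:
    print(row)
```

If the training data currently holds one class name per row, comparing it against this shape is a cheap check before digging further into the trainer.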