Closed Insighttful closed 5 months ago
Noting that v0.1.5 has a `setfit_trainer_args` bool flag called `multi_label`, and that `factory_name` should now be `'spacy_setfit'` instead of `'text_categorizer'`.

    # Add the "spacy_setfit" spaCy pipeline component and configure it with SetFit parameters:
    nlp.add_pipe(
        "spacy_setfit",  # 'text_categorizer' becomes 'spacy_setfit'
        config={
            "pretrained_model_name_or_path": "paraphrase-MiniLM-L3-v2",
            "setfit_trainer_args": {
                "train_dataset": train_dataset,
                "eval_dataset": eval_dataset,
                "multi_label": True,  # <- Using multi_label here!
                "metric": "accuracy",
                "num_iterations": 20,
                "num_epochs": 1,
                "batch_size": 16,
                "samples_per_label": 3,
                "column_mapping": {"text": "text", "label": "label"},
            },
        },
    )

However, when `multi_label` is True, the trainer loses access to the training data.
<details>
<summary>Expand for details:</summary>

      File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/spacy/language.py", line 814, in add_pipe
        pipe_component = self.create_pipe(
                         ^^^^^^^^^^^^^^^^^
      File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/spacy/language.py", line 702, in create_pipe
        resolved = registry.resolve(cfg, validate=validate)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/confection/__init__.py", line 756, in resolve
        resolved, _ = cls._make(
                      ^^^^^^^^^^
      File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/confection/__init__.py", line 805, in _make
        filled, _, resolved = cls._fill(
                              ^^^^^^^^^^
      File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/confection/__init__.py", line 877, in _fill
        getter_result = getter(*args, **kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/spacy_setfit/__init__.py", line 42, in create_setfit_model
        return SpacySetFit.from_trained(
               ^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/spacy_setfit/models.py", line 73, in from_trained
        trainer.train()
      File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/setfit/trainer.py", line 384, in train
        train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 351, in __init__
        sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/anon/Developer/proj/.venv/lib/python3.11/site-packages/torch/utils/data/sampler.py", line 107, in __init__
        raise ValueError("num_samples should be a positive integer "
    ValueError: num_samples should be a positive integer value, but got num_samples=0

</details>
Specifically, in the setfit package's [trainer.py](https://github.com/huggingface/setfit/blob/cbc01ec402e86ca04e5e40e9bce7f618f3c2946c/src/setfit/trainer.py#L371) we get to a point where the `DataLoader` attempts to load data in a multi-label context, but either it doesn't have access to the parameters it expects or it doesn't load the data properly. The data load yields 0 records, even though debugging shows that the object has access to the labels and features, as evidenced here:
![image](https://github.com/davidberenstein1957/spacy-setfit/assets/116014557/d0e5f458-1d21-451d-9fbf-fc36cb184df9)
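For anyone trying to reproduce this, a quick sanity check on the records before they reach `nlp.add_pipe` may help rule out a plain data problem. This is only a sketch: `summarize_labels` is a hypothetical helper (not part of spacy-setfit or setfit), the sample rows are made up, and it assumes the dataset can be viewed as a list of dicts keyed the same way as `column_mapping`.

```python
from collections import Counter


def summarize_labels(records, text_key="text", label_key="label"):
    """Report row count and label shapes for a list of dict records.

    Hypothetical helper for debugging only; it does not call setfit.
    """
    if not records:
        raise ValueError("dataset is empty")
    kinds = Counter()
    widths = set()
    for row in records:
        if text_key not in row or label_key not in row:
            raise KeyError(f"row is missing {text_key!r} or {label_key!r}: {row}")
        label = row[label_key]
        if isinstance(label, (list, tuple)):
            # A vector-style label, e.g. [1, 0, 1]
            kinds["vector"] += 1
            widths.add(len(label))
        else:
            # A single class id or class name
            kinds["scalar"] += 1
    return {
        "rows": len(records),
        "kinds": dict(kinds),
        "vector_widths": sorted(widths),
    }


# Made-up rows, for illustration only.
sample = [
    {"text": "great battery, poor screen", "label": [1, 0, 1]},
    {"text": "fast shipping", "label": [0, 1, 0]},
]
summary = summarize_labels(sample)
```

If `kinds` mixes `scalar` and `vector`, or `vector_widths` holds more than one value, the label column is inconsistent, which would be worth fixing before looking further into the trainer.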
Wondering about multilabel...
TLDR: Can you provide any pointers on how to do multilabel properly using spacy-setfit?
Without explicitly specifying:
I note from the logging during training that, even though 3 labels are provided, `multi_label` self-configures to False. However, the model appears to train successfully:
[14:46:49] INFO evaluation: {'accuracy': 0.9265432098765432}
Attempting to explicitly specify:
The SetFit repo docs show the setup for multilabel under "Training on multilabel datasets".
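My understanding, which is an assumption on my part and not something confirmed in this thread, is that the multilabel setup expects each example's label as a multi-hot vector with one slot per class. A minimal sketch of that encoding in plain Python, where `to_multi_hot` and the class names are hypothetical:

```python
def to_multi_hot(label_lists, classes=None):
    """Turn per-example lists of class names into 0/1 vectors.

    Hypothetical helper; the class order is sorted unless given explicitly.
    """
    if classes is None:
        classes = sorted({name for labels in label_lists for name in labels})
    index = {name: i for i, name in enumerate(classes)}
    vectors = []
    for labels in label_lists:
        vec = [0] * len(classes)
        for name in labels:
            vec[index[name]] = 1
        vectors.append(vec)
    return classes, vectors


# Made-up class names, for illustration only.
classes, vectors = to_multi_hot([["billing", "urgent"], ["shipping"], []])
```

Passing `classes` explicitly keeps the vector layout identical between the train and eval splits, even when a class is absent from one of them.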
Attempting to implement this similarly using spacy-setfit results in a config validation error: