centre-for-humanities-computing / dfm-sentence-transformers

Code for curating data and training sentence transformers for the Danish Foundation Models project.
MIT License

New interface for tasks. #5

Closed: x-tabdeveloping closed 11 months ago

x-tabdeveloping commented 11 months ago

I removed my negative sampling task. As per #3, MultipleNegativesRanking loss is going to be the default. Tasks can easily be added to the registry in the future if we need them.

Multiple tasks now use sentence_transformers' default training regime. Tasks that have the same loss are merged, so that training examples are sampled from a mixture of datasets. This is achieved by grouping tasks by their string representation, which should be identical whenever the loss function and its parameters are identical. For example, a snippet from the implementation of MultipleNegativesRanking:

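class MultipleNegativesRanking(Task):
    ...  # rest of the implementation elided

    def __str__(self):
        # Only loss parameters appear here, so tasks with the same loss compare equal as strings
        return f"MultipleNegativesRanking(scale={self.scale})"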

Tasks are then grouped as such:

from itertools import groupby
from dfm_sentence_trf.tasks import Task

tasks: list[Task]
# groupby only merges adjacent items, so sort by the same key first
for loss_name, group in groupby(sorted(tasks, key=str), key=str):
    pass
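
To make the grouping step concrete, here is a minimal, self-contained sketch. The `Task`, `MultipleNegativesRanking` and `STS` classes below are hypothetical stand-ins, not the classes from this repo, and the `dataset_name` argument and the `group_by_loss` helper are assumptions made for illustration. One thing worth noting is that `itertools.groupby` only merges adjacent items, so the sketch sorts by the same key before grouping.

```python
from itertools import groupby


class Task:
    """Hypothetical minimal stand-in for dfm_sentence_trf.tasks.Task."""

    def __init__(self, dataset_name):
        self.dataset_name = dataset_name


class MultipleNegativesRanking(Task):
    def __init__(self, dataset_name, scale=20.0):
        super().__init__(dataset_name)
        self.scale = scale

    def __str__(self):
        # Only loss parameters are included, never the dataset
        return f"MultipleNegativesRanking(scale={self.scale})"


class STS(Task):
    def __str__(self):
        return "STS()"


def group_by_loss(tasks):
    """Map each loss description to the dataset names of the tasks sharing it."""
    # groupby only merges adjacent items, so sort by the same key first
    ordered = sorted(tasks, key=str)
    return {
        loss_name: [task.dataset_name for task in group]
        for loss_name, group in groupby(ordered, key=str)
    }


tasks = [
    MultipleNegativesRanking("hestenettet"),
    STS("sts_data"),
    MultipleNegativesRanking("faq"),
    MultipleNegativesRanking("forum", scale=10.0),
]
groups = group_by_loss(tasks)
for loss_name, dataset_names in groups.items():
    print(loss_name, dataset_names)
```

In this sketch the two tasks with `scale=20.0` end up in one group even though they were not adjacent in the input list, while the task with `scale=10.0` gets its own group because its string representation differs.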

We can't just use @dataclass because tasks also take the dataset and dataset-related arguments, which would end up in the generated representation and make tasks with the same loss look different.

Datasets for tasks can be loaded with 🤗 Datasets' load_dataset() function when describing tasks in the configuration file, so both local and remote datasets can be used (#4).

Example:

[tasks.hestenettet]
@tasks="multiple_negatives_ranking"
sentence1="question"
sentence2="answer"

[tasks.hestenettet.dataset]
@loaders="load_dataset"
path="some/local/file.jsonl"
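
As a rough illustration of how a config like this turns into objects, the sketch below parses the same text with the standard library and resolves it bottom-up. It is a simplified stand-in for what the config system does, not its actual implementation: the `REGISTRIES` dict, the `parse_nested` and `resolve` helpers, and the two placeholder functions are all hypothetical. It relies on two conventions visible in the example: dotted section names nest, and a key starting with `@` names a registry and a registered function that is called with the remaining keys as arguments.

```python
import configparser
import json

CONFIG_TEXT = """
[tasks.hestenettet]
@tasks="multiple_negatives_ranking"
sentence1="question"
sentence2="answer"

[tasks.hestenettet.dataset]
@loaders="load_dataset"
path="some/local/file.jsonl"
"""

# Hypothetical stand-in registries: registry name -> {function name -> callable}
REGISTRIES = {
    "loaders": {"load_dataset": lambda path: f"<dataset from {path}>"},
    "tasks": {
        "multiple_negatives_ranking": lambda dataset, sentence1, sentence2: {
            "dataset": dataset,
            "columns": (sentence1, sentence2),
        }
    },
}


def parse_nested(text):
    """Turn dotted section names into nested dicts with JSON-decoded values."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys exactly as written
    parser.read_string(text)
    root = {}
    for section in parser.sections():
        node = root
        for part in section.split("."):
            node = node.setdefault(part, {})
        for key, value in parser[section].items():
            node[key] = json.loads(value)
    return root


def resolve(node):
    """Resolve children first, then call the function named by an '@' key."""
    if not isinstance(node, dict):
        return node
    resolved = {key: resolve(value) for key, value in node.items()}
    references = [key for key in resolved if key.startswith("@")]
    if not references:
        return resolved
    registry_name = references[0][1:]
    func = REGISTRIES[registry_name][resolved.pop(references[0])]
    return func(**resolved)


result = resolve(parse_nested(CONFIG_TEXT))
print(result["tasks"]["hestenettet"])
```

The nested `dataset` block is resolved first, so the task function receives an already loaded dataset as its `dataset` argument alongside `sentence1` and `sentence2`.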

Unfortunately, due to validation errors, I couldn't put the original function into the registry directly. Here is the wrapper that registers it instead:

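from typing import Optional, Union

import catalogue
from confection import registry  # assumed import; the snippet only shows `registry`
from datasets import Dataset, DatasetDict, load_dataset

# Create a dedicated "loaders" registry for dataset loading functions
registry.loaders = catalogue.create(
    "confection", "loaders", entry_points=False
)

# Thin wrapper exposing only the arguments the config system needs to validate
@registry.loaders.register("load_dataset")
def _load_dataset(
    path: str, name: Optional[str] = None
) -> Union[Dataset, DatasetDict]:
    return load_dataset(path, name=name)  # type: ignore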

This might have to change in the future if we intend to use other arguments as well.

x-tabdeveloping commented 11 months ago

You've probably seen it, but I added two more tasks for STS and NLI. Both can be used in the config system from now on, so if we intend to add data for these tasks, we just add one more task in the config. 🤗