centre-for-humanities-computing / dfm-sentence-transformers

Code for curating data and training sentence transformers for the Danish Foundation Models project.
MIT License

Good Config Schema #4

Closed x-tabdeveloping closed 11 months ago

x-tabdeveloping commented 11 months ago

I came up with the current config schema in a couple of hours while experimenting, and I think we should put more thought into it.

Challenges to be addressed here:

  1. How many sections do we want, and what should they mean? (I have looked at spaCy for inspiration, but it's not the same thing, so some of their choices might not work for us.)
  2. How much tinkering do we want to allow? That is: which hyperparameters do we want to control? Do we want to mess with the pooling layer at all, or do we just accept that pooling == mean?
  3. How do we describe tasks in a reproducible way? My vision is a config system where we can define the different tasks and datasets, but for that we need some sort of system/process/schema. A couple of challenges concerning this:
    • Open and non-open datasets: some of the data we want to use is open, so we can access it from (or upload it to) the Hugging Face Hub, which is awesome. But we also need a way of loading non-open data. My first thought is that we could just add `load_dataset` from Datasets to a registry, like this:
      
```python
from confection import registry
from datasets import load_dataset

registry.loaders.register("load_dataset")(load_dataset)
```
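For `registry.loaders` to exist, it would first have to be created; confection supports attaching custom registries via `catalogue`. A minimal sketch — the `dfm_sentence_transformers` namespace and the `load_jsonl` helper are made up for illustration:

```python
import json

import catalogue
from confection import registry

# Assumption: confection lets us attach a project-specific registry by
# assigning a catalogue namespace; "loaders" is a name chosen here.
registry.loaders = catalogue.create(
    "dfm_sentence_transformers", "loaders", entry_points=False
)


def load_jsonl(path: str) -> list:
    """Hypothetical loader for local, non-open data."""
    with open(path) as f:
        return [json.loads(line) for line in f]


# Register under the name the config will refer to.
registry.loaders.register("load_jsonl", func=load_jsonl)

# The function can now be looked up by its registered name:
loader = registry.loaders.get("load_jsonl")
```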

Then the config would be something like this:
```toml
[tasks]

[tasks.bornholmsk]

[tasks.bornholmsk.dataset]
@loaders="load_dataset"
path="strombergnlp/bornholmsk_parallel"

[tasks.bornholmsk.objective]
...

[tasks.nyheder]

[tasks.nyheder.dataset]
@loaders="load_dataset"
path="dat/nyheder.jsonl"

[tasks.nyheder.objective]
...
```
KennethEnevoldsen commented 11 months ago
  1. You can just specify an arbitrary list as an argument.

I think it would look something like:

```toml
[tasks.*]
@loaders="load_dataset"
path="strombergnlp/bornholmsk_parallel"

[tasks.*]
@loaders="load_dataset"
path="dat/nyheder.jsonl"
```
  2. Keep it minimal; we want to get a prototype up and running before we do anything else (and mean is a good default).
  3. For non-open datasets, we can load them as you suggest using `load_dataset`, but referring to a local dataset.
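Resolving such a config could then look roughly like this — a sketch assuming a `loaders` registry created via `catalogue`, with a dummy `load_dataset` standing in for the Datasets one so the example is self-contained:

```python
import catalogue
from confection import Config, registry

# Assumption: the project creates its own "loaders" registry.
registry.loaders = catalogue.create(
    "dfm_sentence_transformers", "loaders", entry_points=False
)


@registry.loaders.register("load_dataset")
def load_dataset(path: str) -> dict:
    # Dummy stand-in for datasets.load_dataset.
    return {"path": path}


CONFIG_STR = """
[tasks]

[tasks.bornholmsk]

[tasks.bornholmsk.dataset]
@loaders = "load_dataset"
path = "strombergnlp/bornholmsk_parallel"
"""

config = Config().from_str(CONFIG_STR)
# resolve() calls every registered function referenced with @loaders,
# passing the remaining keys of that block as arguments.
resolved = registry.resolve(config)
dataset = resolved["tasks"]["bornholmsk"]["dataset"]
```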
KennethEnevoldsen commented 11 months ago

Fixed in #5