I came up with the current config schema in a couple of hours while experimenting, and I think we should put more thought into it.
Challenges to be addressed here:
How many sections do we want, and what should they mean? (I've looked at spaCy for inspiration, but it's not the same thing, so some of their choices might not carry over to us.)
How much tinkering do we want to allow? That is: which hyperparameters do we want to control? Do we want to touch the pooling layer at all, or do we just accept that pooling == mean?
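If we do end up exposing pooling as a hyperparameter, the mean baseline is at least cheap to pin down. A sketch (hypothetical helper, plain NumPy for illustration, mask-aware so padding doesn't skew the average):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padded positions (mask == 0)."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=-2)
    counts = np.clip(mask.sum(axis=-2), 1e-9, None)  # avoid divide-by-zero on all-pad rows
    return summed / counts
```

For example, with two real tokens and one pad, `mean_pool(np.array([[1., 2.], [3., 4.], [9., 9.]]), np.array([1, 1, 0]))` gives `[2., 3.]` — the pad row is ignored.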
How do we describe tasks in a reproducible way? My vision is a config system where we can define the different tasks and datasets, but for that we will need some sort of system/process/schema. A couple of challenges concerning this:
Open and non-open datasets: some of the data we want to use is open, so we can access it from the Hugging Face Hub (or just put it up there), which is awesome. But we also need some way of loading non-open data. My first thought is that we could just register `load_dataset` from Datasets in a registry, like this:

```python
import catalogue
from confection import registry
from datasets import load_dataset

# confection registries are catalogue registries; create a custom "loaders" one
registry.loaders = catalogue.create("confection", "loaders", entry_points=False)
registry.loaders.register("load_dataset")(load_dataset)
```
Then the config would be something like this:
```ini
[tasks]
[tasks.bornholmsk]
[tasks.bornholmsk.dataset]
@loaders="load_dataset"
path="strombergnlp/bornholmsk_parallel"
[tasks.bornholmsk.objective]
...
[tasks.nyheder]
[tasks.nyheder.dataset]
@loaders="load_dataset"
path="dat/nyheder.jsonl"
[tasks.nyheder.objective]
...
```
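A config like the one above could then be loaded and resolved with confection, which calls the registered loader with the section's remaining keys as arguments. A minimal sketch — the `loaders` registry is assumed (confection doesn't ship one), and a dummy loader stands in for `load_dataset` so nothing is downloaded:

```python
import catalogue
from confection import Config, registry

# assumed setup: a catalogue-backed "loaders" registry
registry.loaders = catalogue.create("confection", "loaders", entry_points=False)

@registry.loaders.register("dummy_loader")  # stand-in for load_dataset
def dummy_loader(path: str):
    return {"path": path, "rows": []}

cfg = """
[tasks]

[tasks.bornholmsk]

[tasks.bornholmsk.dataset]
@loaders = "dummy_loader"
path = "strombergnlp/bornholmsk_parallel"
"""

config = Config().from_str(cfg)
resolved = registry.resolve(config)
print(resolved["tasks"]["bornholmsk"]["dataset"]["path"])
```

Resolution replaces each `@loaders` block with the return value of the registered function, so the task objects fall out of the config directly.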