stefan-it opened this issue 1 year ago
For me this sounds a bit like we need a real database with a GUI and dataset versioning.
Like this: https://github.com/bigscience-workshop/promptsource :thinking:
Ok, here's my favorite one: https://twitter.com/dvilasuero/status/1641164559888142336 (powered by @dvsrepo) :hugs:
We use Argilla now: https://github.com/LEL-A/GerAlpacaDataCleaned/pull/6
It also allows us to add metadata (translation model and original id) as well as sentence embeddings. Argilla itself allows us to label/flag a certain example into several categories, which can be seen as more sophisticated than just a `review_needed` flag.
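As an illustration, logging one enriched example could look roughly like this, assuming the Argilla 1.x Python client (the URL, API key, dataset name, and all values are placeholders, not taken from the PR):

```python
# Rough sketch, assuming the Argilla 1.x Python client; values are placeholders.
import argilla as rg

# Connect to a (hypothetical) local Argilla instance.
rg.init(api_url="http://localhost:6900", api_key="argilla.apikey")

record = rg.TextClassificationRecord(
    text="Gib drei Tipps, um gesund zu bleiben.",
    # Metadata: which model produced the translation, plus the id of the
    # original English example.
    metadata={"translation_model": "gpt-3.5-turbo", "original_id": 0},
    # A sentence embedding (placeholder values) for similarity search in the UI.
    vectors={"sentence_embedding": [0.12, -0.43, 0.88]},
)
rg.log(records=[record], name="ger-alpaca-cleaned")
```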
Have you seen https://github.com/thisserand/alpaca-lora-finetune-language? There the dataset was translated into German via Google Translate, DeepL and GPT-3.5. What do you think about including them in the dataset as different translation versions?
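Purely as an illustration of one possible shape (the keys below are hypothetical, not an agreed-on schema), the versions could be nested per translation system:

```json
{
  "instruction": "...",
  "translations": {
    "google_translate": {"instruction": "...", "input": "", "output": "..."},
    "deepl": {"instruction": "...", "input": "", "output": "..."},
    "gpt-3.5": {"instruction": "...", "input": "", "output": "..."}
  }
}
```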
Hi,
It would be a great improvement if the translated dataset could be enriched with more data or fields:

- The original data (`instruction`, `input` and `output`) can be included to have a better comparison of original and translated data.
- A `review_needed` flag should be added. Problematic or wrong examples can be detected (automatically or manually) and can then be flagged. On Slack we had a discussion about markdown tables, so one could easily write a markdown table detection script and flag the found examples with the `review_needed` option, so that these examples can be reviewed later (see the sketch after this list).
- Another issue to be discussed: do we want to "override" the existing `translated_german_alpaca.json`? Or should we introduce a new file for that? But is more than one "dataset" confusing?
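A minimal sketch of such a detection script (the table heuristic and the output file name are assumptions, not an agreed-on implementation):

```python
# Minimal sketch: flag examples that contain a markdown table so they can
# be reviewed later. Heuristic and output file name are assumptions.
import json
import re

# Crude heuristic: a markdown table has a separator row like "| --- | --- |".
TABLE_SEPARATOR = re.compile(
    r"^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)+\|?\s*$", re.MULTILINE
)

def contains_markdown_table(text: str) -> bool:
    return bool(TABLE_SEPARATOR.search(text))

with open("translated_german_alpaca.json", encoding="utf-8") as f:
    examples = json.load(f)

for example in examples:
    fields = (
        example.get("instruction", ""),
        example.get("input", ""),
        example.get("output", ""),
    )
    example["review_needed"] = any(contains_markdown_table(field) for field in fields)

# Hypothetical output name; see the open question above about overriding the file.
with open("translated_german_alpaca_enriched.json", "w", encoding="utf-8") as f:
    json.dump(examples, f, ensure_ascii=False, indent=2)
```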
### Concrete implementation
Concrete implementation steps would be to introduce the following new keys for each example in the dataset:
- `instruction`
- `input`
- `output`
- `review_needed` (Boolean, default: `false`)
- `translations` with `instruction`, `input` and `output` as keys

### Proof of concept
One example entry of that enriched dataset could look like:
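For instance (all strings below are illustrative placeholders, not taken from the actual files):

```json
{
  "instruction": "Give three tips for staying healthy.",
  "input": "",
  "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.",
  "review_needed": false,
  "translations": {
    "instruction": "Gib drei Tipps, um gesund zu bleiben.",
    "input": "",
    "output": "1. Ernähre dich ausgewogen. 2. Treibe regelmäßig Sport. 3. Schlafe ausreichend."
  }
}
```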