edchengg / gollie-transfusion


Training with new data #1

Open cerabinowitz opened 1 month ago

cerabinowitz commented 1 month ago

How can I train with a different language that gollie-transfusion did not train on yet?

edchengg commented 1 month ago

Hi, thanks for your question:

You can create training data by translating English data into your target language and projecting the labels with EasyProject (NLLB checkpoint: https://huggingface.co/ychenNLP/nllb-200-3.3B-easyproject). The basic idea is to wrap each labeled span in the English sentence with special markers, then run the NLLB model to translate the marked sentence into your target language. After that, use regular expressions on the translated sentence to derive the label for each token. For example:

Only [1] France [/1] and [2] Britain [/2] backed Fischer ’s proposal. --> ఫిషర్ ప్రతిపాదనకు [1] ఫ్రాన్స్ [/1] మరియు [2] బ్రిటన్ [/2] మాత్రమే మద్దతు ఇచ్చాయి.

1.1 Data encoding: src/tasks/wnut/easyproject_data.py
1.2 Translation and projection: src/run_easyproject.py
1.3 Label decoding (data loader class contains code to process translation data): src/tasks/wnut/data_loader.py
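To make the marker idea concrete, here is a minimal sketch of the encoding and decoding steps, leaving out the NLLB translation step in between. It is not the code from the repository files listed above: the function names `insert_markers` and `decode_tags`, the `location` label, and the BIO tag scheme are all assumptions made for illustration, and the decoder assumes the translated markers are separated from neighboring words by whitespace.

```python
import re

# Hypothetical helpers for illustration only; not the repository's API.

def insert_markers(tokens, spans):
    """Wrap each labeled span in [i] ... [/i] markers.

    spans: non-overlapping (start, end_exclusive, label) tuples over tokens.
    Returns the marked sentence and a dict mapping marker index -> label.
    """
    out, index_to_label = [], {}
    cursor = 0
    for idx, (start, end, label) in enumerate(sorted(spans), start=1):
        out.extend(tokens[cursor:start])   # tokens before the span
        out.append(f"[{idx}]")             # opening marker
        out.extend(tokens[start:end])      # the span itself
        out.append(f"[/{idx}]")            # closing marker
        index_to_label[idx] = label
        cursor = end
    out.extend(tokens[cursor:])            # tokens after the last span
    return " ".join(out), index_to_label


MARKER = re.compile(r"\[(/?)(\d+)\]")

def decode_tags(translated, index_to_label):
    """Strip markers from a translated sentence and return (tokens, BIO tags).

    Assumes every marker is a separate whitespace-delimited piece.
    """
    tokens, tags = [], []
    current, at_start = None, False
    for piece in translated.split():
        match = MARKER.fullmatch(piece)
        if match:
            if match.group(1):             # closing marker such as [/1]
                current = None
            else:                          # opening marker such as [1]
                current, at_start = int(match.group(2)), True
            continue
        if current in index_to_label:
            prefix = "B" if at_start else "I"
            tags.append(f"{prefix}-{index_to_label[current]}")
            at_start = False
        else:
            tags.append("O")
        tokens.append(piece)
    return tokens, tags


english = "Only France and Britain backed Fischer ’s proposal .".split()
marked, index_to_label = insert_markers(
    english, [(1, 2, "location"), (3, 4, "location")]  # labels are assumed
)

# Translation copied from the example in this thread.
translated = "ఫిషర్ ప్రతిపాదనకు [1] ఫ్రాన్స్ [/1] మరియు [2] బ్రిటన్ [/2] మాత్రమే మద్దతు ఇచ్చాయి."
target_tokens, target_tags = decode_tags(translated, index_to_label)
```

Real translation output can drop, duplicate, or glue markers to adjacent words, so a production decoder would need extra checks beyond this sketch.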

cerabinowitz commented 1 month ago

Thank you so much! How does one use src/tasks/wnut/easyproject_data.py for data encoding? There don't seem to be any options for how to use it.

edchengg commented 1 month ago

Sorry for not being clear! You can call the encode_data() function and save the output as a JSONL file (each line is a dict). I will try to prepare a script to demo this.
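Until that script exists, the "one dict per line" part can be sketched as follows. The signature and return value of encode_data() are not shown in this thread, so the `records` list below is only a hypothetical stand-in for its output, and the field names, `write_jsonl`, and `read_jsonl` are assumptions as well.

```python
import json
import os
import tempfile

def write_jsonl(records, path):
    """Write an iterable of dicts to path, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            # ensure_ascii=False keeps non-Latin text readable in the file
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSONL file back into a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]

# Hypothetical stand-in for whatever encode_data() returns.
records = [
    {"id": 0, "text": "Only [1] France [/1] and [2] Britain [/2] backed Fischer ’s proposal."},
]

# A temporary directory keeps the sketch free of side effects.
path = os.path.join(tempfile.mkdtemp(), "encoded.jsonl")
write_jsonl(records, path)
loaded = read_jsonl(path)
```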

cerabinowitz commented 1 month ago

Does the encode_data() function take in data from another language?