EshaanT / X-InSTA


Want code for Automated aligner generation (mT5) #3

Open shuimushuimu opened 1 month ago

shuimushuimu commented 1 month ago

Hi, I am reproducing your paper. Could you provide the code for Automated aligner generation (mT5)?

EshaanT commented 1 month ago

Hey, sorry for the late response. Yes, you are correct: the random setup selects demonstrations from any language other than the target language. However, this is not the setup we used to compare random prompting and X-InSTA. I used the `src_is_cross` setup to run the random-selection baseline, which, as you'll see, does not select a random source language.

While working on this project I had also explored whether language order can help in ICL. For example, imagine your target is Mandarin and you have multiple source languages (English, Spanish, or German). In such a scenario, is there an optimal way to structure the transitions between languages to help cross-lingual prompting? The idea stemmed from language family trees and whether they can be used to guide the model. However, we did not explore this aspect much, because there was little variation in language family across our datasets, so we restricted ourselves to one-to-one pairing between source and target languages. I will edit out that code.

Sorry for the confusion.

Regards,
Eshaan Tanwar
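The distinction above can be sketched roughly like this (a minimal illustration, not the repo's actual code; `sample_demos` and the toy DataFrame are made up for this example):

```python
import pandas as pd

# Minimal sketch (not the repo's actual code) contrasting the 'random'
# setup, which samples demonstrations from any language other than the
# target, with the one-to-one pairing, which fixes a single source language.
def sample_demos(df_train, target_lang, k, seed, src_lang=None):
    if src_lang is None:
        # 'random' setup: pool is every language except the target
        pool = df_train[~df_train['language'].isin([target_lang])]
    else:
        # one-to-one pairing: pool is a single fixed source language
        pool = df_train[df_train['language'].isin([src_lang])]
    return pool.sample(n=k, random_state=seed).reset_index(drop=True)

df = pd.DataFrame({
    'text': ['t1', 't2', 't3', 't4', 't5', 't6'],
    'language': ['en', 'de', 'es', 'en', 'de', 'es'],
})
random_pool = sample_demos(df, 'es', k=2, seed=42)                 # en or de
paired_pool = sample_demos(df, 'es', k=2, seed=42, src_lang='en')  # en only
```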

On Sun, May 26, 2024 at 12:50 PM shuimushuimu @.***> wrote:

Hi, I am reproducing your paper. I see that your paper compares X-InSTA against random prompting. However, in your code `utils/data.py`, when I set `set_up` to `'random'`, the logic is as follows:

```python
def create_few_shots(dataset_name='amaz_bi', src_l=None, k=16,
                     seeds=[32, 5, 232, 100, 42], set_up='src_is_cross'):
    """
    :param dataset_name: name of the csv file
    :param src_l: a list of languages to sample from; if None is passed,
        we sample from all languages other than the target
    :param k: number of samples to take, default 16
    :param seeds: 5 seeds, default [32, 5, 232, 100, 42]
    :param set_up: name of the demonstration sampling technique
    :return: saves a json list of dictionaries of the form
        [{'input': text,
          'demonstrations': {dict of k-shot text-demonstration pairs},
          'output': label}]
    """
    train_set = f'data/processed/{dataset_name}_train.csv'
    test_set = f'data/processed/{dataset_name}_test.csv'

    if set_up in ['sim_in_cross']:
        from sentence_transformers import SentenceTransformer, util
        import torch
        embedder = SentenceTransformer('distiluse-base-multilingual-cased-v1')

    for s in seeds:
        df_train = pd.read_csv(train_set)
        df_test = pd.read_csv(test_set)

        print(f'For seed {s} and dataset {dataset_name} creating {k} few shots of {set_up} set_up')

        languages = task2lang[dataset_name]

        for l in languages:

            if set_up == 'random':

                src = 'all'

                """Sample demonstrations from all languages other than l"""
                df_train_l = df_train[~df_train['language'].isin([l])].reset_index(drop=True)
                df_test_l = df_test[df_test['language'].isin([l])].reset_index(drop=True)

                """Making sure we get the same number of labels of each kind"""
                demo_df = sample_from_dataframe(df_train_l, k, seed=s)
                assert len(demo_df) == k

                """Converting into our standard form"""
                test_final = input_form_converter(dataset_name, df_test_l, demo_df)
                save_file(test_final, dataset_name, l, set_up, src, s, k)

            # ...
```

This indicates that, for example, when your target language is English, your demonstrations can come from any other language, which does not match your paper. Did you upload the wrong version of the code?

— Reply to this email directly or view it on GitHub: https://github.com/EshaanT/X-InSTA/issues/3

shuimushuimu commented 4 weeks ago

Thank you for your answer! I understand your logic now. Could you also provide your code for Automated aligner generation (mT5)? I cannot find the function for predicting masked tokens.
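While the aligner code is unavailable, here is a rough sketch of how mT5-style masked-span (sentinel) prediction works in general: mT5 replaces each masked span with a sentinel token `<extra_id_0>`, `<extra_id_1>`, ..., and the model emits the missing spans after the matching sentinels. The helper names below are made up, and the commented-out model call is only an assumption about how such a generation step might invoke mT5, not the paper's actual code:

```python
import re

# Hypothetical helpers for mT5 span-filling (sentinel-token) prediction.
# These are illustrative only, not the paper's aligner-generation code.

def mask_spans(tokens, span_indices):
    """Replace each (start, end) token span with an mT5 sentinel token."""
    out, sid, i = [], 0, 0
    spans = sorted(span_indices)
    while i < len(tokens):
        if sid < len(spans) and i == spans[sid][0]:
            out.append(f'<extra_id_{sid}>')
            i = spans[sid][1]  # skip over the masked span
            sid += 1
        else:
            out.append(tokens[i])
            i += 1
    return ' '.join(out)

def parse_sentinel_output(decoded):
    """Split decoded output '<extra_id_0> span0 <extra_id_1> span1 ...' into spans."""
    parts = re.split(r'<extra_id_\d+>', decoded)
    return [p.strip() for p in parts[1:]]  # text before the first sentinel is dropped

# A real model call would look roughly like this (assumption, not run here):
# from transformers import MT5ForConditionalGeneration, AutoTokenizer
# tok = AutoTokenizer.from_pretrained('google/mt5-small')
# model = MT5ForConditionalGeneration.from_pretrained('google/mt5-small')
# ids = tok(mask_spans(text.split(), [(2, 3)]), return_tensors='pt').input_ids
# decoded = tok.decode(model.generate(ids)[0], skip_special_tokens=False)
# spans = parse_sentinel_output(decoded)
```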