forrestdavis / NLPScholar

Tools for training an NLP Scholar
GNU General Public License v3.0

[BUG/ERROR] Interim output for `analyze` mode doesn't match description #11

Closed forrestdavis closed 1 week ago

forrestdavis commented 1 week ago

Describe the bug

The byROI file is stated to be a "TSV file with one row for each sentence. The columns are the mean and probability and surprisal aggregated over the ROI for the corresponding sentence." Presently, the probability isn't returned and, more importantly, some sentences are dropped, so we are not getting one row per sentence in a transparent way. I assume some aggregation is happening over lemmas or conditions. At a minimum, it would be good to surface the sentid so the output could be aligned with the input. This was pointed out to me by @gia-kalro.
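A minimal sketch of the alignment check being asked for, using made-up sentids (the `sentid` column name follows the TSV files in this issue; the values are invented). If byROI surfaced sentid, the dropped sentences would be visible directly:

```python
import pandas as pd

# Hypothetical stand-ins for the condition file and the byROI output
cond = pd.DataFrame({"sentid": [0, 1, 2, 3], "ROI": [2, 2, 2, 2]})
by_roi = pd.DataFrame({"sentid": [0, 2], "surp": [5.1, 4.7]})

# Sentences present in the input but missing from the byROI file
dropped = sorted(set(cond["sentid"]) - set(by_roi["sentid"]))
print(dropped)  # -> [1, 3]
```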

To Reproduce

Steps to reproduce the behavior:

  1. Navigate to NLPScholar
  2. Download the following data file simple_agrmt_all.txt
  3. Create the following config
exp: MinimalPair

mode:
    - evaluate
    - analyze

models:
    hf_masked_model:
        - bert-large-cased

datafpath: simple_agrmt_all.txt
predfpath: pred_simple_agrmt_all.tsv
resultsfpath: res_simple_agrmt_all.tsv
  4. Run main.py
  5. Note that in res_simple_agrmt_all.tsv there is not one row for each sentence

Expected behavior

Expected one row per sentence

Observed behavior

Missing rows


forrestdavis commented 1 week ago

A temporary workaround for single-word ROIs (e.g., `2`, not `2,3`) is copied below. Change the file names at the bottom to match your setup.

import pandas as pd

def token_to_word(preddat):

    # Change word positions of punctuation
    handle_punctuation(preddat)

    # Summarize over word position
    groupby_cols = ['sentid', 'wordpos_mod', 'model']

    # Sum subword surprisals into a word surprisal. Note: probabilities cannot
    # be summed this way; they would need 'mean' (or a product) instead.
    summ = preddat.groupby(groupby_cols).agg({'surp': 'sum'}).reset_index()

    # Realign wordpositions (remove gaps from handling punctuation)

    summ = (summ.groupby(['model'])
                .apply(remove_gaps, colname='wordpos_mod', include_groups=False)
                .reset_index())

    return summ

def remove_gaps(grouped_df, colname):

    prev=-1 #manually keep track because row indices not consecutive in grouped dataframe

    for i, row in grouped_df.iterrows():
        if prev != -1: #skip first row in grouped_df
            diff = row[colname]-grouped_df.loc[prev, colname]
            if diff>1:
                grouped_df.loc[i,colname]-=(diff-1)

        prev=i
    return grouped_df

def handle_punctuation(dat):
    # Fold punctuation tokens into the preceding word position
    dat['wordpos_mod'] = dat['wordpos']
    is_punct = dat['punctuation'] == True
    dat.loc[is_punct, 'wordpos_mod'] = dat.loc[is_punct, 'wordpos'] - 1

def create_summary(evaluate_fname, conditions_fname,
                   roi_fname):

    cond_data = pd.read_csv(conditions_fname, sep='\t')
    eval_data = pd.read_csv(evaluate_fname, sep='\t')

    by_word = token_to_word(eval_data)

    new_data = {}
    MODELS = set()
    for _, row in cond_data.iterrows():
        sentid = row['sentid']
        ROI = row['ROI']
        surp = by_word[by_word['sentid'] == sentid]
        if 'sentid' not in new_data:
            new_data['sentid'] = []
        new_data['sentid'].append(sentid)
        surp = surp[surp['wordpos_mod'] == ROI]

        models = surp['model'].tolist()
        surps = surp['surp'].tolist()
        for model, model_surp in zip(models, surps):  # avoid shadowing the surp frame
            MODELS.add(model)
            if model not in new_data:
                new_data[model] = []
            new_data[model].append(model_surp)

    new_data = pd.DataFrame.from_dict(new_data)
    summary = pd.merge(cond_data, new_data, on='sentid')

    summary = pd.melt(summary, id_vars=cond_data.columns,
                  value_vars = list(MODELS),
                  value_name = 'surp',
                  var_name = 'model')
    summary.to_csv(roi_fname, index=False, sep='\t')

conditions_fname = 'data/simple_agrmt_all.tsv'
evaluate_fname = 'predictions/simple_agrmt_all.tsv'
roi_fname = 'results/simple_agrmt_all_byROI.tsv'

create_summary(evaluate_fname, conditions_fname, roi_fname)
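The punctuation-merging step in the workaround can be exercised standalone on a tiny synthetic token frame (column names mirror the script above; the values are invented). Punctuation is folded into the preceding word position, then surprisal is summed per word; the leftover gap at the punctuation's old position is what `remove_gaps` closes afterwards:

```python
import pandas as pd

# Made-up per-token predictions for one sentence; token at wordpos 2 is punctuation
tok = pd.DataFrame({
    "sentid":      [0, 0, 0, 0],
    "wordpos":     [0, 1, 2, 3],
    "punctuation": [False, False, True, False],
    "model":       ["m"] * 4,
    "surp":        [1.0, 2.0, 0.5, 3.0],
})

# Fold punctuation into the preceding word position...
tok["wordpos_mod"] = tok["wordpos"]
tok.loc[tok["punctuation"], "wordpos_mod"] -= 1

# ...then sum surprisal per (sentence, word position, model)
summ = (tok.groupby(["sentid", "wordpos_mod", "model"])
           .agg({"surp": "sum"})
           .reset_index())
print(summ["surp"].tolist())  # -> [1.0, 2.5, 3.0]; positions [0, 1, 3] leave a gap at 2
```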

forrestdavis commented 1 week ago

It's also quite possible there is a way around this in how the datafile is made or in the config.

forrestdavis commented 1 week ago

@Brian030601 notes that "For this, I think the new version of the NLPScholar got things mixed up when k-lemma was added. When I used October 6th version of MinimalPair.py instead of the current one, everything worked and I got the probability." So a solution could be resetting your local copy to commit bf9b5d17fdb96a26a476b9727f13d2be741f199c.

grushaprasad commented 1 week ago

Should be fixed with 13b58fb. Running the config file on the given text file should now return one line per sentence. Also updated the documentation: byROI returns the mean and sum of the specified predictability measure (prob or surp).
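As a sketch of the documented byROI aggregation (column names follow the TSVs earlier in the thread; the data and the per-sentence ROI positions are invented): for each sentence, the per-word measure is reduced over the ROI positions with both mean and sum, yielding one row per sentence:

```python
import pandas as pd

# Invented per-word predictions for two sentences
by_word = pd.DataFrame({
    "sentid":  [0, 0, 0, 1, 1, 1],
    "wordpos": [0, 1, 2, 0, 1, 2],
    "surp":    [1.0, 4.0, 2.0, 3.0, 5.0, 1.0],
})
rois = {0: [1, 2], 1: [1]}  # ROI word positions per sentence (made up)

# Keep only ROI words, then aggregate per sentence with mean and sum
in_roi = by_word[by_word.apply(lambda r: r["wordpos"] in rois[r["sentid"]], axis=1)]
by_roi = in_roi.groupby("sentid")["surp"].agg(["mean", "sum"]).reset_index()
print(by_roi)  # one row per sentence, with mean and sum over the ROI
```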