Closed — forrestdavis closed this issue 1 week ago
A temporary workaround for single ROIs (e.g., `2`, not `2,3`) is copied below. Simply change the file names at the bottom.
```python
import pandas as pd

def token_to_word(preddat):
    # Change word positions of punctuation
    handle_punctuation(preddat)
    # Summarize over word position
    groupby_cols = ['sentid', 'wordpos_mod', 'model']
    summ = preddat.groupby(groupby_cols).agg({'surp': 'sum'}).reset_index()  # note: prob is always mean, cannot be sum
    # Realign word positions (remove gaps from handling punctuation)
    summ = summ.groupby(['model']).apply(remove_gaps, colname='wordpos_mod',
                                         include_groups=False).reset_index()
    return summ

def remove_gaps(grouped_df, colname):
    prev = -1  # manually track the previous index; row indices are not consecutive in a grouped dataframe
    for i, row in grouped_df.iterrows():
        if prev != -1:  # skip first row in grouped_df
            diff = row[colname] - grouped_df.loc[prev, colname]
            if diff > 1:
                grouped_df.loc[i, colname] -= (diff - 1)
        prev = i
    return grouped_df

def handle_punctuation(dat):
    dat['wordpos_mod'] = dat['wordpos']
    dat.loc[dat['punctuation'] == True, 'wordpos_mod'] = \
        dat.loc[dat['punctuation'] == True, 'wordpos'] - 1

def create_summary(evaluate_fname, conditions_fname, roi_fname):
    cond_data = pd.read_csv(conditions_fname, sep='\t')
    eval_data = pd.read_csv(evaluate_fname, sep='\t')
    by_word = token_to_word(eval_data)
    new_data = {}
    MODELS = set()
    for _, row in cond_data.iterrows():
        sentid = row['sentid']
        ROI = row['ROI']
        surp = by_word[by_word['sentid'] == sentid]
        if 'sentid' not in new_data:
            new_data['sentid'] = []
        new_data['sentid'].append(sentid)
        surp = surp[surp['wordpos_mod'] == ROI]
        models = surp['model'].tolist()
        surps = surp['surp'].tolist()
        for model, surp in zip(models, surps):
            MODELS.add(model)
            if model not in new_data:
                new_data[model] = []
            new_data[model].append(surp)
    new_data = pd.DataFrame.from_dict(new_data)
    summary = pd.merge(cond_data, new_data, on='sentid')
    summary = pd.melt(summary, id_vars=list(cond_data.columns),
                      value_vars=list(MODELS),
                      value_name='surp',
                      var_name='model')
    summary.to_csv(roi_fname, index=False, sep='\t')

conditions_fname = 'data/simple_agrmt_all.tsv'
evaluate_fname = 'predictions/simple_agrmt_all.tsv'
roi_fname = 'results/simple_agrmt_all_byROI.tsv'

create_summary(evaluate_fname, conditions_fname, roi_fname)
```
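To see what the token-to-word step does, here is a small self-contained sketch on made-up toy data (column names follow the script above). It folds punctuation tokens into the preceding word position, sums surprisal per word, and then closes the resulting gap; it uses `rank(method='dense')` as a compact alternative to the row-by-row `remove_gaps` loop:

```python
import pandas as pd

# Toy token-level predictions for one sentence and one model.
# 'punctuation' marks tokens whose surprisal should be folded into
# the preceding word.
toks = pd.DataFrame({
    'sentid':      [0, 0, 0, 0],
    'wordpos':     [0, 1, 2, 3],
    'punctuation': [False, False, True, False],
    'model':       ['m1'] * 4,
    'surp':        [1.0, 2.0, 0.5, 3.0],
})

# Fold punctuation into the preceding word position
toks['wordpos_mod'] = toks['wordpos'].where(~toks['punctuation'],
                                            toks['wordpos'] - 1)

# Sum surprisal per (sentence, word position, model)
summ = (toks.groupby(['sentid', 'wordpos_mod', 'model'])
            .agg({'surp': 'sum'})
            .reset_index())

# Close the gap left where the punctuation token used to sit
summ['wordpos_mod'] = summ.groupby(['sentid', 'model'])['wordpos_mod'] \
                          .rank(method='dense').astype(int) - 1

print(summ)
#    sentid  wordpos_mod model  surp
# 0       0            0    m1   1.0
# 1       0            1    m1   2.5
# 2       0            2    m1   3.0
```

The punctuation token's 0.5 surprisal is absorbed into word position 1, and positions come out consecutive again.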
It's also quite possible there is a way around this in how the data file is made or in the config.
@Brian030601 notes that "For this, I think the new version of the NLPScholar got things mixed up when k-lemma was added. When I used October 6th version of MinimalPair.py instead of the current one, everything worked and I got the probability." So a solution could be resetting your local copy to commit bf9b5d17fdb96a26a476b9727f13d2be741f199c.
Should be fixed with 13b58fb. Running the config file on the given text file should now return one line per sentence. The documentation has also been updated to note that byROI returns the mean and the sum of the specified predictability measure (prob or surp).
Describe the bug
The byROI file is stated to be a "TSV file with one row for each sentence. The columns are the mean and probability and surprisal aggregated over the ROI for the corresponding sentence." Presently the probability isn't returned, and, more importantly, some sentences are dropped, so we are not getting a row for each sentence in a transparent way. I assume there is some aggregation happening over lemmas or conditions. It would be good to minimally surface the sentid so the output could be aligned. This was pointed out to me by @gia-kalro.
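Once the sentid is surfaced, checking which sentences were dropped is straightforward. A hypothetical diagnostic (not part of NLPScholar) using a left merge with an indicator column:

```python
import pandas as pd

# Hypothetical: all sentids from the conditions file vs. those
# that actually appear in the byROI output
cond = pd.DataFrame({'sentid': [0, 1, 2, 3]})
byroi = pd.DataFrame({'sentid': [0, 2], 'surp': [1.0, 2.0]})

# Left-merge with indicator=True to surface sentences with no output row
check = cond.merge(byroi, on='sentid', how='left', indicator=True)
dropped = check.loc[check['_merge'] == 'left_only', 'sentid'].tolist()
print(dropped)  # [1, 3]
```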
To Reproduce
Steps to reproduce the behavior:
1. Run main.py
2. Open res_simple_agrmt_all.tsv
3. Note that there is not one row for each sentence

Expected behavior
Expected one row per sentence
Observed behavior
Missing rows
Setup (please complete the following information)