alvations / stasis

Semantic Textual Similarity in Python

empty lines #1

Open adirc opened 8 years ago

adirc commented 8 years ago

There are a lot of empty lines in the gs files, /STS2015-gold/STS.gs.headlines.txt for example. Do they mean something, or is the label just missing?

alvations commented 8 years ago

I had the same question when I first played around with the dataset. The empty lines just mean that those sentence pairs were not annotated by the STS organizers. Not all sentence pairs are used in the evaluation.

Daniel Cer explains this at https://groups.google.com/d/msg/sts-semeval/js-Y0e92YuM/jJUi5beJBwAJ
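If you read the raw files directly, a minimal sketch like the one below skips the unannotated pairs. It assumes the gs file has one score (or a blank line) per line, aligned line by line with an input file that holds two tab-separated sentences per line. The function names `read_gold_scores` and `annotated_pairs` are made up for this example and are not part of stasis.

```python
def read_gold_scores(lines):
    """Parse gold-standard lines into floats; blank lines become None."""
    scores = []
    for line in lines:
        text = line.strip()
        # A blank line means the pair was not annotated.
        scores.append(float(text) if text else None)
    return scores


def annotated_pairs(pair_lines, gold_lines):
    """Keep only the sentence pairs whose gold line is not blank.

    Assumes each pair line holds two tab-separated sentences and that
    both inputs are aligned line by line.
    """
    scores = read_gold_scores(gold_lines)
    kept = []
    for pair_line, score in zip(pair_lines, scores):
        if score is None:
            continue
        parts = pair_line.rstrip("\n").split("\t")
        kept.append((parts[0], parts[1], score))
    return kept
```

Both functions take any iterable of lines, so you can pass open file handles for the input and gs files. Note that a score of 0.0 is kept; only truly blank lines are dropped.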

alvations commented 8 years ago

To slurp the STS data into an SFrame, I usually do this:

import sframe
# Read the STS 2012-2015 dataset from a tab-separated file.
sts_train = sframe.SFrame.read_csv('sts.csv', delimiter='\t', column_type_hints=[str, str, float, str, str], quote_char='\0')
# Drop the sentence pairs with empty annotations.
sts_train = sts_train.dropna(columns=['Score'])
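If you don't have sframe installed, roughly the same filtering can be sketched with the standard library. This is not what the notebooks do, just an approximation: it assumes a tab-separated file with a header row that contains a `Score` column, and it turns quoting off since the sentences can contain quote characters. `read_annotated_rows` and the other column names in the sample are hypothetical.

```python
import csv
import io


def read_annotated_rows(handle, score_column="Score"):
    """Read tab-separated rows and drop those with an empty score."""
    # QUOTE_NONE treats quote characters in the sentences as plain text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        raw = (row.get(score_column) or "").strip()
        if not raw:
            continue  # unannotated pair
        row[score_column] = float(raw)
        rows.append(row)
    return rows


# Tiny in-memory sample; every column name except Score is hypothetical.
sample = (
    "Dataset\tDomain\tScore\tSent1\tSent2\n"
    "STS2015\theadlines\t3.5\tfirst \"quoted\" headline\tsecond headline\n"
    "STS2015\theadlines\t\tunannotated one\tunannotated two\n"
)
rows = read_annotated_rows(io.StringIO(sample))
```

For a real file, pass `open('sts.csv', newline='', encoding='utf-8')` instead of the `io.StringIO` sample.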

Take a look at https://github.com/alvations/stasis/blob/master/notebooks/SWORD.ipynb and https://github.com/alvations/stasis/blob/master/notebooks/SHIELD.ipynb for more details =)