Open adirc opened 8 years ago
I had the same question when I first played around with the dataset. The empty lines just mean that those sentence pairs were not annotated by the STS organizers. Not all sentence pairs are used in the evaluation.
Daniel Cer explains this at https://groups.google.com/d/msg/sts-semeval/js-Y0e92YuM/jJUi5beJBwAJ
To slurp the STS data into an sframe dataframe, I usually do this:
import sframe
# Reads STS2012-2015 dataset.
sts_train = sframe.SFrame.read_csv('sts.csv', delimiter='\t', column_type_hints=[str, str, float, str, str], quote_char='\0')
# Drop the sentence pairs with empty annotations.
sts_train = sts_train.dropna(columns=['Score'])
Take a look at https://github.com/alvations/stasis/blob/master/notebooks/SWORD.ipynb and https://github.com/alvations/stasis/blob/master/notebooks/SHIELD.ipynb for more details =)
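If you don't have sframe installed, the same filtering can be sketched in plain Python. This is only a minimal sketch: it assumes the gs file has one line per sentence pair in the matching input file, that the input file is tab-separated with the two sentences in the first two fields, and that an empty gs line means "not annotated". The function name `load_annotated_pairs` and the toy data are made up for illustration.

```python
def load_annotated_pairs(input_lines, gold_lines):
    """Pair each sentence pair with its gold score, skipping unannotated pairs."""
    pairs = []
    for sent_line, gold_line in zip(input_lines, gold_lines):
        gold = gold_line.strip()
        if not gold:
            # Empty gs line: this pair was not annotated, so skip it.
            continue
        # Assumption: the two sentences are the first two tab-separated fields.
        s1, s2 = sent_line.rstrip('\n').split('\t')[:2]
        pairs.append((s1, s2, float(gold)))
    return pairs


# Toy in-memory lines standing in for an input file and its gs file.
input_lines = [
    "A man is playing a guitar.\tA man plays the guitar.\n",
    "A dog runs in the park.\tStocks fell on Monday.\n",
    "Two kids are eating.\tA plane is taking off.\n",
]
gold_lines = ["4.2\n", "\n", "0.0\n"]

pairs = load_annotated_pairs(input_lines, gold_lines)
for s1, s2, score in pairs:
    print(score, s1, s2, sep='\t')
```

With real files you would pass the two open file handles instead of the lists, e.g. the gs file and its corresponding input file opened with `open(...)`.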
There are a lot of empty lines in the gs files, /STS2015-gold/STS.gs.headlines.txt for example. Do they mean something, or is the label just missing?