PhilipMay / stsb-multi-mt

Machine translated multilingual STS benchmark dataset.

ja dataset seems to have one line without text #1

Open PhilipMay opened 3 years ago

PhilipMay commented 3 years ago

Loading the ja train split fails on a row whose sentences are empty. The printed record and the traceback:

/home/mike/.cache/huggingface/datasets/downloads/f12d10e84012b8e838b51ba4b122f22ffaecca1178a0925e88b8aefd5579c034 {'sentence1': '', 'sentence2': '', 'similarity_score': '1.2'}
Traceback (most recent call last):
  File "x.py", line 5, in <module>
    dataset = load_dataset('stsb_multi_mt', LANG, split='train')
  File "/home/mike/dev/fork/datasets/src/datasets/load.py", line 741, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/mike/dev/fork/datasets/src/datasets/builder.py", line 578, in download_and_prepare
    self._download_and_prepare(
  File "/home/mike/dev/fork/datasets/src/datasets/builder.py", line 656, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/mike/dev/fork/datasets/src/datasets/builder.py", line 976, in _prepare_split
    for key, record in utils.tqdm(
  File "/home/mike/miniconda3/envs/datasets-dev/lib/python3.8/site-packages/tqdm/std.py", line 1133, in __iter__
    for obj in iterable:
  File "/home/mike/.cache/huggingface/modules/datasets_modules/datasets/stsb_multi_mt/c37541c03e69560a27343e49c46f9aac2701c64c1b712c6e21ab9791c87eb43e/stsb_multi_mt.py", line 144, in _generate_examples
    assert len(row['sentence1'].strip()) > 0
AssertionError

PhilipMay commented 3 years ago

Link to the line with the issue: https://github.com/PhilipMay/stsb-multi-mt/blob/4637091b072bcfc59bca5b9d03e73db4ed95894b/data/stsb-ja-train.csv#L2430
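
To check whether other rows have the same problem, a small scan along these lines might help. This is only a sketch: it assumes the CSV has no header row and that its columns are sentence1, sentence2, similarity_score in that order (taken from the keys in the printed record above). The helper name find_empty_rows is made up for illustration and is not part of the dataset script.

```python
import csv
import io

# Assumed column order, taken from the keys of the record printed in the log.
FIELDNAMES = ["sentence1", "sentence2", "similarity_score"]


def find_empty_rows(lines):
    """Return (line_number, row) pairs whose sentence1 or sentence2 is blank.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    reader = csv.DictReader(lines, fieldnames=FIELDNAMES)
    bad = []
    for row in reader:
        # Missing columns come back as None, so fall back to "" before strip().
        s1 = (row["sentence1"] or "").strip()
        s2 = (row["sentence2"] or "").strip()
        if not s1 or not s2:
            bad.append((reader.line_num, row))
    return bad


# Tiny in-memory sample; the second line mimics the broken record.
sample = io.StringIO('"A man is playing.","A man plays.",4.2\n"","",1.2\n')
for line_num, row in find_empty_rows(sample):
    print(line_num, row)
```

For the real file, something like `with open("data/stsb-ja-train.csv", newline="", encoding="utf-8") as f: find_empty_rows(f)` should work, assuming the file is UTF-8 encoded.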