bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0
77 stars 48 forks

Create dataset vicon_visim400 #126

Open albertvillanova opened 2 years ago

albertvillanova commented 2 years ago

DONE: https://huggingface.co/datasets/bigscience-catalogue-data/vicon_visim400

albertvillanova commented 2 years ago

It needs support for ZIP:

albertvillanova commented 2 years ago

ERROR:


UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 23: invalid start byte
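
A minimal sketch of one way to work around an error like the one above when reading text files out of a ZIP archive: try UTF-8 first, then fall back to other encodings. The candidate encoding list (`cp1258`) and the helper names `decode_with_fallback` and `read_zip_texts` are assumptions for illustration; the thread does not say how the source files are actually encoded or how the loader was fixed.

```python
import io
import zipfile

# Assumed candidates: the actual encoding of the dataset files is not stated in this issue.
CANDIDATE_ENCODINGS = ("utf-8", "cp1258")


def decode_with_fallback(raw, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding_used), trying each candidate encoding in order."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, but mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


def read_zip_texts(zip_source):
    """Decode every regular file in a ZIP archive (path or file-like object)."""
    texts = {}
    with zipfile.ZipFile(zip_source) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            texts[info.filename] = decode_with_fallback(zf.read(info))
    return texts


# Usage with an in-memory archive: one valid UTF-8 member, one containing byte 0x87.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("utf8.txt", "xin chào".encode("utf-8"))
    zf.writestr("legacy.txt", b"abc\x87def")
buffer.seek(0)

for name, (text, enc) in read_zip_texts(buffer).items():
    print(name, enc, repr(text))
```

Trying strict UTF-8 first keeps correctly encoded files untouched; the fallback only applies to files that would otherwise raise `UnicodeDecodeError`. A successful fallback decode does not prove the guessed encoding is the right one, so the output is worth spot-checking.
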

albertvillanova commented 2 years ago

These datasets just contain pairs of words:

I don't think these are appropriate to train a Language Model. CC: @yjernite