Toloka / crowd-kit

Control the quality of your labeled data with the Python tools you already know.
https://crowd-kit.readthedocs.io/

[DOCS] test datasets #98

Closed by Mind-the-Cap 9 months ago

Mind-the-Cap commented 9 months ago

Problem description

Hi, I'm reviewing the library for JOSS https://github.com/openjournals/joss-reviews/issues/6227

In your tutorials, you suggest using labeled_train_data.tsv; however, I cannot find this data. Is it provided somewhere?

Thanks!

Documentation links

https://github.com/Toloka/crowd-kit/blob/main/examples/ECIR2023-Intents.ipynb

Potential fix suggestion

No response

dustalov commented 9 months ago

Hi, thank you! We can try recovering it, let me ask my colleagues.

dustalov commented 9 months ago

@denaxen @aliskin @pilot7747 I could not find the ECIR '23 tutorial dataset at https://github.com/clinc/oos-eval/tree/master/data and https://toloka.ai/events/ecir-tutorial-2023/. Maybe we have these files, labeled_train_data.tsv and labeled_test_data.tsv, anywhere in Toloka's Google Drive or S3?

dustalov commented 9 months ago

We found the files in the supplementary materials for our tutorial: https://drive.google.com/drive/folders/1jMNkCs1DzJESiL2-8Pr8NJhqyi7GZJp5?usp=sharing (Labeled data subdirectory). I can adjust the notebook accordingly.

dustalov commented 9 months ago

@Mind-the-Cap fixed in e8086abfd10379db7cafe0276a6e733fbf7c417c. The notebook now downloads the files using gdown.
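
For readers who want to fetch the same files outside the notebook, here is a minimal sketch of a gdown-based download. It is not the code from the commit above, which is not shown in this thread. It assumes the Google Drive folder linked earlier is still shared publicly and that gdown is installed (`pip install gdown`). The helper names `find_missing` and `fetch_tutorial_data` and the target directory `ecir2023_data` are hypothetical, chosen for illustration.

```python
from pathlib import Path

# Shared folder mentioned earlier in this thread (supplementary tutorial materials).
FOLDER_URL = (
    "https://drive.google.com/drive/folders/"
    "1jMNkCs1DzJESiL2-8Pr8NJhqyi7GZJp5?usp=sharing"
)

# File names the tutorial notebook refers to.
EXPECTED = ("labeled_train_data.tsv", "labeled_test_data.tsv")


def find_missing(root, names=EXPECTED):
    """Return the expected file names not found anywhere under `root`.

    The search is recursive because the files sit in a subdirectory
    of the shared folder ("Labeled data").
    """
    root = Path(root)
    if not root.is_dir():
        return list(names)
    found = {p.name for p in root.rglob("*") if p.is_file()}
    return [name for name in names if name not in found]


def fetch_tutorial_data(target="ecir2023_data"):
    """Download the shared folder into `target` unless the files already exist.

    Hypothetical helper. Returns the list of files still missing afterwards,
    so an empty list means both TSV files are available locally.
    """
    missing = find_missing(target)
    if missing:
        # Imported lazily so the check above works even without gdown installed.
        import gdown

        gdown.download_folder(url=FOLDER_URL, output=str(target), quiet=True)
        missing = find_missing(target)
    return missing
```

Calling `fetch_tutorial_data()` once should leave both TSV files somewhere under `ecir2023_data`; a non-empty return value signals that the folder layout or sharing settings have changed since this thread was written.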

Mind-the-Cap commented 9 months ago

Great, thanks!