Add NetEaseCrowd dataset

Toloka / crowd-kit

Control the quality of your labeled data with the Python tools you already know.

https://crowd-kit.readthedocs.io/

Add NetEaseCrowd dataset #101

Closed shenxiangzhuang closed 8 months ago

shenxiangzhuang commented 8 months ago

Checklist

[x] I have read the CONTRIBUTING document
[x] I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
[ ] My change requires a change to the documentation
[ ] I have updated the documentation accordingly
[x] I have added tests to cover my changes
[x] All new and existing tests passed

Dataset info

Adding our open-source dataset, NetEaseCrowd (https://github.com/fuxiAIlab/NetEaseCrowd-Dataset).

NetEaseCrowd is a large-scale dataset for long-term and online crowdsourcing truth inference, which contains about 2,400 workers, 1,000,000 tasks, and 6,000,000 annotations collected over 6 months. We believe that this dataset could be an invaluable asset to the Toloka/crowd-kit community by providing a new benchmark for crowdsourcing-related research and development.

codecov-commenter commented 8 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 92.96%. Comparing base (07c4240) to head (08440a2). Report is 34 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #101      +/-   ##
==========================================
+ Coverage   92.80%   92.96%   +0.15%
==========================================
  Files          47       47
  Lines        2070     2216     +146
==========================================
+ Hits         1921     2060     +139
- Misses        149      156       +7
```

shenxiangzhuang commented 8 months ago

Besides the CI tests, I also tried using this dataset for categorical aggregation, and it works well:

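from crowdkit.aggregation import DawidSkene
from crowdkit.datasets import load_dataset

# Load the crowd annotations (df) and the ground-truth labels (gt).
df, gt = load_dataset('netease_crowd')

# Run Dawid-Skene for 10 EM iterations.
ds = DawidSkene(n_iter=10)
result = ds.fit_predict(df)

# One aggregated label per task.
print(len(result))
# 999799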
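
The snippet above only checks how many labels come back. A natural follow-up is to compare the aggregated labels with the ground truth. The sketch below illustrates that check on made-up toy data in plain Python, so it runs without downloading the dataset. It is not crowd-kit's API: the rows, the `majority_vote` and `accuracy` helpers, and the (task, worker, label) layout are assumptions for illustration, and majority vote stands in for Dawid-Skene.

```python
from collections import Counter, defaultdict

# Toy annotations as (task, worker, label) rows. The values are invented
# for illustration and do not come from NetEaseCrowd.
annotations = [
    ("t1", "w1", "cat"), ("t1", "w2", "cat"), ("t1", "w3", "dog"),
    ("t2", "w1", "dog"), ("t2", "w2", "dog"), ("t2", "w3", "dog"),
    ("t3", "w1", "cat"), ("t3", "w2", "dog"), ("t3", "w3", "dog"),
]

# Hypothetical ground truth, keyed by task.
ground_truth = {"t1": "cat", "t2": "dog", "t3": "cat"}


def majority_vote(rows):
    """Return the most frequent label for each task."""
    votes = defaultdict(Counter)
    for task, _worker, label in rows:
        votes[task][label] += 1
    return {task: counts.most_common(1)[0][0] for task, counts in votes.items()}


def accuracy(predicted, truth):
    """Share of tasks with known truth whose predicted label matches it."""
    shared = [task for task in truth if task in predicted]
    if not shared:
        return 0.0
    correct = sum(predicted[task] == truth[task] for task in shared)
    return correct / len(shared)


result = majority_vote(annotations)
score = accuracy(result, ground_truth)
print(result)
print(f"accuracy: {score:.3f}")
```

With the real dataset, the same comparison would use `result` and `gt` from the snippet above, restricted to the tasks that have a ground-truth label.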

shenxiangzhuang commented 8 months ago

> Thank you for a very well-done PR! I noticed a small imperfection in the dataset metadata. Could you please check my suggestion?

Thanks a lot for your careful review!

dustalov commented 8 months ago

Great job, thank you again!