PaddlePaddle / Paddle

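PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 PaddlePaddle: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)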
http://www.paddlepaddle.org/
Apache License 2.0

Chinese datasets for Paddle demos #981

Closed reyoung closed 7 years ago

reyoung commented 7 years ago

Related #176

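To build better Paddle demos and tutorials, we need Chinese datasets. We can obtain them either by annotating data ourselves or by finding public datasets.

Possible Chinese datasets include: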

llxxxll commented 7 years ago

UCI datasets: http://archive.ics.uci.edu/ml/index.html
Kaggle datasets: https://www.kaggle.com/datasets

Two dataset references contributed by @beckett1124.

llxxxll commented 7 years ago

Image classification:
ImageNet: http://image-net.org/challenges/LSVRC/2016/
CIFAR-10 and CIFAR-100: https://www.cs.toronto.edu/~kriz/cifar.html
MNIST: http://yann.lecun.com/exdb/mnist/
SVHN: http://ufldl.stanford.edu/housenumbers/
CUB-200: http://www.vision.caltech.edu/visipedia/CUB-200.html

Object detection:
ImageNet: http://image-net.org/challenges/LSVRC/2016/
PASCAL VOC: http://host.robots.ox.ac.uk/pascal/VOC/
KITTI: http://www.cvlibs.net/datasets/kitti/
MS-COCO: http://mscoco.org/dataset/

Segmentation:
ImageNet: http://image-net.org/challenges/LSVRC/2016/
PASCAL VOC: http://host.robots.ox.ac.uk/pascal/VOC/
KITTI: http://www.cvlibs.net/datasets/kitti/
MS-COCO: http://mscoco.org/dataset/
Cityscapes: https://www.cityscapes-dataset.com/
PASCAL-Part: http://www.stat.ucla.edu/~xianjie.chen/pascal_part_dataset/pascal_part.html
PASCAL-Context: http://www.cs.stanford.edu/~roozbeh/pascal-context/
CamVid: http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/

Image caption:
MS-COCO: http://mscoco.org/dataset/
Flickr 8K: http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/KCCA.html
Flickr 30k: http://shannon.cs.illinois.edu/DenotationGraph/
IAPR TC-12: http://imageclef.org/photodata

Question answering:
DAQUAR: http://www.cs.toronto.edu/~mren/imageqa/results/
COCO-QA: http://www.cs.toronto.edu/~mren/imageqa/data/cocoqa/
Visual Genome: https://visualgenome.org/

Saliency:
MIT300: http://saliency.mit.edu/results_mit300.html
CAT2000: http://saliency.mit.edu/results_cat2000.html
MSRA10K: http://mmcheng.net/msra10k/
ECSSD: http://www.cse.cuhk.edu.hk/leojia/projects/hsaliency/dataset.html

Video summarization:
SumMe: https://people.ee.ethz.ch/~gyglim/vsum/#benchmark
TVSum: https://github.com/yalesong/tvsum

Dataset references contributed by @hohdiy.

pengli09 commented 7 years ago

Chinese cloze-style reading comprehension dataset: https://github.com/ymcui/Chinese-RC-Dataset

Zrachel commented 7 years ago

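Regarding the Chinese image captioning data that @reyoung mentioned above: no Chinese data exists for that task. Chinese data does exist for visual question answering, though; see http://idl.baidu.com/FM-IQA.html

In addition, we still need:
A Chinese speech recognition corpus (THCHS-30: A Free Chinese Speech Corpus looks usable, pending investigation)
A Chinese text corpus (similar to the 1 Billion Word Language Model Benchmark)
Chinese-English translation data (similar to WMT)
Chinese sequence labeling data (similar to CoNLL-2005 and CoNLL-2012)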

luotao1 commented 7 years ago

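@llxxxll The reply from @Zrachel already mentions that we need a Chinese-English translation dataset. The WMT French-English translation dataset consists mainly of news text, and its training set contains more than 12 million parallel sentence pairs. Also, in @lcy-seso's experience, it is hard to train a reasonably good Chinese-English translation model with fewer than 1 million parallel sentence pairs.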

livc commented 7 years ago

THUOCL (Tsinghua University Open Chinese Lexicon) was open-sourced recently; sharing it here for reference.

pengli09 commented 7 years ago

If resources like THUOCL are usable, there are several more at http://thunlp.org/site2/index.php/en/resources

livc commented 7 years ago

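I found a dataset of classical Chinese poetry.

It is described as the most complete database of classical Chinese poetry: nearly 14,000 poets from the Tang and Song dynasties, with close to 55,000 Tang poems plus 260,000 Song poems. https://github.com/jackeyGao/chinese-poetry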

JiayiFeng commented 7 years ago

Closing this inactive issue; please feel free to reopen.