li-clement opened this issue 1 year ago
Dataset Name | Description |
---|---|
Yelp reviews – Polarity Dataset | This subset is drawn from the 1,569,264 samples of the 2015 Yelp Dataset Challenge. Each polarity class contains 280,000 training samples and 19,000 testing samples (560,000 training and 38,000 testing samples in total). A loading sketch for this and the other fast.ai CSV releases in this table appears after the table. The dataset can be found at: https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz |
1 Billion Word Language Model Benchmark R13 Output Dataset | The 1 Billion Word Language Model Benchmark (release R13) is a benchmark corpus for measuring progress in statistical language modeling. With nearly 1 billion words of training data, the benchmark allows rapid evaluation of novel language modeling techniques and comparison of their contributions against other state-of-the-art techniques. It was released in 2013 by Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson (Google and collaborators); the related paper is "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling". The dataset can be found at: http://www.statmt.org/lm-benchmark/ |
WikiText Long Term Dependency Language Modeling Dataset | The WikiText Long Term Dependency Language Modeling Dataset is a collection of English corpora of over 100 million tokens extracted from verified Good and Featured articles on Wikipedia. It comes in two versions, WikiText-2 and WikiText-103, which are roughly 2 times and 110 times larger, respectively, than the well-known Penn Treebank (PTB) corpus. Because the dataset is composed of full articles rather than shuffled sentences, it is well suited to models that exploit long-term dependencies. The dataset was released in 2016 by Salesforce Research, primarily by Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. The related paper is "Pointer Sentinel Mixture Models". The dataset can be found at: https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/ |
Amazon reviews – Polarity Dataset | The Amazon reviews – Polarity dataset is drawn from 34,686,770 reviews by 6,643,669 Amazon users on 2,441,053 products. It is a subset of the Amazon reviews – Full dataset, and the underlying reviews come from the Stanford Network Analysis Project (SNAP). Each polarity class contains 1,800,000 training samples and 200,000 testing samples. The related paper is "Hidden factors and hidden topics: understanding rating dimensions with review text" (2013). The dataset can be found at: https://s3.amazonaws.com/fast-ai-nlp/amazon_review_polarity_csv.tgz |
Social Computing Data Repository | The Social Computing Data Repository is a collection of social network structure data gathered from social networking websites such as BlogCatalog, Buzznet, Delicious, Douban, Flickr, Flixster, Foursquare, Friendster, Hyves, Last.fm, Livemocha, Twitter, and YouTube. It was released in 2009 by Arizona State University. The related paper is "Social Computing Data Repository at ASU". The dataset can be found at: http://socialcomputing.asu.edu/ |
ADE20K Scene Parsing Dataset | The ADE20K dataset is a large dataset used for scene parsing. It contains 150 object categories and was released and maintained by the MIT CSAIL research group in 2017. It can be used for scene understanding, parsing, segmentation, multi-object recognition, and semantic understanding. The relevant papers are "Scene Parsing through ADE20K Dataset" and "Semantic Understanding of Scenes through ADE20K Dataset". The dataset can be found at: http://groups.csail.mit.edu/vision/datasets/ADE20K/ |
Pascal VOC 2012 | The dataset was released by the PASCAL VOC project team in 2012 and incorporates the results of the previous PASCAL VOC challenges. The PASCAL VOC (Visual Object Classes) Challenge was a world-class computer vision competition run under the EU-funded PASCAL Network of Excellence. It was organized by Mark Everingham (University of Leeds), Luc van Gool (ETH Zurich), Chris Williams (University of Edinburgh), John Winn (Microsoft Research Cambridge), and Andrew Zisserman (University of Oxford), and was held annually from 2005 until it officially ended with the 2012 edition. The challenge covered image classification, object detection, object segmentation, human pose estimation, and action recognition. After the final 2012 challenge, the dataset included a training set of 11,540 images for object classification and detection and a training set of 2,913 images for object segmentation. The dataset can be found at: host.robots.ox.ac.uk |
MSMARCO Machine Reading Comprehension Dataset | The MSMARCO dataset is a human-generated machine reading comprehension dataset consisting of 1,010,916 anonymized questions sampled from Bing search query logs. Each question has a human-generated answer, and 182,669 of the answers were completely rewritten by humans. In addition, the dataset includes 8,841,823 passages extracted from 3,563,535 web documents retrieved by Bing. The MSMARCO dataset was initially released by Microsoft in 2016 and last updated in 2018, and it has associated ranking competitions. The dataset can be found at: http://www.msmarco.org/dataset.aspx |
IMDB Large Movie Review Dataset | The IMDB Large Movie Review Dataset is a dataset for binary sentiment classification, intended to serve as a benchmark for the task. It includes 25,000 movie reviews for training and 25,000 for testing, with the reviews exhibiting highly polarized sentiment. Additionally, the dataset contains 50,000 unlabeled reviews for unsupervised use. The dataset was released by Stanford University in 2011; the related paper is "Learning Word Vectors for Sentiment Analysis". The dataset can be found at: https://s3.amazonaws.com/fast-ai-nlp/imdb.tgz |
pyspark-wordcount | Example dataset for the classic Spark word-count job; see the PySpark sketch after the table. |
Amazon Fine Food Reviews Dataset | The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon, comprising 568,454 food reviews posted on Amazon's website up to October 2012. The dataset includes user information, review text, the reviewed food item, and food ratings, among other fields. It was published on Kaggle in 2013; the related paper is "From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews". The dataset can be found at: https://www.kaggle.com/snap/amazon-fine-food-reviews |
Amazon Reviews - Full Dataset | The Amazon Reviews - Full dataset consists of 34,686,770 reviews by 6,643,669 Amazon users on 2,441,053 products, primarily sourced from the Stanford Network Analysis Project (SNAP). Each rating class contains 600,000 training samples and 130,000 testing samples. The related paper is "Hidden factors and hidden topics: understanding rating dimensions with review text" (2013). The dataset can be found at: https://s3.amazonaws.com/fast-ai-nlp/amazon_review_full_csv.tgz |
MACHINE TRANSLATION (WMT16) | WMT16 translation task data from English to German. The training data is a combination of Europarl v7, Common Crawl, and News Commentary v11. The development sets are the newstest sets from 2010 through 2015; newstest2016 should be used as the test data. All SGM files have been converted to plain text (a parallel-corpus reading sketch appears after the table). The dataset can be found at: http://www.statmt.org/wmt16/ |
Youtube-8m | The YouTube-8M segment-level annotations were released in June 2019. They cover approximately 237K video segments with human-verified labels across 1,000 classes, collected from the validation set of the YouTube-8M dataset. Each video segment has time-localized frame-level features, allowing classifier predictions at the segment level. The dataset can be found at: https://research.google.com/youtube8m/download.html |
MovieLens Dataset Movie Recommendation Dataset | The MovieLens 20M dataset contains 20,000,263 ratings (5-star scale) of 27,278 movies by 138,493 users, along with 465,564 movie tag applications; the data was collected from the website movielens.umn.edu between January 1995 and March 2015. The older MovieLens 100K dataset consists of 100,000 ratings (1-5 scale) of 1,682 movies by 943 users, collected between September 1997 and April 1998, and was published in 1998 by GroupLens (Department of Computer Science and Engineering, University of Minnesota). A loading sketch appears after the table. The dataset can be found at: https://grouplens.org/ |
Yahoo! Answers Q&A Dataset | The Yahoo! Answers dataset consists of the 10 largest categories from the Yahoo! Answers Comprehensive Questions and Answers 1.0 corpus. Each category contains 140,000 training samples and 5,000 testing samples. The Yahoo! Answers Comprehensive Questions and Answers 1.0 corpus is a snapshot of Yahoo! Answers as of October 25, 2007, containing 4,483,032 questions and their answers. In addition to the question and answer text, the corpus includes a small amount of metadata, such as which answer was selected as the best and the category and subcategory assigned to each question. The dataset can be found at: https://s3.amazonaws.com/fast-ai-nlp/yahoo_answers_csv.tgz |
MS COCO | The COCO dataset is a large-scale image dataset designed for object detection, segmentation, human keypoint detection, stuff segmentation, and caption generation in computer vision. COCO focuses on scene understanding: its images are drawn from complex everyday scenes, with precise object localization via segmentation. The dataset's headline features are object segmentation, recognition in context, superpixel stuff segmentation, 330,000 images (over 200,000 labeled), 1.5 million object instances, 80 object categories, 91 stuff categories, and 250,000 people with keypoints. COCO was released by Microsoft in 2014 and has since become a standard benchmark for image captioning. A pycocotools browsing sketch appears after the table. The dataset can be found at: cocodataset.org |
European Parliament Proceedings Parallel Corpus 1996-2011 Statistical Machine Translation Corpus | The European Parliament Proceedings Parallel Corpus 1996-2011 (Europarl) is a statistical machine translation corpus extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: the Romance languages (French, Italian, Spanish, Portuguese, Romanian), the Germanic languages (English, Dutch, German, Danish, Swedish), the Slavic languages (Bulgarian, Czech, Polish, Slovak, Slovenian), the Finno-Ugric languages (Finnish, Hungarian, Estonian), the Baltic languages (Latvian, Lithuanian), and Greek. The corpus was initially released in 2005 by the School of Informatics at the University of Edinburgh, with Philipp Koehn as the main publisher; the 7th version was released in 2012. The related paper is "Europarl: A Parallel Corpus for Statistical Machine Translation". The dataset can be found at: http://www.statmt.org/europarl/ |
SynthText Natural Scene Image Dataset | The SynthText dataset is composed of natural scene images containing words, mainly used for text detection in natural scenes. It consists of 800,000 images with approximately 8 million synthesized word instances; each text instance is annotated with its text string and word-level and character-level bounding boxes. The dataset was released in 2016 by A. Gupta, A. Vedaldi, and A. Zisserman of the Visual Geometry Group, Department of Engineering Science, University of Oxford, and presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). The dataset can be found at: http://www.robots.ox.ac.uk/~vgg/data/scenetext/ |
Netflix movie rating dataset | The Netflix movie rating dataset is a collection of movie ratings from the Netflix Prize. It consists of over 100 million ratings by 480,000 randomly chosen Netflix customers on over 17,000 movies, covering the period from October 1998 to December 2005. Ratings are on a 5-point scale, from 1 to 5, and customer information has been anonymized. The Netflix Prize aimed to substantially improve the accuracy of movie recommendations based on each person's own tastes; the competition began in 2006, and the grand prize was awarded in 2009. The dataset can be found at: https://www.netflixprize.com/ |
The WMT 2015 French/English parallel texts dataset | The WMT 2015 French/English parallel texts dataset is a collection of parallel French/English texts used for training translation models, consisting of over 20 million sentence pairs. It was created by Chris Callison-Burch, who crawled millions of web pages and used a set of simple heuristics to transform French URLs into English URLs, on the assumption that such document pairs are translations of each other. The dataset was released in 2009 by Johns Hopkins University, the University of Edinburgh, and the University of Amsterdam. The dataset can be found at: https://s3.amazonaws.com/fast-ai-nlp/giga-fren.tgz |
The CORe50 dataset | The CORe50 dataset is a dataset and benchmark for continual object recognition. It is primarily used for evaluating continual learning techniques in object recognition settings, and ships with baseline methods for three different continual learning scenarios. The dataset was released in 2017 by the University of Bologna, with Vincenzo Lomonaco and Davide Maltoni as the main contributors. The related paper is "CORe50: a new Dataset and Benchmark for continual Object Recognition". The dataset can be found at: https://vlomonaco.github.io/core50/ |
word2vec_300 | word2vec_300 is a set of pre-trained 300-dimensional Chinese word vectors, trained on a mixture of Wikipedia and Common Crawl data using fastText; see the loading sketch after the table. |
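
For the fast.ai CSV releases above (Yelp polarity, Amazon polarity and full, Yahoo! Answers), here is a minimal loading sketch with pandas, assuming the archives have been downloaded and extracted locally; the column names are my own, not part of the releases.

```python
import pandas as pd

# Minimal sketch: these CSVs ship without a header row, so the column
# names below are ours. Yelp polarity has two columns (class index, text);
# the Amazon and Yahoo! Answers releases carry extra title/question columns.
def load_text_classification_csv(path, names=("label", "text")):
    df = pd.read_csv(path, header=None, names=list(names))
    # Class indices are 1-based in these releases; shift to 0-based.
    df["label"] = df["label"] - 1
    return df

train = load_text_classification_csv("yelp_review_polarity_csv/train.csv")
test = load_text_classification_csv("yelp_review_polarity_csv/test.csv")
print(len(train), len(test))  # expected: 560000 38000
```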
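For the pyspark-wordcount entry, a minimal sketch of the word-count job the example dataset is meant to exercise; the input path input.txt is illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Tokenize each line on whitespace, pair every word with a count of 1,
# then sum the counts per word.
counts = (
    spark.sparkContext.textFile("input.txt")  # illustrative path
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```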
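For line-aligned parallel corpora such as the WMT16 plain-text files, Europarl, or the giga-fren release, a minimal reading sketch; the file names (train.en, train.de) are illustrative. Line i of the source file is assumed to be the translation of line i of the target file.

```python
def read_parallel(src_path, tgt_path):
    """Yield aligned (source, target) sentence pairs from line-aligned files."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for s, t in zip(src, tgt):
            yield s.strip(), t.strip()

# Illustrative file names; real WMT/Europarl files use a <stem>.<lang> pattern.
for en, de in read_parallel("train.en", "train.de"):
    print(en, "=>", de)
    break  # just show the first pair
```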
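For MovieLens, a minimal sketch for reading the ratings files with pandas, assuming the archives have been extracted; the ratings.csv header and the tab-separated u.data layout follow the ML-20M and ML-100K releases respectively.

```python
import pandas as pd

# ML-20M ships ratings.csv with a userId,movieId,rating,timestamp header.
ratings_20m = pd.read_csv("ml-20m/ratings.csv")
print(ratings_20m.shape)  # expected: (20000263, 4)

# The older 100K set ships tab-separated u.data without a header row.
ratings_100k = pd.read_csv(
    "ml-100k/u.data",
    sep="\t",
    names=["user_id", "item_id", "rating", "timestamp"],
)
print(ratings_100k.shape)  # expected: (100000, 4)
```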
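For MS COCO, a minimal browsing sketch using the official pycocotools API; the annotation path assumes the standard release layout and should be adjusted to a local copy.

```python
from pycocotools.coco import COCO

# Adjust the path to wherever the annotation JSON was extracted.
coco = COCO("annotations/instances_train2017.json")

cat_ids = coco.getCatIds()
print(len(cat_ids))  # expected: 80 object categories

img_ids = coco.getImgIds()
img = coco.loadImgs(img_ids[0])[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img["id"]))
print(img["file_name"], len(anns), "annotated instances")
```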
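For word2vec_300, a minimal loading sketch with gensim, assuming the vectors are distributed in the standard word2vec text format; the file name word2vec_300.txt is hypothetical.

```python
from gensim.models import KeyedVectors

# Hypothetical file name; assumes the standard word2vec text format.
vectors = KeyedVectors.load_word2vec_format("word2vec_300.txt", binary=False)
print(vectors.vector_size)  # expected: 300
# Nearest neighbours of a Chinese word ("Beijing").
print(vectors.most_similar("北京", topn=5))
```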