Discovering Datasets - GitHub issues

aswanipranjal commented 6 years ago

The repo currently doesn’t have a specific Hindi corpus to work on. We are looking for a corpus that satisfies the following points:

Type of Datasets: We are looking for something that can be used to train the stemmer, lemmatiser, tokenizer, POS tagger and named entity recognizer.
- POS tagger: The corpora should have sentences with the parts of speech or “Shabdo ke prakaar” tagged in them. Read about what POS in Hindi is here and here.
- We will also need a wordnet, something like this: http://www.cfilt.iitb.ac.in/wordnet/webhwn/hindi_examples.php
- Dataset containing stems of words along with different forms in which that word can be represented for the stemmer.
Restrictions on the dataset:
- The dataset should be freely usable without any restrictions, so that the corpus can be open-sourced. It can be under the Open Database License (ODbL) v1.0, which allows free use of the data.
- Or better: http://www.wtfpl.net/

Some of the datasets that can be used might be available with the LTRC committee in IIT-B. This issue is about discovering good Hindi Corpora for this project. Participants and contributors can search for and create PRs adding the datasets and links to the datasets.

Guidelines before sending Pull Requests:

This will be an issue with variable difficulty. You score points depending on the difficulty of the data extraction.
First, post the link to the dataset as a comment on this issue, along with details about the data and its licensing.
Once a mentor approves it, you need to add the dataset to Sangita Data. A mentor will assist you in this task.
You can use alternative methods like web scraping to generate data yourself.

Here is a rough outline of the requirements.

[x] WordNet
[x] Word, Lemma Pairs
[x] Word, POS pairs
[x] Others
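
To make the outline above more concrete, here is a minimal Python sketch of how word/POS and word/lemma pairs could be loaded and grouped. The tab-separated layout, the tag names (`NOUN`, `VERB`) and the sample words are assumptions for illustration only; they are not the format of any dataset mentioned in this issue, and `parse_pairs` and `forms_by_lemma` are hypothetical helpers.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def parse_pairs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'word<TAB>label' lines, skipping blank or malformed rows."""
    pairs = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and all(part.strip() for part in parts):
            pairs.append((parts[0].strip(), parts[1].strip()))
    return pairs


def forms_by_lemma(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group surface forms under their lemma, as a stemmer dataset might."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for word, lemma in pairs:
        grouped[lemma].add(word)
    return dict(grouped)


# Hypothetical sample rows (assumed layout: word<TAB>label).
pos_lines = ["घर\tNOUN", "जाता\tVERB", "", "row without a tab"]
lemma_lines = ["घरों\tघर", "जाता\tजाना", "जाती\tजाना"]

pos_pairs = parse_pairs(pos_lines)
lemma_groups = forms_by_lemma(parse_pairs(lemma_lines))

for lemma, forms in lemma_groups.items():
    print(lemma, sorted(forms))
```

Grouping forms under a lemma gives the "stems along with different forms" shape asked for above, while the plain pair list is what a POS tagger would train on.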

uemarica commented 6 years ago

Hello! I found datasets at: http://wortschatz.uni-leipzig.de/en/download/ -> All languages -> search for hindi. It has a set of files which can be used for free.

MansiBreja commented 6 years ago

Hey! The link to the dataset is: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-6260-A# The licensing information is at: https://creativecommons.org/licenses/by-nc-sa/3.0/ The third file downloadable from the link is described as: HindMonoCorp 0.5, segmented and tokenized, with automatic morphological tags and lemmas.

ighosh98 commented 6 years ago

Links to a few Hindi Datasets that I found:

AnkitaMohanty commented 6 years ago

Hello! Here is the link to the Hindi dataset that I found: http://homepages.inf.ed.ac.uk/miles/babel.html

MansiBreja commented 6 years ago

Another link I found: http://lrec2016.lrec-conf.org/en/shared-lrs/ The description can be found by searching for the word "hindi" on this page. Link to the dataset: http://lrec2016.lrec-conf.org/sharedlrs2016/698_res_1.xml The license is listed as "open source", but the description reads "sentiment analysis", so I'm not sure whether this is useful. A screenshot of the description is attached.

djokester commented 6 years ago

@uemarica and @MansiBreja kindly download the datasets and let me know what they contain. @AnkitaMohanty a parallel corpus won't be useful now, but it would be used for Machine Translation later. @ighosh98 LTRC and CFILT have already been discussed.

uemarica commented 6 years ago

The datasets contain hundreds of thousands of Hindi words, phrases and sentences from news, the web and Wikipedia.

aswanipranjal commented 6 years ago

@uemarica can you elaborate a bit? Maybe provide a sample of the dataset? Does the dataset cover the items in the outline provided above in the issue?
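
A sample like the one requested here could be pulled from a downloaded file with a few lines of Python. This is only a sketch: `preview` is a hypothetical helper, and the demo file with an `id<TAB>sentence` layout is made up for illustration, not taken from any of the datasets linked in this thread.

```python
import itertools
import os
import tempfile
from typing import List


def preview(path: str, n: int = 5, encoding: str = "utf-8") -> List[str]:
    """Return the first n non-empty lines of a text file."""
    with open(path, encoding=encoding) as handle:
        stripped = (line.rstrip("\n") for line in handle)
        non_empty = (line for line in stripped if line.strip())
        return list(itertools.islice(non_empty, n))


# Demo on a temporary file; the id<TAB>sentence layout is an assumption.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("1\tयह एक वाक्य है।\n\n2\tयह दूसरा वाक्य है।\n3\tतीसरा वाक्य।\n")
    sample_path = tmp.name

sample = preview(sample_path, n=2)
os.remove(sample_path)

for line in sample:
    print(line)
```

Pasting the first few lines this way makes it easy to see whether a file carries POS tags, lemmas, or only raw sentences.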

djokester commented 6 years ago

@mayankk98 I need an update on whether or not we can progress with this.

mayankk98 commented 6 years ago

We can progress with it using other available datasets if possible, considering I'm having issues with the Hindi WordNet.

djokester commented 6 years ago

I am closing that other issue for now. Want a new task? @mayankk98

djokester commented 6 years ago

WordNet and HindMonoCorp are on hold. They will get their own issues on SangitaNLP later on. WordNet will be reopened in the future.

SangitaNLP / sangita

Discovering Datasets #8