SangitaNLP / sangita

A Natural Language Toolkit for Indian Languages
Apache License 2.0

Discovering Datasets #8

aswanipranjal closed this issue 6 years ago

aswanipranjal commented 6 years ago

The repo currently doesn't have a specific Hindi corpus to work on. We are looking for a corpus that satisfies the following points:

Some of the datasets that can be used might be available from the LTRC committee at IIT-B. This issue is about discovering good Hindi corpora for this project. Participants and contributors can search for datasets and create PRs that add them along with links to their sources.

Guidelines before sending Pull Requests:

Here is a rough outline of the requirements.

uemarica commented 6 years ago

Hello! I found datasets at http://wortschatz.uni-leipzig.de/en/download/ (go to "All languages" and search for "hindi"). It offers a set of files that can be used for free.

MansiBreja commented 6 years ago

Hey! The link to the dataset is https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-6260-A# and the licensing information is at https://creativecommons.org/licenses/by-nc-sa/3.0/. The third file downloadable from the link is described as: HindMonoCorp 0.5 segmented and tokenized, with automatic morphological tags and lemmas.

ighosh98 commented 6 years ago

Links to a few Hindi Datasets that I found:

  1. https://ltrc.iiit.ac.in/download.php
  2. http://www.cfilt.iitb.ac.in/Downloads.html
  3. Hindi Treebank: http://ltrc.iiit.ac.in/treebank_H2014/

AnkitaMohanty commented 6 years ago

Hello! Here is the link to the Hindi dataset that I found: http://homepages.inf.ed.ac.uk/miles/babel.html

MansiBreja commented 6 years ago

Another link I found is http://lrec2016.lrec-conf.org/en/shared-lrs/. The description can be found by searching for the word "hindi" on that page. Link to the dataset: http://lrec2016.lrec-conf.org/sharedlrs2016/698_res_1.xml. The license is "open source", but the description reads "sentiment analysis", so I'm not sure whether this is useful. Please find attached a screenshot of the description. (Image: screenshot 177)

djokester commented 6 years ago

@uemarica and @MansiBreja kindly download the datasets and let me know what they contain. @AnkitaMohanty a parallel corpus won't be useful now but will be used for machine translation later. @ighosh98 LTRC and CFILT have already been discussed.

uemarica commented 6 years ago

The datasets contain hundreds of thousands of Hindi words, phrases, and sentences from news, web, and Wikipedia sources.

aswanipranjal commented 6 years ago

@uemarica can you elaborate a bit? Maybe provide a sample of the dataset? Does the dataset meet the requirements outlined above in the issue?
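
One possible way to pull such a sample is sketched below. It assumes the downloaded archive unpacks to a UTF-8 text file with one `id<TAB>sentence` pair per line; that layout, the function name `sample_sentences`, and the filename in the usage comment are assumptions for illustration and should be checked against the actual download.

```python
from itertools import islice
from pathlib import Path


def sample_sentences(path, n=5):
    """Return the text of the first n lines of a corpus file.

    Assumes each line looks like `id<TAB>sentence` (an assumption about
    the download format, not something confirmed in this thread).
    """
    sentences = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in islice(handle, n):
            line = line.rstrip("\n")
            # Keep the text after the first tab; fall back to the whole
            # line if it turns out not to be tab-separated.
            _, sep, text = line.partition("\t")
            sentences.append(text if sep else line)
    return sentences


# Hypothetical usage (the filename is a placeholder):
# for sentence in sample_sentences("hin_news_sentences.txt", n=5):
#     print(sentence)
```

Reading only the first few lines keeps the check cheap even for files with hundreds of thousands of sentences, and the output can be pasted straight into a comment here.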

djokester commented 6 years ago

@mayankk98 I need an update on whether or not we can progress with this.

mayankk98 commented 6 years ago

We can progress with it using other available datasets if possible, considering I'm having issues with Hindi WordNet.

djokester commented 6 years ago

I am closing that other issue for now. Want a new task? @mayankk98

djokester commented 6 years ago

WordNet and HindMonoCorp are on hold. They will get their own issues on SangitaNLP later on. Will reopen WordNet in the future.