aswanipranjal closed 6 years ago
Hello! I found datasets at: http://wortschatz.uni-leipzig.de/en/download/ -> All languages -> search for "Hindi". It has a set of files that can be used for free.
Hey! The link to the dataset is: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-6260-A#
The licensing information is here: https://creativecommons.org/licenses/by-nc-sa/3.0/
The third file downloadable from the link is described as: HindMonoCorp 0.5 segmented and tokenized, with automatic morphological tags and lemmas
Links to a few Hindi Datasets that I found:
Hello! Here is the link to the Hindi dataset that I found: http://homepages.inf.ed.ac.uk/miles/babel.html
Another link I found: http://lrec2016.lrec-conf.org/en/shared-lrs/ The description can be found by searching for the word "hindi" on this page. Link to the dataset: http://lrec2016.lrec-conf.org/sharedlrs2016/698_res_1.xml The license is "open source", but the description reads "sentiment analysis", so I'm not sure whether this is useful. Please find attached a screenshot of the description.
@uemarica and @MansiBreja kindly download the datasets and let me know what they contain. @AnkitaMohanty a parallel corpus won't be useful now but will be used for Machine Translation later. @ighosh98 LTRC and CFILT have already been discussed.
The datasets contain hundreds of thousands of Hindi words, phrases, and sentences from news, the web, and Wikipedia.
@uemarica can you elaborate a bit? Maybe provide a sample of the dataset? Does the dataset contain the outline that has been provided above in the issue?
@mayankk98 I need an update on whether or not we can progress with this.
We can progress with it using other available datasets, if possible, since I'm having issues with Hindi WordNet.
I am closing that other issue for now. Want a new task? @mayankk98
WordNet and HindMonoCorp put on hold. They have their individual issues on SangitaNLP later on. Will reopen WordNet in the future.
The repo currently doesn’t have a specific Hindi corpus to work on. We are looking for a corpus that satisfies the following points:
Some of the datasets that can be used might be available with the LTRC committee in IIT-B. This issue is about discovering good Hindi Corpora for this project. Participants and contributors can search for and create PRs adding the datasets and links to the datasets.
Guidelines before sending Pull Requests:
Here is a rough outline of the requirements.
[x] WordNet
[x] Word, Lemma Pairs
[x] Word, POS pairs
[x] Others
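When reporting what a downloaded dataset contains, a quick check against the outline above can help. The following is only a rough sketch: it assumes a hypothetical layout in which each whitespace-separated token is written as `word|lemma|tag`, which may not match the actual format of HindMonoCorp or any other corpus mentioned in this thread. The function names and the sample line are made up for illustration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def parse_tagged_line(line: str, sep: str = "|") -> List[Triple]:
    """Split one tokenized line into (word, lemma, tag) triples.

    ASSUMPTION: every token looks like word|lemma|tag. Tokens that do
    not have exactly three fields are skipped rather than guessed at.
    """
    triples: List[Triple] = []
    for token in line.split():
        parts = token.split(sep)
        if len(parts) != 3:
            continue
        word, lemma, tag = parts
        triples.append((word, lemma, tag))
    return triples


def outline_pairs(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Build the two pair lists named in the requirements outline."""
    return {
        "word_lemma_pairs": [(w, l) for w, l, _ in triples],
        "word_pos_pairs": [(w, t) for w, _, t in triples],
    }


# Hypothetical sample line, not taken from any real dataset.
sample = "लड़के|लड़का|NN खेलते|खेल|VM हैं|है|VAUX"
pairs = outline_pairs(parse_tagged_line(sample))
print(pairs["word_lemma_pairs"])
print(pairs["word_pos_pairs"])
```

If a dataset yields non-empty lists for both keys on a few sample lines, it likely covers the "Word, Lemma Pairs" and "Word, POS pairs" items; the separator and field order would need adjusting to the real file format.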