SangitaNLP / sangita

A Natural Language Toolkit for Indian Languages
Apache License 2.0

Extraction BenLem dataset #16

Closed (djokester closed this issue 3 years ago)

djokester commented 6 years ago

Extract (word, lemma) pairs from the BenLem dataset. Citation: A. Chakrabarty and U. Garain (2015): BenLem (a Bengali Lemmatizer) and its Role in WSD, in ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP).

Sai-Adarsh commented 5 years ago

@djokester I'm interested. Could you help me with it?

djokester commented 5 years ago

@Sai-Adarsh Sure. Go to the dataset at the link above and extract the zipped folder. Inside it you will find a folder titled lematisation_dataset, which holds many text files listing each word, its part of speech, and its lemma. Extract the word and lemma pairs for words only (no numbers or symbols). For nouns (tagged NNP, NN, or any other noun tag) that have EXCLUDED given as the lemma, store the original word as the lemma. Store the results from all the files in a single CSV file and share it with me.
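
The steps above could be sketched roughly as follows. This is only a starting point and rests on several assumptions the thread does not confirm: that each line in the text files holds three whitespace-separated fields (word, POS tag, lemma), that the files are UTF-8 with a `.txt` extension, and that every noun tag starts with `N`. The thread also does not say what to do with non-noun rows marked EXCLUDED, so the sketch skips them. The function names and the output file name are hypothetical.

```python
import csv
import unicodedata
from pathlib import Path


def is_word(token):
    """Return True if the token has only letters and combining marks.

    str.isalpha() is not used because it rejects Bengali words that
    contain vowel signs, which Unicode classes as marks, not letters.
    """
    return bool(token) and all(
        unicodedata.category(ch)[0] in ("L", "M") for ch in token
    )


def extract_pairs(lines):
    """Yield (word, lemma) pairs.

    ASSUMPTION: each line is 'word POS lemma', separated by whitespace.
    Lines with any other number of fields are skipped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue
        word, pos, lemma = fields
        if not is_word(word):
            continue  # drop numbers and symbols
        if lemma == "EXCLUDED":
            # ASSUMPTION: noun tags start with "N" (NN, NNP, ...).
            # Non-noun EXCLUDED rows are skipped; the thread does not
            # say how to handle them.
            if not pos.startswith("N"):
                continue
            lemma = word
        yield word, lemma


def write_csv(dataset_dir, out_path):
    """Collect pairs from every .txt file in dataset_dir into one CSV."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["word", "lemma"])
        for path in sorted(Path(dataset_dir).glob("*.txt")):
            with open(path, encoding="utf-8") as f:
                writer.writerows(extract_pairs(f))
```

With the dataset unpacked, a call such as `write_csv("lematisation_dataset", "benlem_pairs.csv")` would produce the single CSV. It is worth opening one of the text files first to check the real column layout and adjusting `extract_pairs` if it differs.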