cdipaolo / sentiment

Simple Sentiment Analysis in Golang
MIT License
269 stars, 43 forks

Support other languages #1

Closed: pengux closed this issue 9 years ago

pengux commented 9 years ago

Could you provide an example of how to train the library for sentiment analysis in other languages?

cdipaolo commented 9 years ago

Yeah, for sure. Currently the library pulls from the IMDb review dataset, which means that the words in the map are all English. The functions in data.go expect the format of that dataset (which you can see in /datasets/train). If you want to swap in your own data in a drop-in manner, you'd need a dataset that's labelled with pos and neg folders like the current one, where each document/sentence is in its own file.

Because this was built as a drop-in solution, it isn't very modular with regard to the data source, so the best plan of action is: fork the repo and work on your own branch, read through how I add the data to the map[string]Word I use to hold the model info, change data.go to pull from your dataset/format, then train on that data and save the model into the project dir (use "." as the path, i.e. Train(".")).
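To give a feel for what "adding the data to the map" could look like, here is a minimal sketch that counts word occurrences per label. The Word struct below is a hypothetical stand-in with assumed fields (PosCount, NegCount); the library's real Word type and its tokenization may well differ, and tokenize and addDocument are names invented for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Word is a hypothetical stand-in for the library's per-word model entry;
// the real struct may hold different fields.
type Word struct {
	PosCount int
	NegCount int
}

// tokenize lowercases the text and splits it on anything that is not a
// letter or an apostrophe, so accented characters stay inside tokens.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && r != '\''
	})
}

// addDocument updates the per-word counts for one labelled document.
func addDocument(model map[string]Word, text string, positive bool) {
	for _, tok := range tokenize(text) {
		w := model[tok] // zero value if the word is new
		if positive {
			w.PosCount++
		} else {
			w.NegCount++
		}
		model[tok] = w
	}
}

func main() {
	model := make(map[string]Word)
	addDocument(model, "Me encantó la película, muy buena.", true)
	addDocument(model, "La película es muy mala.", false)
	fmt.Printf("%+v\n", model["película"])
}
```

Only the tokenizer and the data source should need to change for another language; the counting itself is language-agnostic.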

If you then want to be able to restore the model as easily as you can now, all you need to do is install go-bindata and run it on the words.json model you just made. The model should then work from your data.
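go-bindata embeds a file in the binary as raw bytes, so restoring comes down to decoding those bytes back into the model map. Here is a minimal sketch of that round trip using encoding/json. The Word fields, the JSON keys, and the encodeModel/decodeModel helpers are assumptions for illustration; the actual layout of words.json may be different.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Word is a hypothetical model entry; the field names and JSON keys are
// assumptions, not the library's actual words.json layout.
type Word struct {
	PosCount int `json:"pos"`
	NegCount int `json:"neg"`
}

// encodeModel serializes the model to the bytes that would be written to
// words.json after training.
func encodeModel(model map[string]Word) ([]byte, error) {
	return json.Marshal(model)
}

// decodeModel rebuilds the model from bytes, e.g. the bytes of an embedded
// words.json asset.
func decodeModel(data []byte) (map[string]Word, error) {
	model := make(map[string]Word)
	if err := json.Unmarshal(data, &model); err != nil {
		return nil, err
	}
	return model, nil
}

func main() {
	trained := map[string]Word{"buena": {PosCount: 3, NegCount: 1}}

	data, err := encodeModel(trained)
	if err != nil {
		panic(err)
	}

	restored, err := decodeModel(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s -> %+v\n", data, restored["buena"])
}
```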


I'm sorry it isn't made to be easily extendable. It was a design choice, so that I could offer a simple, drop-in solution that needs no configuration from the end user. I will be moving the model upstream into my machine learning package, goml, which is designed to be modular and would fit this goal better. I'll write back in this issue when I move it upstream. That would be your easiest and best bet!

Thanks for asking! Conner

PS: if you wanted to have multiple languages set up at the same time, you would need a separate classifier to tell which language was used. When I move Naive Bayes into goml you'll be able to have multiple classes with one model (this one is currently binary, though multiclass should be relatively simple to set up), so that will be even easier.

cdipaolo commented 9 years ago

I've implemented multiclass Naive Bayes in a much more modular manner in my goml library. You would still need to find a sentiment-labelled corpus in your language. Take a look at the examples and docs there if you want to see usage.

Let me know if you have any questions.

pengux commented 9 years ago

Thanks, I'll try to add another language and contribute back.

cdipaolo commented 9 years ago

I might refactor this package to use goml and wrap that package, so I'll let you know what I'm going to do. If I do modularize it, it should be this week.

On Tue, Aug 4, 2015 at 10:56 PM, Peter Nguyen wrote:

Thanks, I'll try to add another language and contribute back.


cdipaolo commented 9 years ago

I'm going to implement Spanish with this Twitter dataset. While I'm waiting for a reply from the conference about access to the dataset, I'm going to start a new branch to modularize the platform into a multi-language structure.

cdipaolo commented 9 years ago

BTW, as of yesterday the package has been refactored, so to implement another language you just need to create a CODE.go file (e.g. en.go) that builds the model for that language from its dataset and adds it to the language-model map. Add that function to the Train method in init.go, write and run the training during testing, then run go-bindata to bootstrap the Restore() method.
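The shape of that structure could look roughly like the sketch below: one training function per language code, collected into a language-model map. Everything here (Model, Models, trainEnglish, trainSpanish, trainers, trainAll) is a hypothetical name chosen for illustration; the real types and the Train method in init.go may be organized differently.

```go
package main

import (
	"fmt"
	"sort"
)

// Model is a hypothetical per-language model mapping a word to a score.
type Model map[string]int

// Models is the language-model map, keyed by language code.
type Models map[string]Model

// trainEnglish would live in en.go and build the model from its dataset;
// here it returns a toy model so the sketch runs on its own.
func trainEnglish() (Model, error) {
	return Model{"great": 1, "awful": -1}, nil
}

// trainSpanish would live in es.go.
func trainSpanish() (Model, error) {
	return Model{"buena": 1, "mala": -1}, nil
}

// trainers lists one training function per language; adding a language
// means adding one entry here.
var trainers = map[string]func() (Model, error){
	"en": trainEnglish,
	"es": trainSpanish,
}

// trainAll runs every registered trainer and collects the results.
func trainAll() (Models, error) {
	models := make(Models, len(trainers))
	for code, train := range trainers {
		m, err := train()
		if err != nil {
			return nil, fmt.Errorf("training %q: %w", code, err)
		}
		models[code] = m
	}
	return models, nil
}

func main() {
	models, err := trainAll()
	if err != nil {
		panic(err)
	}
	codes := make([]string, 0, len(models))
	for code := range models {
		codes = append(codes, code)
	}
	sort.Strings(codes) // map iteration order is random, so sort for display
	fmt.Println(codes)
}
```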

Let me know if you have any questions.

pengux commented 9 years ago

:+1: That's great!