amueller / kaggle_insults

Kaggle Submission for "Detecting Insults in Social Commentary"

Update train.py #1

Closed: prhbrt closed this issue 9 years ago

prhbrt commented 9 years ago

I would think auc_score was renamed to roc_auc_score, but couldn't find any proof of it.

amueller commented 9 years ago

True. A lot of this could be simplified now actually.
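
For anyone hitting the same import error in train.py, the rename confirmed above can be handled with a small compatibility lookup. This is only a sketch: `resolve_auc_scorer` is a hypothetical helper name, not something from the repository, and it assumes scikit-learn is installed and exposes the metric under one of the two names mentioned in this thread.

```python
import importlib


def resolve_auc_scorer():
    """Return scikit-learn's ROC AUC function under whichever name exists.

    Hypothetical helper: tries the newer name first, then the older one.
    """
    metrics = importlib.import_module("sklearn.metrics")
    for name in ("roc_auc_score", "auc_score"):
        scorer = getattr(metrics, name, None)
        if scorer is not None:
            return scorer
    raise ImportError("sklearn.metrics has neither roc_auc_score nor auc_score")
```

Usage would look like `auc = resolve_auc_scorer()` followed by `auc(y_true, y_scores)`, so the rest of the script does not need to care which name the installed version provides.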

prhbrt commented 9 years ago

I'm currently reading about convolutional neural networks for a similar text-classification task, in my case also insult detection. The fun thing about this is that the convolutional layer can learn correlations between neighbouring words, and is therefore more likely to recognize negations. On the other hand, I'm afraid of overfitting when there's only a limited number of sentences to train on.

amueller commented 9 years ago

Maybe also try LSTMs on word-level? Are you doing character or word-level CNNs?

prhbrt commented 9 years ago

> Maybe also try LSTMs on word-level?

I'd have to look into that. "Long Short-Term Memory" is relatively new to me, but my coworkers should have experience with it. Thanks!

> Are you doing character or word-level CNNs?

Both. It's called charSCNN: first a convolutional layer detects local correlations at the character level, and later another convolutional layer detects local correlations at the word level (using as input the concatenation of a feature vector for the word and the output of the first convolutional layer). Here's the paper: http://www.aclweb.org/anthology/C14-1008
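
To make the two-stage structure described above concrete, here is a rough pure-Python sketch of the forward pass only. It is not the paper's or anyone's actual implementation: the weights are random and untrained, and all names and sizes (`CHAR_DIM`, `CHAR_FEATS`, `WORD_DIM`, `SENT_FEATS`, `WINDOW`, `conv_max`, and so on) are made up for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical sizes, chosen only to keep the example small.
CHAR_DIM, CHAR_FEATS = 5, 8    # character embedding size, char-conv outputs
WORD_DIM, SENT_FEATS = 10, 12  # word embedding size, word-conv outputs
WINDOW = 3                     # convolution window at both levels


def rand_vec(n):
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


def rand_matrix(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]


def conv_max(seq, weights, bias, window):
    """Apply a linear map to every window of vectors, then take the
    element-wise maximum over all window positions."""
    dim = len(seq[0])
    pad = [[0.0] * dim] * (window // 2)  # zero-pad both ends
    padded = pad + seq + pad
    outputs = []
    for i in range(len(padded) - window + 1):
        flat = [x for vec in padded[i:i + window] for x in vec]
        outputs.append([sum(w * x for w, x in zip(row, flat)) + b
                        for row, b in zip(weights, bias)])
    return [max(column) for column in zip(*outputs)]


char_emb, word_emb = {}, {}  # embeddings are created lazily, untrained


def embed(table, key, dim):
    if key not in table:
        table[key] = rand_vec(dim)
    return table[key]


W_char, b_char = rand_matrix(CHAR_FEATS, WINDOW * CHAR_DIM), rand_vec(CHAR_FEATS)
W_word = rand_matrix(SENT_FEATS, WINDOW * (WORD_DIM + CHAR_FEATS))
b_word = rand_vec(SENT_FEATS)


def word_vector(word):
    """Word embedding concatenated with character-level conv features."""
    chars = [embed(char_emb, c, CHAR_DIM) for c in word]
    char_feats = conv_max(chars, W_char, b_char, WINDOW)
    return embed(word_emb, word, WORD_DIM) + char_feats  # list concatenation


def sentence_vector(sentence):
    """Second convolution, over the combined word representations."""
    words = [word_vector(w) for w in sentence.lower().split()]
    return conv_max(words, W_word, b_word, WINDOW)
```

Calling `sentence_vector("you are not stupid")` returns a fixed-size feature vector regardless of sentence length; a real model would feed that into a classifier and learn all the weights by backpropagation.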

amueller commented 9 years ago

Ah, I haven't seen that one.

prhbrt commented 9 years ago

A third party (i.e. not one of the authors) implemented the pipeline here: https://github.com/satwantrana/CharSCNN

They report 70% accuracy, which isn't too impressive, I guess.

amueller commented 9 years ago

Well, that depends on the dataset ;)

prhbrt commented 9 years ago

Or possibly the method is prone to overfitting or underfitting :) But of course, for every classifier there's a dataset on which it reaches 100% accuracy :P