You are welcome to try both those modifications and let us know if they actually improved the performance.
One more question before closing the issue: Has word attention been implemented in HAN? I ask this because I see two BiLSTM layers in HAN, but a sentence BiLSTM is used first, followed by a word BiLSTM. Shouldn't it be the reverse, or am I missing something?
That's just me naming the layers in a confusing way. The first layer, "sent_blstm", runs on the word level and encodes each sentence, and then "blstm" runs on the sentence level and encodes the whole document.
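For reference, here is a minimal PyTorch sketch of that hierarchy (the class, layer names, and dimensions are illustrative, not taken from this repository, and the paper's word/sentence attention is replaced with mean pooling for brevity):

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Sketch of HAN's two-level encoding: words -> sentence vectors -> document."""
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=50):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Corresponds to "sent_blstm" above: runs over the words of each sentence.
        self.word_blstm = nn.LSTM(emb_dim, hidden_dim,
                                  bidirectional=True, batch_first=True)
        # Corresponds to "blstm" above: runs over the resulting sentence vectors.
        self.sent_blstm = nn.LSTM(2 * hidden_dim, hidden_dim,
                                  bidirectional=True, batch_first=True)

    def forward(self, docs):
        # docs: (batch, n_sents, n_words) integer word ids
        batch, n_sents, n_words = docs.shape
        words = self.embedding(docs.view(batch * n_sents, n_words))
        word_out, _ = self.word_blstm(words)
        # Pool word states into one vector per sentence
        # (the HAN paper uses word attention here instead of mean pooling).
        sent_vecs = word_out.mean(dim=1).view(batch, n_sents, -1)
        sent_out, _ = self.sent_blstm(sent_vecs)
        return sent_out  # (batch, n_sents, 2 * hidden_dim)

doc = torch.randint(0, 10000, (2, 4, 12))  # 2 docs, 4 sentences, 12 words each
print(HierarchicalEncoder()(doc).shape)    # torch.Size([2, 4, 100])
```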
Thank you for clearing that up.
The HAN paper considers only words that appear more than 5 times; I don't think that is implemented in the code. Also, does stop-word removal take place in the paper? As I mentioned, if stop words appear 5 or more times, then even they would have to be kept. What are your views on this?
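If it helps, a minimal sketch of that frequency filtering might look like the following (the `build_vocab` function and its `stop_words` parameter are my own assumptions; only the more-than-5-occurrences threshold comes from the paper):

```python
from collections import Counter

def build_vocab(tokenized_docs, min_count=5, stop_words=None):
    """Keep words appearing more than `min_count` times (threshold per the
    HAN paper); optional stop-word removal is an assumption, not from the paper."""
    counts = Counter(w for doc in tokenized_docs
                       for sent in doc
                       for w in sent)
    if stop_words:
        for w in stop_words:
            counts.pop(w, None)  # drop stop words regardless of frequency
    vocab = {"<pad>": 0, "<unk>": 1}  # rare words map to <unk>
    for word, count in counts.items():
        if count > min_count:
            vocab[word] = len(vocab)
    return vocab

docs = [[["the", "movie", "was", "great"]],
        [["the", "plot", "was", "thin"]]]
print(build_vocab(docs, min_count=1))  # keeps "the" and "was" only
```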