RandolphVI / Multi-Label-Text-Classification

About Multi-Label Text Classification Based on Neural Network.
Apache License 2.0

data for different model #4

Closed: chensuim closed this issue 4 years ago

chensuim commented 5 years ago

Hello, did you measure evaluation metrics for all of your models, such as precision, recall, and so on? If so, could you please share them?

Thanks a lot, Sui

RandolphVI commented 5 years ago

Hi Sui,

I did measure the evaluation metrics you mentioned, but it has been a while; this repository was created about a year ago.

I didn't record the results at the time. I'm happy to take your request, and I will add the experiment results within a month (a little busy lately, sorry about that 😅).

But if you urgently need to know how all the models perform (maybe you just want to know which model is best), I remember that on my dataset the ranking was: CRNN > SANN/HAN/RCNN > CNN > RNN > ANN > FastText. (Note that my dataset consists almost entirely of Chinese words.)

Hope this helps!

Randolph

chensuim commented 5 years ago

Thanks a lot! Do you remember the rough performance numbers for CRNN?

Sui

RandolphVI commented 5 years ago

@chensuim

On my dataset, the F1 score of CRNN is about 0.69.

I suggest you try the CRNN and SANN models, which, as far as I remember, performed well on my dataset.
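
For reference, here is a minimal sketch of how micro/macro F1 (and precision/recall) can be computed for multi-label predictions with scikit-learn, assuming a binary label-indicator matrix and a 0.5 threshold on sigmoid scores; the threshold and the toy arrays are illustrative, not the exact setup behind the 0.69 number above.

```python
# Minimal sketch: micro/macro metrics for multi-label predictions.
# `y_true` is a binary label-indicator matrix; `scores` stands in for
# sigmoid outputs from any of the models. The 0.5 threshold is an assumption.
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
scores = np.array([[0.9, 0.2, 0.7, 0.1],
                   [0.3, 0.8, 0.4, 0.2],
                   [0.6, 0.7, 0.1, 0.4]])

y_pred = (scores >= 0.5).astype(int)  # hypothetical decision threshold

print("precision (micro):", precision_score(y_true, y_pred, average="micro"))
print("recall    (micro):", recall_score(y_true, y_pred, average="micro"))
print("F1        (micro):", f1_score(y_true, y_pred, average="micro"))
print("F1        (macro):", f1_score(y_true, y_pred, average="macro"))
```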

Randolph

chensuim commented 5 years ago

@RandolphVI Thank you. Is your dataset open source? I tried them on my long-tail dataset, but I can only get an F1 of about 0.5.

Sui

RandolphVI commented 5 years ago

@chensuim

Sorry, my dataset is not open source. Have you tried padding the sentences to a fixed length? (I think that influences the results a lot.)

If your dataset consists almost entirely of English words and most sentences are longer than 200 words, I would suggest using an LSTM-based method rather than a CNN-based approach.
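
As a concrete illustration of the padding point, here is a minimal sketch of fixed-length padding/truncation for token-id sequences; the length of 200, `PAD_ID`, and the helper name `pad_to_length` are illustrative assumptions, not the repository's actual preprocessing code.

```python
# Minimal sketch: right-pad or truncate token-id sequences to a fixed length.
PAD_ID = 0  # assumed padding token id

def pad_to_length(token_ids, max_len=200, pad_id=PAD_ID):
    """Truncate or right-pad a list of token ids to exactly `max_len`."""
    if len(token_ids) >= max_len:
        return token_ids[:max_len]
    return token_ids + [pad_id] * (max_len - len(token_ids))

batch = [[5, 12, 7], list(range(1, 301))]   # lengths 3 and 300
padded = [pad_to_length(seq) for seq in batch]
assert all(len(seq) == 200 for seq in padded)
```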

chensuim commented 5 years ago

@RandolphVI No worries. I used padding. In my dataset, sequence length varies a lot (from 10 words to more than 300 words), and I pad everything to 200. I have already tried both LSTM and CNN. You are right: LSTM is better, but it still only reaches an F1 of about 0.5. I think that is the result of the long tail. Do you have any suggestions for dealing with long-tail data?

Sui

RandolphVI commented 5 years ago

@chensuim

In that case, in my opinion, to deal with the long tail you need to design a sampling strategy to rebuild a better dataset, since it is a problem with the data itself.

I dealt with a dataset just like yours, and my solution was to clean up the data 😂.
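
In case it helps, a minimal sketch of one possible label-frequency-based oversampling strategy for a long-tailed multi-label dataset follows; the `oversample` helper and its repeat heuristic are illustrative assumptions, not the cleaning procedure described above.

```python
# Minimal sketch: repeat examples whose rarest label is underrepresented.
from collections import Counter

def oversample(examples, labels, max_repeat=5):
    """Repeat each example according to the frequency of its rarest label.

    `examples` is any list; `labels` is a parallel list of label-id lists.
    """
    label_counts = Counter(l for labs in labels for l in labs)
    max_count = max(label_counts.values())
    out_x, out_y = [], []
    for x, labs in zip(examples, labels):
        rarest = min((label_counts[l] for l in labs), default=max_count)
        # Rare-label examples are repeated more often, capped at max_repeat.
        repeat = min(max_repeat, max(1, max_count // rarest))
        out_x.extend([x] * repeat)
        out_y.extend([labs] * repeat)
    return out_x, out_y

xs = ["doc a", "doc b", "doc c"]
ys = [[0], [0, 1], [2]]            # labels 1 and 2 are rarer than label 0
bigger_x, bigger_y = oversample(xs, ys)
assert len(bigger_x) > len(xs)     # rare-label documents were duplicated
```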

chensuim commented 5 years ago

@RandolphVI Sorry, I don't know how to clean the data. I thought the long tail was a general problem for all multi-label tasks.

Sui