Tencent / NeuralNLP-NeuralClassifier

An Open-source Neural Hierarchical Multi-label Text Classification Toolkit

Reproduce results from paper #22

Closed · awavefunction closed this issue 5 years ago

awavefunction commented 5 years ago

Thank you for making this code open and available to the community. It is easy to use. My issue is that I do not obtain the same results that you present in Table 4 of your paper.

When I train the TextCNN model on the RCV1 data set using the parameters you provide in config/train.json, I obtain a micro F1 score of 0.739. When I set hierarchical=false, I obtain a micro F1 score of 0.732. Your table shows a micro F1 score of 0.761 (hierarchical) and 0.737 (flat).

Similarly, when I train the TextRNN model using the default configuration file with model_name=TextRNN, I find a micro F1 score of 0.793 with hierarchical loss, and a micro F1 score of 0.792 without hierarchical loss. Your table shows a micro F1 score of 0.789 (hierarchical) and 0.755 (flat).

Are you able to directly reproduce Table 4 with the configuration in this repo, or is your configuration different (and if so, can you share it)?
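For reference, here is a minimal sketch of how I generate the flat and hierarchical variants of the config. It assumes a flat JSON layout with the model_name and hierarchical keys mentioned above; the real file may nest or name them differently.

```python
# Sketch: generate flat / hierarchical variants of the training config.
# Assumes a flat JSON config with "model_name" and "hierarchical" keys,
# as referenced above; adjust the key paths to match the actual file.
import copy
import json

with open("config/train.json") as f:
    base = json.load(f)

variants = {}
for model in ("TextCNN", "TextRNN"):
    for hierarchical in (True, False):
        cfg = copy.deepcopy(base)
        cfg["model_name"] = model           # assumed key name
        cfg["hierarchical"] = hierarchical  # assumed key name
        name = "%s_%s.json" % (model, "hier" if hierarchical else "flat")
        with open(name, "w") as out:
            json.dump(cfg, out, indent=2)
        variants[(model, hierarchical)] = name

print(variants)
```

Each generated file is then passed to the training script in place of the original config.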

awavefunction commented 5 years ago

I find that the code is not deterministic, i.e. the results vary from run to run, probably due to well-known sources of nondeterminism in PyTorch (GPU kernels and multi-worker data loading). I can obtain deterministic results if I set num_workers=1 and device=cpu with PyTorch version 1.2. With these settings, I obtain:

TextCNN on RCV1, hierarchical=true: F1 = 0.731
TextCNN on RCV1, hierarchical=false: F1 = 0.738
TextRNN on RCV1, hierarchical=true: F1 = 0.785
TextRNN on RCV1, hierarchical=false: F1 = 0.786

These numbers should be reproducible by anyone else who uses num_workers=1 and device=cpu, but the results still differ from the paper.
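For anyone trying to reproduce this, a minimal sketch of the determinism settings described above (the seed value and the commented DataLoader call are illustrative, not the toolkit's actual code):

```python
# Sketch of the determinism settings (PyTorch 1.2, CPU only).
# Seeding alone is not enough on GPU; cuDNN kernels and multi-worker
# data loading can still introduce run-to-run variation.
import random

import numpy as np
import torch

SEED = 42  # arbitrary fixed value

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)  # no-op on a CPU-only run

# If a GPU is used anyway, these reduce (but may not remove) nondeterminism.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# DataLoader settings matching the ones above: a single worker avoids
# nondeterministic interleaving of samples across worker processes.
# loader = torch.utils.data.DataLoader(dataset, batch_size=64,
#                                      shuffle=True, num_workers=1)
```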

coderbyr commented 5 years ago

Regarding why the results still differ from the paper:

1. The RCV1 dataset has different versions for different tasks, so you may be using a different one. In this paper, the training set has 23,149 instances and the test set has 781,264 instances.
2. The best TextCNN result was produced with the public 300-dimensional pre-trained token embeddings (https://nlp.stanford.edu/projects/glove/), as in other papers. You can try that.
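As a rough sketch of what point 2 involves, something like the following builds an embedding matrix from a GloVe file; the vocab mapping and the function are illustrative, not the toolkit's actual embedding-loading code.

```python
# Sketch: build an embedding matrix from the public GloVe 300d vectors
# (e.g. glove.6B.300d.txt from the link above).
# "vocab" is a hypothetical {token: index} mapping built from the dataset.
import numpy as np

EMBED_DIM = 300

def load_glove(path, vocab):
    """Return a (len(vocab), EMBED_DIM) matrix; unseen tokens stay random."""
    rng = np.random.RandomState(0)
    matrix = rng.uniform(-0.25, 0.25, (len(vocab), EMBED_DIM)).astype("float32")
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            token, values = parts[0], parts[1:]
            if token in vocab and len(values) == EMBED_DIM:
                matrix[vocab[token]] = np.asarray(values, dtype="float32")
    return matrix

# Example:
# vocab = {"<pad>": 0, "<unk>": 1, "market": 2, ...}
# embeddings = load_glove("glove.6B.300d.txt", vocab)
```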

ayushbits commented 4 years ago

@amulder are you able to verify the results from their paper? We are facing the same problem reproducing the results with TextCNN and TextRCNN.